Five Letter Words: Difference between revisions

Latest revision as of 05:04, 20 June 2026

Hit me with the word list, daddy-O! https://git.charlesreid1.com/cs/five-letter-words/raw/branch/master/sgb-words.txt

What Is It

Five letter words is a set of (surprise!) five letter words, created by Donald Knuth as part of the Stanford Graph Base. This set of words contains 5757 common five-letter words, which meet the following criteria:

no proper nouns
no punctuation, hyphens, or accent marks
no extremely rare words that would only be useful to Scrabble players

Code

Several exercises from Art of Computer Programming ask us to manipulate/analyze the five letter words in various ways. Volume 4 exercises 26 through 37 are recommended by Knuth as warm-up exercises for interacting with and getting to know the five-letter-words list.

Code for each of these exercises is contained in the repository here: https://git.charlesreid1.com/cs/five-letter-words

These are not extraordinarily difficult problems (each took less than 10 minutes to implement), but some of them do take a while to run, and a few get more complicated (need to utilize Algorithms/Dynamic Programming techniques).

In the text, Knuth also mentions letter coverage - finding the number of five-letter words that can cover the first N letters of the alphabet. This is a more complicated task that requires some dynamic programming. See Letter Coverage.

Utility

get_words.py

A simple utility module that loads the SGB words from sgb-words.txt and returns them as a Python list of strings. All other scripts import get_words() from this module. It reads the file, splits on newlines, and drops the trailing \n from each line.

Warm-Up Exercises (AOCP Volume 4, Exercises 26–37)

These scripts correspond to Knuth's suggested warm-up exercises for working with the word list.

distinct.py (Exercise #27)

Question: How many SGB words contain exactly k distinct letters, for 1 ≤ k ≤ 5?

Approach: Iterates over all words, converts each to a set of characters, and bins them by the size of that set (1 through 5). Prints the count and first five examples for each k.

Words with 1 distinct letter: 0 (no five-letter words use the same letter five times)
Words with 5 distinct letters: the majority of the list

diff_by_one_fixed.py (Exercise #28)

Question: Find all pairs of SGB words that differ by exactly one letter in each position (e.g., "rover" and "spuds": r→s, o→p, v→u, e→d, r→s — each position shifts by exactly 1).

Approach: Uses recursive backtracking to generate all candidate strings that are ±1 away from a given word at each letter position, then checks which candidates exist in the word set using a hash table. There are 38 such pairs in the SGB.

diff_by_n_fixed.py (Exercise #28 Variation)

Generalizes diff_by_one_fixed.py to words that differ by a distance of ±d at each letter position, for arbitrary d. Uses the same backtracking-and-hash-lookup strategy, corrected from an earlier buggy approach.

palindromes.py (Exercise #29)

Question: Find all SGB words that are palindromes (e.g., "kayak"), and all palindrome pairs (two words that are reverses of each other, e.g., "regal" and "lager").

Approach: Defines is_palindrome() — checks that positions 0=4 and 1=3 match. Defines is_palindrome_pair() — checks that word1 reversed equals word2. Iterates over all words (and all pairs) to collect matches. Palindromes are excluded from palindrome pairs (a word is not a "pair" with itself).

lexico.py (Exercise #30)

Question: Find all SGB words whose letters appear in strictly increasing lexicographic order (a ≤ b ≤ c ≤ d ≤ e), like the word "first". Identify the first and last such words alphabetically.

Approach: Uses all(a <= b for a, b in zip(word, word[1:])) to check each word. Prints all matching words, the count, and the first/last alphabetically.

euclidean_distance.py (Exercise #31)

Question: Compute the Euclidean distance between five-letter words, treating each letter as a coordinate (a=0, b=1, ..., z=25).

Approach: Converts words to 5-dimensional vectors via word2vec(), then computes the L2 norm (square root of sum of squared differences). Randomly samples 100 word pairs, sorts by distance, and prints the farthest pairs. This reveals which words are "farthest apart" in letter space (e.g., "zyzzy" vs. "aaaaa").

stats.py

Question: Compute letter and bigram frequency statistics across all 5,757 words.

Approach: Uses Python's collections.Counter to tally single-letter frequencies (how often each letter a–z appears across all positions) and bigram frequencies (how often each adjacent pair of letters appears). Results are sorted from least to most frequent and printed. Useful for understanding the distribution of letters in common English five-letter words — e.g., 'e' is the most common letter, 'q' and 'z' are rare.

tries.py (Exercise #35)

Question: Which letters of the alphabet can serve as the root of a complete binary trie within WORDS(n) — that is, a trie (prefix tree) where every path of length 5 from root to leaf represents an SGB word, and every internal node has at least 2 child branches?

Approach: Implements a TryTrieTree class with recursive assembly (top-down, checking whether enough words exist at each depth to sustain a binary tree) and a bubble-up pass (bottom-up, pruning branches with fewer than 2 words). Then tests each letter a–z as a root. Also includes trie_table(), which computes how the number of "perfect tries" grows as n (number of words considered) increases from 1000 to 5757. Knuth's Volume 4 discusses this in detail as a combinatorial structure over the word set.

Variations (Extensions Beyond the Exercises)

These scripts extend or vary the warm-up exercises.

diff_by_n.py

A cleaned-up and generalized version of the "diff by one" exercise. Finds all pairs of SGB words whose letters differ by ±d at each position, for d = 1, 2, 3. Uses recursive backtracking to generate candidate words and a set-based lookup. Reports the number of pairs found and the time taken for each distance.

near_palindromes.py (Exercise #29 Variation)

Question: Find words that are near-palindromes — words whose first/last and second/fourth letters are within a small edit distance of each other, even if they are not exact palindromes.

Approach: Computes the sum of absolute differences between positions (0,4) and (1,3). Words with a total distance ≤ 1.0 are classified as near-palindromes. This catches words like "about" (a/t differ by 19, b/u differ by 19 — not near) versus words where pairs are only one or two away.

reverse_lexico.py (Exercise #30 Variation)

Question: Find all SGB words whose letters appear in strictly decreasing lexicographic order (a ≥ b ≥ c ≥ d ≥ e), like the word "spied".

Approach: Mirrors lexico.py but uses all(a >= b for a, b in zip(word, word[1:])). Prints all reverse-sorted words, the count, and the first/last alphabetically.

Advanced

letter_coverage.py

Question: What is the minimum number of five-letter words needed to cover the first N letters of the alphabet? (Every letter a, b, c, ... up to the N-th must appear in at least one chosen word.)

Approach: This is a dynamic programming problem, solved in O(W²) time for W words. Each word is converted to a boolean bit vector of length 26 (or N). The DP iterates over words, at each step considering whether combining the current word with a prior-word's best coverage yields better coverage (more letters covered, or same coverage with fewer words). A backtrack array reconstructs the optimal word sequence. This problem is nontrivial — Knuth mentions results about it in TAOCP. See Letter Coverage for a deeper treatment.

Automation

run_all.sh

A bash script that runs all of the above Python scripts in sequence and saves their output to an output/ directory (one .txt file per script). Useful for regenerating all results at once. The scripts run are: diff_by_n, distinct, euclidean_distance, letter_coverage, lexico, near_palindromes, palindromes, reverse_lexico, stats, and tries.

Flags

@@ Line 17: / Line 17: @@
 In the text, Knuth also mentions letter coverage - finding the number of five-letter words that can cover the first N letters of the alphabet. This is a more complicated task that requires some dynamic programming. See [[Letter Coverage]].
+===Utility===
+====get_words.py====
+A simple utility module that loads the SGB words from <code>sgb-words.txt</code> and returns them as a Python list of strings. All other scripts import <code>get_words()</code> from this module. It reads the file, splits on newlines, and drops the trailing <code>\n</code> from each line.
+===Warm-Up Exercises (AOCP Volume 4, Exercises 26–37)===
+These scripts correspond to Knuth's suggested warm-up exercises for working with the word list.
+====distinct.py (Exercise #27)====
+'''Question:''' How many SGB words contain exactly ''k'' distinct letters, for 1 ≤ k ≤ 5?
+'''Approach:''' Iterates over all words, converts each to a set of characters, and bins them by the size of that set (1 through 5). Prints the count and first five examples for each ''k''.
+* Words with 1 distinct letter: 0 (no five-letter words use the same letter five times)
+* Words with 5 distinct letters: the majority of the list
+====diff_by_one_fixed.py (Exercise #28)====
+'''Question:''' Find all pairs of SGB words that differ by exactly one letter in each position (e.g., "rover" and "spuds": r→s, o→p, v→u, e→d, r→s — each position shifts by exactly 1).
+'''Approach:''' Uses recursive backtracking to generate all candidate strings that are ±1 away from a given word at each letter position, then checks which candidates exist in the word set using a hash table. There are '''38''' such pairs in the SGB.
+====diff_by_n_fixed.py (Exercise #28 Variation)====
+Generalizes <code>diff_by_one_fixed.py</code> to words that differ by a distance of ±''d'' at each letter position, for arbitrary ''d''. Uses the same backtracking-and-hash-lookup strategy, corrected from an earlier buggy approach.
+====palindromes.py (Exercise #29)====
+'''Question:''' Find all SGB words that are palindromes (e.g., "kayak"), and all palindrome pairs (two words that are reverses of each other, e.g., "regal" and "lager").
+'''Approach:''' Defines <code>is_palindrome()</code> — checks that positions 0=4 and 1=3 match. Defines <code>is_palindrome_pair()</code> — checks that word1 reversed equals word2. Iterates over all words (and all pairs) to collect matches. Palindromes are excluded from palindrome pairs (a word is not a "pair" with itself).
+====lexico.py (Exercise #30)====
+'''Question:''' Find all SGB words whose letters appear in strictly increasing lexicographic order (a ≤ b ≤ c ≤ d ≤ e), like the word "first". Identify the first and last such words alphabetically.
+'''Approach:''' Uses <code>all(a <= b for a, b in zip(word, word[1:]))</code> to check each word. Prints all matching words, the count, and the first/last alphabetically.
+====euclidean_distance.py (Exercise #31)====
+'''Question:''' Compute the Euclidean distance between five-letter words, treating each letter as a coordinate (a=0, b=1, ..., z=25).
+'''Approach:''' Converts words to 5-dimensional vectors via <code>word2vec()</code>, then computes the L2 norm (square root of sum of squared differences). Randomly samples 100 word pairs, sorts by distance, and prints the farthest pairs. This reveals which words are "farthest apart" in letter space (e.g., "zyzzy" vs. "aaaaa").
+====stats.py====
+'''Question:''' Compute letter and bigram frequency statistics across all 5,757 words.
+'''Approach:''' Uses Python's <code>collections.Counter</code> to tally single-letter frequencies (how often each letter a–z appears across all positions) and bigram frequencies (how often each adjacent pair of letters appears). Results are sorted from least to most frequent and printed. Useful for understanding the distribution of letters in common English five-letter words — e.g., 'e' is the most common letter, 'q' and 'z' are rare.
+====tries.py (Exercise #35)====
+'''Question:''' Which letters of the alphabet can serve as the root of a ''complete binary trie'' within WORDS(n) — that is, a trie (prefix tree) where every path of length 5 from root to leaf represents an SGB word, and every internal node has at least 2 child branches?
+'''Approach:''' Implements a <code>TryTrieTree</code> class with recursive assembly (top-down, checking whether enough words exist at each depth to sustain a binary tree) and a bubble-up pass (bottom-up, pruning branches with fewer than 2 words). Then tests each letter a–z as a root. Also includes <code>trie_table()</code>, which computes how the number of "perfect tries" grows as ''n'' (number of words considered) increases from 1000 to 5757. Knuth's Volume 4 discusses this in detail as a combinatorial structure over the word set.
+===Variations (Extensions Beyond the Exercises)===
+These scripts extend or vary the warm-up exercises.
+====diff_by_n.py====
+A cleaned-up and generalized version of the "diff by one" exercise. Finds all pairs of SGB words whose letters differ by ±''d'' at each position, for ''d'' = 1, 2, 3. Uses recursive backtracking to generate candidate words and a set-based lookup. Reports the number of pairs found and the time taken for each distance.
+====near_palindromes.py (Exercise #29 Variation)====
+'''Question:''' Find words that are ''near-palindromes'' — words whose first/last and second/fourth letters are within a small edit distance of each other, even if they are not exact palindromes.
+'''Approach:''' Computes the sum of absolute differences between positions (0,4) and (1,3). Words with a total distance ≤ 1.0 are classified as near-palindromes. This catches words like "about" (a/t differ by 19, b/u differ by 19 — not near) versus words where pairs are only one or two away.
+====reverse_lexico.py (Exercise #30 Variation)====
+'''Question:''' Find all SGB words whose letters appear in strictly ''decreasing'' lexicographic order (a ≥ b ≥ c ≥ d ≥ e), like the word "spied".
+'''Approach:''' Mirrors <code>lexico.py</code> but uses <code>all(a >= b for a, b in zip(word, word[1:]))</code>. Prints all reverse-sorted words, the count, and the first/last alphabetically.
+===Advanced===
+====letter_coverage.py====
+'''Question:''' What is the minimum number of five-letter words needed to cover the first ''N'' letters of the alphabet? (Every letter a, b, c, ... up to the ''N''-th must appear in at least one chosen word.)
+'''Approach:''' This is a '''dynamic programming''' problem, solved in O(''W''²) time for ''W'' words. Each word is converted to a boolean bit vector of length 26 (or ''N''). The DP iterates over words, at each step considering whether combining the current word with a prior-word's best coverage yields better coverage (more letters covered, or same coverage with fewer words). A backtrack array reconstructs the optimal word sequence. This problem is nontrivial — Knuth mentions results about it in TAOCP. See [[Letter Coverage]] for a deeper treatment.
+===Automation===
+====run_all.sh====
+A bash script that runs all of the above Python scripts in sequence and saves their output to an <code>output/</code> directory (one <code>.txt</code> file per script). Useful for regenerating all results at once. The scripts run are: <code>diff_by_n</code>, <code>distinct</code>, <code>euclidean_distance</code>, <code>letter_coverage</code>, <code>lexico</code>, <code>near_palindromes</code>, <code>palindromes</code>, <code>reverse_lexico</code>, <code>stats</code>, and <code>tries</code>.
 ==Flags==