Search Space Reduction
Since the problem is almost certainly of exponential complexity, I sought to reduce the size of the search space by some preliminary checks, both for the original file WORD.LST and for the augmented file WORDAI.LST.
Substring Reversibility
An observation: If a character sequence appears in a palindrome, its reverse must also appear. For example, in “lid off a daffodil”, because “offada” appears in the forward direction, “adaffo” must also appear. The sequence and its reverse may or may not be overlapping.
This means that we can eliminate a word from the word list if it contains a sequence that cannot be constructed in reverse. For example, the sequence “auv” occurs in the sample word list in words such as “fauve” or “mauve”. However, no words in the word list contain the sequence “vua”. Nor can “vua” be constructed from a juxtaposition of two words from the word list, since no words end in “vu” (in this list) and no words begin with “ua”. For this reason, we can eliminate the words “fauve” and “mauve” from the set of words that can appear in a palindrome.
One- and Two-Letter Sequences
To check for reversibility, the program first finds all distinct one- and two-letter sequences that can appear at the beginning or end of phrases. For the two-letter sequences, the native pairs are those that occur at the beginning or end of words in the list, while constructed pairs also include one-letter words followed by a legal initial letter or preceded by a legal final letter.
File WORDAI.LST contains 2 one-letter words.
No. of distinct initial chars: 26
No. of distinct final chars: 26
No. of distinct letters in file: 26
Out of a possible 676 two-letter sequences...
No. of distinct two-letter words: 96
No. of native initial pairs: 318
No. of native final pairs: 361
No. of constructed initial pairs: 324
No. of constructed final pairs: 363
Three-Letter Sequence Checks
The next step was to find all three-letter sequences that were found natively in words or could be constructed from the beginnings and endings of words, and then eliminate those words that contained three-letter sequences for which the reverse could not be constructed. The results for WORDAI.LST:
Out of a possible 17576 three-letter sequences...
No. of contiguous triples: 7047
No. of constructed triples: 12317
After removing nonreversible triples...
Number of reversible triples: 10772
Now removing words containing nonreversible triples...
Word list now contains 165857 words.
This a 4.4% reduction in the number of words that can appear.
Four-Letter Sequence Checks
For four-letter sequences, the program first constructs a character tree containing all the four-letter sequences that appear in single words.
It then constructs the sets of three-letter sequences, initTriples and finalTriples, containing the starting and ending triples of all words with three or more letters. It also uses the one- and two-letter initial and final sequences found above.
Then, in order to add constructible quads to the tree, the program
- adds to initTriples:
- one-letter words followed by initial pairs.
- two-letter words followed by initial characters.
- adds to finalTriples
- final characters followed by two-letter words
- final pairs followed by one-letter words
- adds to the character tree as forward sequences
- final characters followed by members of initTriples
- final pairs followed by initial pairs
- members of finalTriples followed by initial characters
- invalidates each four-letter sequence that is not reversible (that is, for which the reverse sequence is not also in the character tree
- removes from the word list any words that contain a nonreversible four-letter sequence.
Results for WORDAI.LST:
Now removing words containing nonreversible quads...
Word list now contains 138244 words.
Omitting Composite Words
We can also reduce the size of the dictionary considerably by removing composite words. We can do this because any palindrome that contains the word “mahogany,” for example, can be constructed equally well with the three-word sequence “ma hog any.” This is particularly powerful for the words in the dictionary furnished as part of the problem because WORD.LST contains such unusual “words” as “es” and “ya,” which do not occur as such in most standard English-language dictionaries.
Effects of Word List Reductions
Initially:
|
File name |
WORDAI.LST |
WORD.LST |
|
Number of words in list |
173530 |
173528 |
|
Percent of original number |
100% |
100% |
|
Maximum word length |
28 |
29 |
|
Average word length |
9.1 |
9.1 |
|
Most frequent word length |
8 |
8 |
After removing nonreversible 3-letter sequences:
|
File name |
WORDAI.LST |
WORD.LST |
|
Number of words in list |
165857 |
165855 |
|
Percent of original number |
96% |
96% |
|
Maximum word length |
27 |
27 |
|
Average word length |
9.1 |
9.0 |
|
Most frequent word length |
8 |
8 |
After removing nonreversible 3-letter and 4-letter sequences:
|
File name |
WORDAI.LST |
WORD.LST |
|
Number of words in list |
138249 |
138202 |
|
Percent of original number |
80% |
80% |
|
Maximum word length |
25 |
25 |
|
Average word length |
8.9 |
8.9 |
|
Most frequent word length |
8 |
8 |
After removing nonreversible 3-letter and 4-letter sequences, as well as composites:
|
File name |
WORDAI.LST |
WORD.LST |
|
Number of words in list |
54245 |
60843 |
|
Percent of original number |
31% |
35% |
|
Maximum word length |
22 |
22 |
|
Average word length |
8.2 |
8.3 |
|
Most frequent word length |
8 |
8 |
A copy of the reduced word list for the original file, WORD.LST, can be found as ReducedWordList.txt. I have done a certain amount of testing of the algorithms and inspection of the results, but let me know if you find any mistakes.
Last updated: 22 July 2009.