Jim Vellenga logo

Palindromic Pangram

Data Analysis


Characteristics of the Data

I started by gathering characteristics of the data, beginning with the file WORD.LST furnished by ITA Systems.

First-Order Characteristics

The first step was a single pass through all the words in the file, encapsulated in the function SanityCheck().  It produced the following information:

WORD.LST contains 173528 words.

The shortest word has 2 letters.

  The first shortest word is "aa".

The longest word has 28 letters.

  The first longest word is "ethylenediaminetetraacetates".

The number of words containing an unexpected character is 0.

Auxiliary Word List

Observation:  There are no 1-letter words in WORD.LST, which makes it impossible to construct the following palindrome:

  lid off a daffodil

Accordingly, I also derived a second file WORDAI.LST, which adds the words “a” and “i” to the words from WORD.LST, to test that the program can also handle dictionaries with single-letter words.
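To confirm that a phrase such as the one above really is a palindrome once spaces are ignored, a small helper along these lines could be used. The name IsPalindromePhrase is hypothetical and is not part of the program described here; it assumes only letters matter and that case is ignored.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns true if the letters of `phrase`, ignoring case and any
// non-letter characters, read the same in both directions.
bool IsPalindromePhrase(const std::string& phrase) {
    std::string letters;
    for (unsigned char c : phrase) {
        if (std::isalpha(c)) letters.push_back(static_cast<char>(std::tolower(c)));
    }
    // Compare the first half with the reversed second half.
    return std::equal(letters.begin(), letters.begin() + letters.size() / 2,
                      letters.rbegin());
}
```

With this helper, "lid off a daffodil" is accepted, while the same phrase without the word "a" is rejected, which is why a dictionary lacking 1-letter words cannot produce it.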

The SanityCheck() function for this file produced:

WORDAI.LST contains 173530 words.

The shortest word has 1 letters.

  The first shortest word is "a".

The longest word has 28 letters.

  The first longest word is "ethylenediaminetetraacetates".

The number of words containing an unexpected character is 0.

Upper Bound on String Storage

From the figures above, an upper bound on the memory needed to store every word from the dictionary is 29 bytes (28 letters plus a 0 byte to terminate each string) * 173530 words = 5032370 bytes, which is about 5 MB.  This is well within the address space of a 32-bit computer; indeed, mine has 2 GB of RAM.

The actual size of the WORD.LST file is 1749989 bytes, for an average of about 10 bytes per word including the line terminator, or about 9 letters per word.  (The letter counts reported below total 1576461, which together with one terminator byte for each of the 173528 words accounts for the file size exactly.)

Some online word lists have upwards of 600,000 words or phrases, which at about 10 characters per word comes to about 6 MB of storage.

Accordingly, although the first version of the software read the file of words and discarded them, it now reads the words once and stores them in a vector<string>.

Letter Frequencies

I added a check to print out the letter frequencies.  For WORD.LST, these are:

Reporting character frequencies...

Char e occurs 181468 times.

Char s occurs 149998 times.

Char i occurs 141877 times.

Char a occurs 119374 times.

Char r occurs 111514 times.

Char n occurs 107873 times.

Char t occurs 105292 times.

Char o occurs 103921 times.

Char l occurs 83709 times.

Char c occurs 64383 times.

Char d occurs 53326 times.

Char u occurs 51495 times.

Char p occurs 46306 times.

Char m occurs 44744 times.

Char g occurs 42577 times.

Char h occurs 36648 times.

Char b occurs 29017 times.

Char y occurs 25697 times.

Char f occurs 19455 times.

Char v occurs 15396 times.

Char k occurs 13433 times.

Char w occurs 11774 times.

Char z occurs 7511 times.

Char x occurs 4623 times.

Char q occurs 2549 times.

Char j occurs 2501 times.
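A frequency report like the one above could be produced along the following lines. This is a sketch under assumptions, not the check actually used: the name LetterFrequencies is hypothetical, only the lowercase letters a to z are counted, and ties are left in alphabetical order.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Counts each lowercase letter across all words and returns the letters
// that occur at least once, most frequent first.
std::vector<std::pair<char, std::size_t>> LetterFrequencies(
        const std::vector<std::string>& words) {
    std::array<std::size_t, 26> counts{};  // zero-initialized tallies
    for (const std::string& word : words) {
        for (char c : word) {
            if (c >= 'a' && c <= 'z') ++counts[c - 'a'];
        }
    }
    std::vector<std::pair<char, std::size_t>> result;
    for (int i = 0; i < 26; ++i) {
        if (counts[i] > 0) result.emplace_back(static_cast<char>('a' + i), counts[i]);
    }
    // A stable sort keeps alphabetical order among letters with equal counts.
    std::stable_sort(result.begin(), result.end(),
                     [](const std::pair<char, std::size_t>& a,
                        const std::pair<char, std::size_t>& b) {
                         return a.second > b.second;
                     });
    return result;
}
```

Each returned pair can then be printed in the "Char e occurs 181468 times." format used in the report.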