Jim Vellenga logo

Palindromic Pangram

Data Structures


Fundamental Data Structures

Character Sets

The sample dictionary file furnished by ITA Software contains only lower-case letters. (The program checks to make sure that this is true for any other dictionary file furnished as input.) Thus one can represent a set of characters ("set" in the mathematical sense) as a vector of 26 bits.

In this implementation, a partial definition of the CharSet class is

class CharSet {
private:
  uint32 letterSet;
....

Similarly, a set of two-character sequences can be represented by a vector of CharSet:

class PairSet {

private:
  CharSet pairSet[PP_ABC_SIZE];
....

where PP_ABC_SIZE has been defined to have the value 26.

A Tree of Char Nodes

A number of my algorithms make heavy use of a tree of char nodes, which is two classes working together symbiotically, together with an iterator class that traverses the tree.

The character tree (class CharTree) is a tree of CharNode, where the principal elements of a CharNode are defined by

class CharNode;

typedef map CharNodeMap;
typedef CharNodeMap::iterator CharNodePtr;
typedef pair CharNodeRange;

class CharNode {
private:
  // depth -- distance of this node from the root node of the
  // tree, which has a depth of 0.
  uint16 d_depth;
public:
  // id -- the character that this node represents
  const char id;

  // Characteristics of this node.
private:
  CharFlags d_flags;

public:
  CharNode *parent;
  CharNodeMap children;
....

Storing the strings “an,” “and,” “ant,” and “ante’ in the tree results in the structure shown in the following diagram.

One can store strings both backwards and forwards in the tree.  This makes it easy to find all words beginning with a particular sequence of letters (such as “an”), or ending with a particular sequence of letters if the words are stored in reverse.

The children of a node are stored using the Standard Template Library (STL) “map” construct. However, since the character itself is stored in the node, I will eventually convert the tree to using the “set” construct instead.

A good hash table would provide a faster lookup for whole words, but the tree structure provides a relatively fast lookup for subsequences of words.  When the nodes are marked appropriately with “end of word” nodes, the tree structure also makes it easy to determine that “ante,” for example, begins with the characters of either “an” or “ant”.  This latter feature was used in looking for words that were composites of two or more other words.

Some usage statistics:

If I store the each word as a forward sequence relative to the root, the tree contains 378619 character nodes for either WORD_LST or WORDAI_LST (not yet subjected to Search Space Reduction).

If I store each word both as a forward sequence and as a reverse sequence, the number of character nodes rises to 846493 (more than double).