POTM: MOO

This is a "journal" of my attempts at tackling the "MOO" problem, a
variant of Bulls and Cows.  I put "journal" in quotes because I wrote
most of it after doing most of the work, not as I went, so dates and
data are not as complete as I would like.

The problem is at:  http://members.tripod.com/~POTM

---------------------------------------------------

First thoughts: How can we use English frequency data?  How can we
even tackle the problem?

Clearly the optimal strategy is to split the search space maximally.
(Well, unless you want to win by luck and allow your search tree to be
biased; and of course you might well want your search tree biased in
favor of English words and against 'AAAAAAAAA'; but let's set that
aside as a refinement on the general question.)

Optimal splitting of the search space is probably way too hard to
compute.  We can precompute an optimal "static split", wherein we use
the same guesses regardless of the results.  This is optimal in the
sense that it is the best for any attack that ignores the feedback.
(At least during this "opening" game.)

I wrote a random list of nine words, and fed in only the 4-9 letter
words from my /usr/dict/words.  After one tweak, this partitioned most
of /usr/dict/words uniquely--17,000 singletons, and fewer than 1000
doubletons and triplets.  (A doubleton is two words which get the
exact same feedback from every guess.)

However, I then began to suspect this isn't really very useful if
you're not allowed to use a dictionary when guessing; you'll have
split up the space of "real words", but not the space of "candidate
words my program will end up pursuing"--and the latter is what you
need to optimize for.  [Some of this insight came much later; or
rather, such a clear explanation of what really matters came later.]

---------------------------------------------------

Dan "dfan" Schmidt immediately wrote a program to referee a game and
let *him* play it, so he could devise strategies.  I instead decided
to explore doing something humans can't easily do: deep tree search.

It might be the case that, given the current position, only one string
will validly match all the preceding guesses.  A simple recursive
routine like this will find it:

   int build(char *str, int dist)
   {
      int i, res;
      if (dist == length)
         return is_valid(str);
      for (i='A'; i <= 'Z'; ++i) {
         str[dist] = i;
         res = build(str, dist+1);
         if (res) return res;
      }
      return FALSE;
   }

Furthermore, I can just throw this at the entire problem.  However, an
initial guess of 'AAAAAAAAA' isn't going to help much.  On the other
hand, I can use those starting words from my other list.

This led to a program called "buzzard2.c", which made a 2-word
starting guess and then applied the above.  It was slow, but it was
using way fewer guesses than the programs my friends were writing at
the same time (except for somebody who wrote a program that used a
dictionary, just to see how efficient that could be).

---------------------------------------------------

On observing some stupidity (with only two static guesses, buzzard2
eventually settled on STA_E while searching for STATE, and then
proceeded to guess every letter alphabetically for the _, excepting
the few letters from the opening book that it had determined didn't
belong), I expanded the opening book to 6 words--sufficient to guess
the entire alphabet for all but 4-letter words.  This made buzzard6
converge faster.
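Since everything that follows leans on the exact feedback semantics,
here is a minimal sketch--a reconstruction, not the contest's actual
referee code--of what a hit and a miss are, and of the is_valid()
check that build() relies on: a hit is a guess letter that matches the
hidden word at the same position, and a miss is a guess position whose
letter appears nowhere in the hidden word.

   /* Sketch only: the real referee and data structures may differ. */
   #include <string.h>

   #define MAXGUESS 64

   static int  length;                      /* word length for this game */
   static int  num_guesses;
   static char past_guess[MAXGUESS][16];
   static int  past_hits[MAXGUESS], past_misses[MAXGUESS];

   static void score(const char *word, const char *guess,
                     int *hits, int *misses)
   {
      int i;
      *hits = *misses = 0;
      for (i=0; i < length; ++i) {
         if (guess[i] == word[i])
            ++*hits;                         /* right letter, right place */
         else if (memchr(word, guess[i], length) == NULL)
            ++*misses;                       /* letter nowhere in the word */
      }
   }

   static int is_valid(const char *candidate)
   {
      int g, h, m;
      for (g=0; g < num_guesses; ++g) {
         score(candidate, past_guess[g], &h, &m);
         if (h != past_hits[g] || m != past_misses[g])
            return 0;                        /* contradicts known feedback */
      }
      return 1;
   }

With that in hand, the "all-miss" and "0-hit" guesses used by the
pruning below have precise meanings.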
The bigger book also made it likely that some opening book words would
score all-misses, so some simple pruning in build() sped buzzard6 up a
lot over buzzard2:

   int build(char *str, int dist)
   {
      int i, res;
      if (dist == length)
         return is_valid(str);
      for (i='A'; i <= 'Z'; ++i) {
         if (REJECT(i)) continue;
         if (DISALLOW(i, dist)) continue;
         str[dist] = i;
         res = build(str, dist+1);
         if (res) return res;
      }
      return FALSE;
   }

REJECT handles letters that appeared in all-miss guesses; DISALLOW
handles letters that appeared in that position in 0-hit guesses.

Still, buzzard6 was pretty slow.  Here is some test data I collected,
allowing it to run as long as 15 minutes on a P2-400:

   BABY        11    (this had been 10 on buzzard2, I believe)
   STATE       11
   CHORAL      11
   ENVIOUS     13
   ACQUIRES
   HIERARCHY
   FLOOD       12
   RHYTHM      13
   STEROID     14
   QUIZZES
   DANGEROUS
   SYZYGY      12

Pretty rocking!  This was kicking ass over my friends' programs.  Of
course, it was too slow, but I was pretty sure I could write clever
pruning code and get the speed back without changing the numbers at
all.  (If not, it could always punt out early with an incomplete
guess, hopefully narrowing the search space for the next iteration.)

Here's the opening book from buzzard6, comment intact:

   char *tables[6][20] =
   {
      // try to improve these tables genetically or simulated annealing or something
      { "ETAO", "INSH", "RDLU", "MPCG", "WYBF", "JKQV", "XZZZ" },
      { "ETAOI", "NSHRD", "LUMPC", "GWYBF", "JKQVX", "ZTNDG" },
      { "ETAOIX", "NSHRDX", "LUMPCZ", "GWYBFX", "JKQVXX", "ZTNDZZ" },
      { "ETAOXXX", "INSHZZZ", "RDLUQQQ", "XXXMPCG", "ZZZWYBF", "QQQQJKV" },
      { "ETAOXXXX", "INSHZZZZ", "RDLUQQQQ", "XXXXMPCG", "ZZZZWYBF", "QQQQQJKV" },
      { "ETAOXXXXX", "INSHZZZZZ", "RDLUQQQQQ", "XXXXXMPCG", "ZZZZZWYBF", "QQQQQQJKV" },
   };

You don't want 9 unique letters per guess with 9-letter words, because
the odds are high that at least one letter won't miss, and that means
REJECT pruning won't occur.

---------------------------------------------------

Well, it's still probably too large a search space to find the optimal
opening book of static guesses, but I wanted to refine it, and I
decided one refinement would be trying to increase the chance of
getting hits in the opening book.  This led me to the idea of
collecting letter frequency stats by position.  (I could and perhaps
should have collected letter frequency by position by word length, but
I didn't feel like dealing with that complication.)  I collected stats
like this:

   a:  1153  2150  1403  1093  1150  1182  1028   549   191    27
   b:  1018   137   363   311   314   198   117   105     5     0
   c:  1649   302   779   758   549   497   325   252   168    79
   [etc.]

Then I started messing around with building a book that better fit
these frequencies, but set it aside when I realized how easily they
would fit into the build() logic: instead of iterating from A..Z,
iterate in order of letter frequency *for that position*.  I took the
above table, sorted it on a particular column, then went in by hand
and concatenated all of the letters.  Because I wasn't really
thinking, the sort ended up running from least frequent to most
frequent, which is why the frequency table code in my finished program
is still "backwards" as well.

This changed buzzard6's stats, so it was time to give it a new name
for the page of results I was maintaining for my friends; I called it
"deepmoo", drawing a parallel with chess-playing programs: an opening
book plus deep tree search, combined with an inhuman ability to draw
conclusions from all the previous guesses.

Finally, I undertook tree pruning.
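(Before getting to the pruning: the per-position ordering just
described amounts to replacing build()'s A..Z loop with a lookup into
a precomputed order table.  A minimal sketch, with names of my own and
sorted most-frequent-first--the table in my real program ended up
backwards, as noted:)

   #include <stdlib.h>

   #define MAXLEN 9                 /* MOO words are 4-9 letters */

   static int  freq[MAXLEN][26];    /* letter counts by position, as above */
   static char order[MAXLEN][26];   /* letters to try at each position */

   static int cmp_pos;              /* which position the comparator uses */
   static int cmp_letters(const void *a, const void *b)
   {
      int la = *(const char *)a - 'A', lb = *(const char *)b - 'A';
      return freq[cmp_pos][lb] - freq[cmp_pos][la];  /* most frequent first */
   }

   static void build_orders(void)
   {
      int pos, i;
      for (pos=0; pos < MAXLEN; ++pos) {
         for (i=0; i < 26; ++i)
            order[pos][i] = 'A' + i;
         cmp_pos = pos;
         qsort(order[pos], 26, 1, cmp_letters);
      }
   }

   /* build() then does:
         for (k=0; k < 26; ++k) { i = order[dist][k]; ... }
      instead of:
         for (i='A'; i <= 'Z'; ++i) { ... }                      */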
The crucial problem with the search as I saw it at this point was that
it would merrily descend into the tree even though a given prefix was
already doomed to violate some basic property required by the known
guess information.  For example, if a prior nine-letter guess got 7
hits, then a tentative solution with no hits in its first three
letters can't be a real solution.  So I wrote a function called
'score_partial_guess', and then added logic to decide whether a prefix
could be pruned out.

Nonetheless, all the pruning logic written at this point was based on
individual past guesses considered apart from the others; the deep
tree search produced inferences across multiple guesses, but none of
the pruning did.  That was somewhat unsatisfying, but this pruning
(and then replacing score_partial_guess with a very convoluted
incremental system, which kept running hit/miss counts for each old
guess against the current partial word and updated them in O(1) as
each new letter was added) gave me enough of a speedup.  (I don't have
the original deepmoo scores, but they're pretty similar to the ones
below.)

---------------------------------------------------

Also around this time (maybe before it), I decided to try to put in
trigraph support.  I had collected trigraph stats earlier (even gave
them away to my friends), but hadn't tried to use them.  So I hooked
them up.  I basically just used them as a binary test: if the current
letter to be added made a trigraph that ever appeared, ok; if not,
save it for last at this branch of the tree.

The problem with this approach is that it made the trigraphs
meaningful near the top of the tree (the left of the word), but not to
the right; it would happily explore words with total gobbledygook on
the right before fully exploring the non-gobbledygook.  Also, it
worsened my average guess score (I had 40 7-letter words I used as my
testbed) by about 0.4, so I didn't use them.  Later it turned out this
was due to a bug; I never actually bothered studying the code or the
output carefully to make sure it was doing the right thing, since that
would have required me to consult an actual triplet table.

---------------------------------------------------

I was studying the output of deepmoo's guesses on a particular word,
and was a little surprised at an inference it had made.  A little
thought led me to a wondrous revelation about something you can do
with hits and misses, as opposed to bulls/cows (i.e. mastermind-style
black/white).  Consider a guess like:

   "EEEEAAO"

You can read the number of misses as a binary number that tells you
exactly which of E, A, and O appear in the word, with no ambiguity!
(E.g. 5 misses = 4 misses for the 'E' and 1 miss for the 'O'.)

Double whammy: (a) I didn't have to lengthen my opening book much,
because my 7+ letter book words already had only 4 unique letters each
anyway.  (b) deepmoo didn't require any special "binary counting"
logic; I just updated the opening book, and the perfect-deep-search
technology did its magic!  This significantly improved my results on
7+ letter words.

---------------------------------------------------

Time for better pruning.  Now I could go through and accurately
determine that certain letters were definitely not present, or
definitely were present.  (This was all in the pruning logic and
didn't affect the search results.)  Rather than use binary math, I
just wrote some code to solve a general integer packing problem, and
used that to evaluate whether a given letter was or was not present.
(Actually, this isn't quite true; as I recall, I only tested one of
these initially.)
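Here is a sketch of the kind of partial-word test score_partial_guess
made possible, using just the hit-count bound from the example above;
the struct and names are illustrative, and as described, the real code
eventually kept these counts incrementally instead of recomputing them
for every prefix.

   /* One recorded guess and its referee feedback (illustrative). */
   struct old_guess {
      char text[16];
      int  hits;          /* positional matches reported */
      int  misses;        /* guess letters reported absent from the word */
   };

   /* Can a word starting with str[0..dist-1] still satisfy the recorded
      hit count for this guess?  Returns 0 if the prefix is doomed. */
   static int prefix_feasible(const char *str, int dist, int length,
                              const struct old_guess *g)
   {
      int i, hits_so_far = 0;
      for (i=0; i < dist; ++i)
         if (str[i] == g->text[i])
            ++hits_so_far;
      if (hits_so_far > g->hits)                    /* too many hits already */
         return 0;
      if (hits_so_far + (length - dist) < g->hits)  /* can't reach the total */
         return 0;
      return 1;
   }

A similar bound works for misses, since a guess letter that already
appears in the prefix can no longer be a miss.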
At this point, deepmoo was fast enough (some rare worst-case 9-letter
words had one guess that took ~2 seconds on my P2-400).  I had more
pruning up my sleeve, but that would come later.

---------------------------------------------------

Now I was ready to improve the guess count.  For a while it had bugged
me that the full opening book was always pursued, even when it was
irrelevant.  For a word with no repeats, once deepmoo had identified
all the required letters, it could stop the opening book and just move
on to guessing.  Sure, it's useless if the word has repeated letters
or uses one of the last letters in the book, but what the heck, it
couldn't make things worse.

So I wrote special logic to count up 'partially determined' letters in
the opening book (if a guess "ETAO" gets 2 misses, we've found 2
letters, although we don't know which they are); each guess is still
handled independently, so I had to require the opening book to have no
letters in common (except the last word, which would never affect this
decision making).  Woo hoo!  Big speedup for a lot of 4 and 5 letter
words.

   // opening book
   char *tables[][20] =
   {
      { "ETAO", "INSH", "RDLU", "MPCG", "WYBF", "JKQV", "XZZZ" },
      { "ETAOI", "NSHRD", "LUMPC", "GWYBF", "JKQVX", "ZTNDG" },
      { "ETAOIX", "NSHRDZ", "LUMPCQ", "GWYBFV", "JKXZQV", },
      { "ETTAAAA", "OIINNNN", "SHHRRRR", "DLLUUUU", "MPPCCCC", "GWYBFJK", "VQXXZZZ" },
      { "ETTAAAAA", "NOIINNNN", "SHHRRRRR", "DLLUUUUU", "MPPCCCCC", "GWWYYYBB", "FFJJKQXZ" },
      { "ETTAAAAAA", "OIINNNNNN", "SHHRRRRRR", "DLLUUUUUU", "MPPCCCCCC", "GWWYYYYBB", "FJJKKKQXZ" },
   };

(You can see I deviate from binary after five guesses--I lose on words
which use the later letters, but save on words-with-repeats that
don't, because I get to "early out" on detecting that none of them
appear.  deepmoo is smart enough that that's not that big a deal; it
can pick up the unresolved letters at the same time that it's
"anagramming".)

At this point (it was never important before) it's important that the
opening book be in letter frequency order.  Although I never got
around to generating the frequencies past "ETAOINSHRDLU", and just
faked the rest.  In fact, none of the above had really been tuned,
especially the 5 and 6 letter words.

Results:

               buzzard6   deepmoo
   BABY           11         11
   STATE          11          9
   CHORAL         11          9
   ENVIOUS        13         13
   ACQUIRES                  13
   HIERARCHY                 19
   FLOOD          12         12
   RHYTHM         13         13
   STEROID        14         10
   QUIZZES                   12
   DANGEROUS                 12
   SYZYGY         12         10

Here's the run of deepmoo on HIERARCHY.  The opening book is 7
guesses.  As you can see, it knows exactly which letters are in the
word at guess 8.  You can also see that deepmoo pretty much sucks at
this sort of pure endgame--with repeated letters, it takes a long time
to collect the information.

   Guessing 'HIERARCHY'
    1 ETTAAAAAA 1 2
    2 OIINNNNNN 1 7
    3 SHHRRRRRR 2 1
    4 DLLUUUUUU 0 9
    5 MPPCCCCCC 1 3
    6 GWWYYYYBB 0 5
    7 FJJKKKQXZ 0 9
    8 CEIARRCYH 2 0
    9 CIEAHCRRY 3 0
   10 CIERCAHYR 3 0
   11 CIYEARHRC 3 0
   12 CIYRRCEHA 3 0
   13 CIHERAICY 2 0
   14 CHIRACHEY 3 0
   15 CYIRHAERC 1 0
   16 AIERARCHY 8 0
   17 RIERARCHY 8 0
   18 IIERARCHY 8 0
   19 HIERARCHY 9 0

You can also see the stupidity at the end, but I honestly have *no*
clue of any strategy that can better resolve this situation if you're
guessing purely abstract strings.  (This seems to be the big
limitation of hits/misses instead of bulls/cows.)  Clearly, it could
do better if it made use of knowledge of the English language.  Hey,
and AIE and IIE are pretty unlikely trigraphs...
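To make the binary-counting trick concrete against the game above,
here's a toy reconstruction (not the program's actual code, which as
noted gets this for free from the deep search) of how a miss count
pins letters down when a guess's letter multiplicities are
superincreasing--each multiplicity larger than all the smaller ones
combined, e.g. E x1, T x2, A x6 in "ETTAAAAAA".

   #include <stdio.h>

   /* Decode which letters of 'guess' are absent from the hidden word,
      given only the miss count, assuming each distinct letter's
      multiplicity exceeds the sum of all smaller multiplicities. */
   static void decode_misses(const char *guess, int misses)
   {
      int count[26] = {0}, i, c;

      for (i=0; guess[i]; ++i)
         ++count[guess[i] - 'A'];

      for (;;) {                       /* largest multiplicity first */
         int best = -1;
         for (c=0; c < 26; ++c)
            if (count[c] && (best < 0 || count[c] > count[best]))
               best = c;
         if (best < 0) break;
         if (count[best] <= misses) {
            misses -= count[best];
            printf("%c is not in the word\n", 'A' + best);
         } else {
            printf("%c is in the word\n", 'A' + best);
         }
         count[best] = 0;
      }
   }

   int main(void)
   {
      /* Guess 1 above, "ETTAAAAAA", scored 2 misses against HIERARCHY:
         A (x6) and E (x1) are present, T (x2) is not. */
      decode_misses("ETTAAAAAA", 2);
      return 0;
   }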
---------------------------------------------------

I went back to my trigraph code, and found a bug!  Most of my array
accesses are wrapped in macros (e.g. REJECT, DISALLOW) which
automatically convert from uppercase letters to array indices.  The
trigraph tables didn't have any such macros, and I was accessing them
without the necessary subtract, so the code was indexing randomly into
the table, effectively randomizing the search order (and hence
worsening the results).

Fixing the bug gave me an average savings of 1.0 guesses per word on
my 7-letter test words!  But I had to revert to an old version of the
program, from before I added the early short-circuiting of the opening
book, because I had already deleted the trigraph code from the current
one...

               buzzard6   deepmoo   trigraph improvements
   BABY           11         11         9
   STATE          11          9         8
   CHORAL         11          9
   ENVIOUS        13         13        11
   ACQUIRES                  13        11
   HIERARCHY                 19        12
   FLOOD          12         12         9
   RHYTHM         13         13        11
   STEROID        14         10
   QUIZZES                   12        11
   DANGEROUS                 12
   SYZYGY         12         10         7

The omitted entries performed worse with the revived trigraph program:
those words have no repeats, so they benefit from the early-out in
deepmoo, which the trigraph-enabled program doesn't have.

At the moment, my intention is to just *only* guess trigraph-allowed
words, rather than having the top-of-the-tree vs. bottom-of-the-tree
problem.  If I want completeness (e.g. guessing strings that weren't
in my training dictionary), I can do a full traversal of all strings
after exhausting all valid combinations of trigraphs.  I'm going to
upgrade to quadgraphs with an expectation of achieving inhuman
performance in return.  I may even be able to shorten the opening book
that way.

---------------------------------------------------

Feb 15, 1999

I spent a long time exploring the notion of "wordspace".

It's illegal for us to use a dictionary.  It's not illegal for us to
use digraphs or trigraphs or quadgraphs, though, is it?  (Well,
quadgraphs for 4-letter words would be a dictionary!)  I don't know
how to reconcile this, since any restriction on the "set of words the
program might generate" could be interpreted as a dictionary.  And the
workaround (explore the "dictionary" first, then fall back on the full
set of strings) prevents that definition from literally applying.
Even applying just a little bit of smarts (you can't have the same
three letters appear in a row) restricts the wordspace, so it doesn't
seem fair to define a dictionary as any restriction of a wordspace.

But what if I lossily compress a dictionary so that the wordspace I
pursue is the original dictionary plus another 10% of garbage words or
so?  What if it's 100% bigger?  What if it's 10 times as big?  What if
it's 100 times as big?

Here's information about some "restricted wordspaces" or "synthetic
dictionaries" for 9-letter words:

                   N
   5,429,503,678,976   Strings from AAAAAAAAA...ZZZZZZZZZ
     184,511,547,557   Three adjacent trigs (using 5693 found in the dictionary)
      62,902,335,833   Three adjacent trigs (using 3977 found in 9-letter words)
       1,832,210,380   Three adjacent trigs (1289 1st, 1988 2nd, 715 3rd)
       1,191,748,033   All 7 trigraphs valid (of 3977)
         100,129,697   All 7 trigs valid for that position
          65,278,757   All 6 quadgraphs valid (of 12580)
           3,813,306   All 6 quads positionally valid, from the full dict
             449,505   All 6 quads positionally valid, from the 9-letter-word dict
               6,475   9-letter words in the computer dictionary I've constructed

Consider now that deepmoo was originally searching the full space, but
with letter-frequency biases.  My recent trigraph change moved the
search space down to the second line.  The second-to-last line is less
than 100x as big as the original dictionary.
It is 40,000x smaller than the current space explored by deepmoo.
Unfortunately, it requires quadgraph stats unique to each position in
the word, for each length of word.  That's 21*26*26*26*26 bits (over
1M of data).  There's a lot of redundancy in that data if you know how
it's substructured; by encoding the quadgraphs relative to trigraphs,
and encoding the trigraphs relative to digraphs, I've managed to pack
all of that into about 80K.  Of course, the original dictionary is
about 300K and compresses with PKZIP to 100K, so we've lost a lot of
information; clearly simply compressing the dictionary would be
simpler.

I would create a fictional smaller dictionary which produced identical
(or slightly lossy) stats, if that were feasible in the space
required, but it wouldn't actually be small enough, and it might be
interpreted by the contest moderator as a violation of the rules (even
though it would really just be a way of encoding the stats).

I can probably manage to encode the line one up from that (the ~4M
"wordspace").  It saddens me not to get the brutally tight 449,505
"words", but oh well.  I printed those 449,505 "words" to a file;
here's a sample:

   abandabad abaterful abhoriton abomissen absordine
   aestinels amortiest appartina assurrier awartency
   bariality beholatti betranges blateware boycotted
   bufferser bumberade bungenity burgative burrifier
   buttentic bystament cabitiary calipheme calloidal
   camburson camperity candacist candwagor canterina
   capricial carcasist carditory carennell

Sure, they all look pretty silly, but imagine a moo-playing program
all of whose guesses looked like that.  You would have to grant that
while it didn't really know English (the way a human actually knows a
vocabulary), it sure knows how to construct words that fit the
patterns of English.

In fact, consider this: once you know that the only letters in the
word are the letters from 'HIERARCHY' (which it did after 7 guesses),
there are only 76 "words" in this wordspace which fit that constraint
and the 449,505-word unique-position quadgraph technique:

   accracier arachaire archarier archerary archerche
   archerchy archerich archerier carcerich carcerier
   carchaire carracier carriarch carricacy carricher
   characier cherarche cherarchy cherarier chericacy
   chericher chicherer chicheria chicheric chicherie
   crachaire creachere creachery creachier hairchere
   hairchery hairchier harachere harachery harachier
   harracier harriarch harricacy harricher hearchere
   hearchery hearchier herachere herachery herachier
   hercerich hercerier herchaire herichier hierarche
   hierarchy hierarier reachaire rearchere rearchery
   rearchier recharier recherary recherche recherchy
   recherich recherier rehearier richerary richerche
   richerchy richerich richerier yacherary yacherche
   yacherchy yacherich yacherier yearchere yearchery
   yearchier

I have a feeling deepmoo would be able to solve this in about two or
three guesses, tops (after the opening book).  Sadly, as I said, I
can't find any way to fit the necessary positional quadgraph
information into the allowed space.  Oh well.

---------------------------------------------------

Feb 19, 1999

Positional quadgraphs that aren't unique to each length (shared
between differing lengths, but unique to each position) totally rock.
deepmoo can now process about 100 words per second, so it'll take five
or six minutes to process my entire dictionary.  The speedup comes
from the quadgraphs trimming the search space so enormously that the
search goes much faster.  With this much speed, I can tune the opening
book against the entire dictionary!
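Here is a sketch of what the positional quadgraph filter amounts to
inside the search; the table name and layout are assumptions, and the
real tables are shared across word lengths but specific to position,
as just described.

   #define QPOS   6                  /* a 9-letter word has 6 quadgraphs */
   #define IDX(c) ((c) - 'A')

   extern unsigned char quad_ok[QPOS][26][26][26][26];

   /* A candidate survives only if every overlapping 4-letter window is
      a quadgraph that is valid for its position. */
   static int quads_allow(const char *w, int len)
   {
      int p;
      for (p=0; p+3 < len; ++p)
         if (!quad_ok[p][IDX(w[p])][IDX(w[p+1])][IDX(w[p+2])][IDX(w[p+3])])
            return 0;
      return 1;
   }

   /* Inside build(), only the newest window needs checking as each
      letter i is tried at position dist:
         if (dist >= 3 &&
             !quad_ok[dist-3][IDX(str[dist-3])][IDX(str[dist-2])]
                             [IDX(str[dist-1])][IDX(i)])
            continue;                                              */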
In fact, deepmoo now goes so fast that I added a (noticeably slower)
mode which, during the opening book, goes ahead and scans the search
space to see if there's only one constructible word that matches the
opening book feedback so far; if so, it aborts the book even earlier
than normal.  This is only a small savings over the basic positional
quadgraph code, though; the quadgraph code by itself already saved
around 30% of guesses on average.

Unfortunately, I only get 25K of storage, and I can't get the
positional quadgraphs down below 40-60K.

Current results; improvements are marked with a *.  I expect 4-letter
words will do better with custom trigraph support.  (Need to avoid
using a list of '4-letter strings' for 4-letter words.)

               older version without short-circuiting opening book
               |     previous (non-trigraph) version
               |     |     quadgraph version
               |     |     |
   BABY        11    11    10*
   STATE        9     9     7*
   CHORAL      11     9*    7*
   ENVIOUS     13    13     8*
   ACQUIRES    13    13     9*
   HIERARCHY   19    19     8**
   FLOOD       12    12     8*
   RHYTHM      13    13     8*
   STEROID     13    10*    6*
   QUIZZES     12    12     8*
   DANGEROUS   13    12*    7*
   SYZYGY      10    10     6*

The aforementioned "abort the book early if only one word fits" seems
to fire on about 1 out of 20 words, but it didn't fire on any of
these.

Next up is figuring out how close to that I can get within the space
limitations, and then tuning the opening book.

Here are some actual games it plays.  Note that on many of them it
simply guesses correctly on the first guess after it stops the opening
book (which suggests I may want to shorten the opening book); I've
marked the first non-opening-book guess with a '*'.

   Guessing 'STATE'            Guessing 'CHORAL'
    1 ETAOI 2 2                 1 ETAOIX 0 4
    2 NSHRD 0 4                 2 NSHRDZ 1 4
    3 LUMPC 0 5                 3 LUMPCQ 0 4
    4 GWYBF 0 5                *4 SECROP 1 3
    5 JKQVX 0 5                 5 SOULDE 0 4
    6 ZTNDG 1 4                 6 CAIRLS 2 2
   *7 STATE 5 0                 7 CHORAL 6 0

   Guessing 'ACQUIRES'         Guessing 'HIERARCHY'
    1 ETTAAAAA 0 2              1 ETTAAAAAA 1 2
    2 NOIINNNN 0 6              2 OIINNNNNN 1 7
    3 SHHRRRRR 1 2              3 SHHRRRRRR 2 1
    4 DLLUUUUU 1 3              4 DLLUUUUUU 0 9
    5 MPPCCCCC 0 3              5 MPPCCCCCC 1 3
    6 GWWYYYBB 0 8              6 GWWYYYYBB 0 5
    7 FFJJKQXZ 0 7              7 FJJKKKQXZ 0 9
   *8 CASKERIU 1 1             *8 HIERARCHY 9 0
    9 ACQUIRES 8 0

   Guessing 'QUIZZES'          Guessing 'SYZYGY'
    1 ETTAAAA 0 6               1 ETAOIX 0 6
    2 OIINNNN 1 5               2 NSHRDZ 0 4
    3 SHHRRRR 0 6               3 LUMPCQ 0 6
    4 DLLUUUU 0 3               4 GWYBFV 0 4
    5 MPPCCCC 0 7               5 JKXZQV 0 5
    6 GWYBFJK 0 7              *6 SYZYGY 6 0
    7 VQXXZZZ 1 3
   *8 QUIZZES 7 0

---------------------------------------------------

Feb 23, 1999

No positive data to report.  The smallest I've managed to compress the
data down to is about 40K, which zip manages to reduce to something
like 24782 bytes, but this isn't really meaningful since those are
8-bit bytes, and I have to encode into 6-bit characters.  (The 40K is
already using 6-bit characters; zip's compression scheme is good
enough to compress random 6-bit data down by 1/4.)

I came up with the original scheme for compressing quadgraphs way back
when I was originally playing with trigraphs.  Let's start with
trigraphs: the naive approach is to notice that some rows of the table
(a giant 3-dimensional 26x26x26 table) are all 0s, and omit them.
Interestingly, a row (say the 'kc' row) is non-empty exactly when
there is some valid trigraph 'kc*'; 'kc' is *only* going to be a valid
digraph if that row is non-empty, so it's an if-and-only-if situation.
Now, for non-empty rows, you still have to encode all 26 entries,
which is kind of wasteful.  But, hey, there's symmetry here.  'kca' is
only a valid trigraph if 'kc' is a valid digraph and 'ca' is a valid
digraph!  Moreover, this works in a cool positional way.
'kca' can only be a valid trigraph in the fourth position if 'kc' is a
valid digraph in the fourth position, and 'ca' is a valid digraph in
the fifth position.  So, if we know the valid digraphs, we can easily
cull out a bunch of trigraphs.  Now, any time we get a combination
that looks valid from the digraphs, e.g. 'kca', we still have to store
a bit which says whether it is *truly* a valid trigraph.  (It is,
because of 'bookcase', although it kinda sucks when you generate a
five-letter word ending in 'okc', or a six-letter word ending in
'kca', which is why ideally you'd separate out this data for each word
length, but that's just WAY TOO MUCH data.)

Quadgraph compression can naturally build on trigraph compression in
the same way:

   BOOK = BOO + OOK
   OOKC = OOK + OKC
   etc.

   BOOKCASE    BOOKCASE    BOOKCASE
   BO          BOO         BOOK
   OO          OOK         OOKC
   OK          OKC         OKCA
   KC          KCA         KCAS
   CA          CAS         CASE
   AS          ASE
   SE

Now, compressing things this way, I'm culling out a lot of 0's, that
is, a lot of non-quadgraphs.  But I'm not culling out any 1's; I still
have at least 1 bit per original 1 bit.  Is this problematic?  The
data I'm storing (6*26*26*26*26 quadgraph bits) is about 2.4M bits.
Of those bits, about 41K are non-zero.  I get 6 bits per character in
source code, so just counting the 1 bits, I'm allowed about 3 stored
bits per 1 bit.  Tight, but not infeasible.

The above encoding, however, is not tight enough.  It averages around
4 bits per 1 bit, *plus* I have to store all the trigraph and digraph
tables, which add up.

So, here are some clever schemes.  Up until this point, I was encoding
the quadgraphs based on trigraphs generated from the original data,
not on trigraphs derived from the quadgraphs themselves.  (They would
be basically the same.)  If I want to merge some quadgraph data
together, e.g. use the same quadgraph info for two positions, I'll
need to similarly merge the trigraph info.  It might be better to
simply generate, independently, the trigraph info needed for each
quadgraph table.

Anyway, the quadgraphs were still just too sparse.  Then I had this
clever idea.  Up to now, I was storing a bit for 'OKCA' iff 'OKC' was
a trigraph in the first position and 'KCA' was a trigraph in the
second position.  Let's write these trigraphs relative to the
quadgraph for a moment: I had a trigraph 'OKC*' and a trigraph '*KCA';
therefore a quadgraph 'OKCA' was also plausible.  Here's the sneaky
trick: maybe I could store "trigraphs" of the form 'OK*A' and 'O*CA'
as well, and only encode a bit if ALL FOUR are valid.  This lets me
squeeze out even more 0 bits, though of course every 1 bit will still
need a 1 bit.

Guess what?  It worked!  I got the fully-positional quadgraph data
down to about 18K of source!  zip compresses it to 13K!

Except...  All of those new pseudo-trigraphs aren't shared between
multiple quads, so there are now 3 times as many trigraphs.
Additionally, the new trigraphs, which have "holes" in them, no longer
represent sound linguistic patterns.  "SYZYGY" generates 'S*ZY' and
'Y*YG' and all sorts of bizarre "trigraphs" that themselves don't
compress very well.  The best results I've gotten come from using one
or the other (i.e. 'O*CA' or 'OK*A', but not both); this squeezes the
quadgraphs down without adding *too* much trigraph data.  This
produces the aforementioned 40K, 25K zipped.
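To pin the scheme down, here's a rough sketch of the decoding side in
its basic (pre-holey-trigraph) form; a quadgraph bit exists in the
packed stream only for combinations that both overlapping positional
trigraphs allow, so the decoder walks the candidates in the same order
the encoder did.  The names, shapes, and next_bit() are illustrative,
not the exact code.

   #define QPOS 6      /* quadgraph positions in a 9-letter word */

   extern unsigned char tri_ok[QPOS+1][26][26][26]; /* positional trigraphs */
   extern int next_bit(void);          /* pulls one bit from the packed data */

   /* Rebuild the quadgraph table for position p.  A bit was stored for
      'abcd' only when 'abc' is a valid trigraph at p AND 'bcd' is a
      valid trigraph at p+1; everything else is known to be 0 and costs
      nothing to encode. */
   static void decode_quads_at(int p, unsigned char out[26][26][26][26])
   {
      int a, b, c, d;
      for (a=0; a < 26; ++a)
         for (b=0; b < 26; ++b)
            for (c=0; c < 26; ++c)
               for (d=0; d < 26; ++d)
                  out[a][b][c][d] =
                     (tri_ok[p][a][b][c] && tri_ok[p+1][b][c][d])
                        ? next_bit() : 0;   /* bit stored only if plausible */
   }

The holey-trigraph refinement just adds more conditions of the same
shape to that test (and, correspondingly, more tables to store).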