POTM: MOO

This is a "journal" of my attempts at tackling the "MOO" problem, a
variant of Bulls and Cows.  I put "journal" in quotes because I wrote
most of it after doing most of the work, not as I went, so dates and
data are not as complete as I would like.

The problem is at:  http://members.tripod.com/~POTM

---------------------------------------------------

First thoughts: How can we use English frequency data?  How can we
even tackle the problem?

Clearly the optimal strategy is to split the search space maximally.
(Well, unless you want to win by luck and allow your search tree to be
biased; and of course you might well want your search tree biased in
favor of English words and against 'AAAAAAAAA'; but let's set that
aside as a refinement on the general question.)

Optimal splitting of the search space is probably way too hard to
compute.  We can precompute an optimal "static split", wherein we use
the same guesses regardless of the results.  This is optimal in the
sense that it is the best for any attack that ignores the feedback.
(At least during this "opening" game.)

I wrote a random list of nine words, and fed in only the 4-9 letter
words from my /usr/dict/words.  After one tweak, this partitioned most
of /usr/dict/words uniquely--17,000 singletons, and fewer than 1000
doubletons and triplets.  (A doubleton is two words which get the
exact same feedback from every guess.)

However, I then began to suspect this isn't really very useful if
you're not allowed to use a dictionary when guessing; you'll have
split up the space of "real words", but not the space of "candidate
words my program will end up pursuing"--and the latter is what you
need to optimize for.  [Some of this insight came much later; or
rather, such a clear explanation of what really matters came later.]

---------------------------------------------------

Dan "dfan" Schmidt immediately wrote a program to referee a game and
let *him* play it, so he could devise strategies.  I instead decided
to explore doing something humans can't easily do: deep tree search.

It might be the case that, given the current position, only one string
will validly match all the preceding guesses.  A simple recursive
routine like this will find it:

   int build(char *str, int dist)
   {
      int i, res;
      if (dist == length)
         return is_valid(str);
      for (i='A'; i <= 'Z'; ++i) {
         str[dist] = i;
         res = build(str, dist+1);
         if (res) return res;
      }
      return FALSE;
   }

Furthermore, I can just throw this at the entire problem.  However, an
initial guess of 'AAAAAAAAA' isn't going to help much.  On the other
hand, I can use those starting words from my other list.

This led to a program called "buzzard2.c", which made a 2-word
starting guess and then applied the above.  It was slow, but it was
using way fewer guesses than the programs my friends were writing at
the same time (except for somebody who wrote a program that used a
dictionary, just to see how efficient that could be).

---------------------------------------------------

On observing some stupidity (with only two static guesses, buzzard2
eventually settled on STA_E while searching for STATE, and then
proceeded to guess every letter alphabetically for the _, excepting
the few letters from the opening book that it had determined didn't
belong), I expanded the opening book to 6 words--sufficient to guess
the entire alphabet for all but 4-letter words.  This made buzzard6
converge faster.
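Since everything that follows leans on the exact feedback semantics,
here is a minimal sketch--a reconstruction, not the contest's actual
referee code--of what a hit and a miss are, and of the is_valid()
check that build() relies on: a hit is a guess letter that matches the
hidden word at the same position, and a miss is a guess position whose
letter appears nowhere in the hidden word.

   /* Sketch only: the real referee and data structures may differ. */
   #include <string.h>

   #define MAXGUESS 64

   static int  length;                      /* word length for this game */
   static int  num_guesses;
   static char past_guess[MAXGUESS][16];
   static int  past_hits[MAXGUESS], past_misses[MAXGUESS];

   static void score(const char *word, const char *guess,
                     int *hits, int *misses)
   {
      int i;
      *hits = *misses = 0;
      for (i=0; i < length; ++i) {
         if (guess[i] == word[i])
            ++*hits;                         /* right letter, right place */
         else if (memchr(word, guess[i], length) == NULL)
            ++*misses;                       /* letter nowhere in the word */
      }
   }

   static int is_valid(const char *candidate)
   {
      int g, h, m;
      for (g=0; g < num_guesses; ++g) {
         score(candidate, past_guess[g], &h, &m);
         if (h != past_hits[g] || m != past_misses[g])
            return 0;                        /* contradicts known feedback */
      }
      return 1;
   }

With that in hand, the "all-miss" and "0-hit" guesses used by the
pruning below have precise meanings.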
The bigger book also made it likely that some opening book words would
score all-misses, so some simple pruning in build() sped buzzard6 up a
lot over buzzard2:

   int build(char *str, int dist)
   {
      int i, res;
      if (dist == length)
         return is_valid(str);
      for (i='A'; i <= 'Z'; ++i) {
         if (REJECT(i)) continue;
         if (DISALLOW(i, dist)) continue;
         str[dist] = i;
         res = build(str, dist+1);
         if (res) return res;
      }
      return FALSE;
   }

REJECT handles letters that appeared in all-miss guesses; DISALLOW
handles letters that appeared in that position in 0-hit guesses.

Still, buzzard6 was pretty slow.  Here is some test data I collected,
allowing it to run as long as 15 minutes on a P2-400:

   BABY        11    (this had been 10 on buzzard2, I believe)
   STATE       11
   CHORAL      11
   ENVIOUS     13
   ACQUIRES
   HIERARCHY
   FLOOD       12
   RHYTHM      13
   STEROID     14
   QUIZZES
   DANGEROUS
   SYZYGY      12

Pretty rocking!  This was kicking ass over my friends' programs.  Of
course, it was too slow, but I was pretty sure I could write clever
pruning code and get the speed back without changing the numbers at
all.  (If not, it could always punt out early with an incomplete
guess, hopefully narrowing the search space for the next iteration.)

Here's the opening book from buzzard6, comment intact:

   char *tables[6][20] =
   {
      // try to improve these tables genetically or simulated annealing or something
      { "ETAO", "INSH", "RDLU", "MPCG", "WYBF", "JKQV", "XZZZ" },
      { "ETAOI", "NSHRD", "LUMPC", "GWYBF", "JKQVX", "ZTNDG" },
      { "ETAOIX", "NSHRDX", "LUMPCZ", "GWYBFX", "JKQVXX", "ZTNDZZ" },
      { "ETAOXXX", "INSHZZZ", "RDLUQQQ", "XXXMPCG", "ZZZWYBF", "QQQQJKV" },
      { "ETAOXXXX", "INSHZZZZ", "RDLUQQQQ", "XXXXMPCG", "ZZZZWYBF", "QQQQQJKV" },
      { "ETAOXXXXX", "INSHZZZZZ", "RDLUQQQQQ", "XXXXXMPCG", "ZZZZZWYBF", "QQQQQQJKV" },
   };

You don't want 9 unique letters per guess with 9-letter words, because
the odds are high that at least one letter won't miss, and that means
REJECT pruning won't occur.

---------------------------------------------------

Well, it's still probably too large a search space to find the optimal
opening book of static guesses, but I wanted to refine it, and I
decided one refinement would be trying to increase the chance of
getting hits in the opening book.  This led me to the idea of
collecting letter frequency stats by position.  (I could and perhaps
should have collected letter frequency by position by word length, but
I didn't feel like dealing with that complication.)  I collected stats
like this:

   a:  1153  2150  1403  1093  1150  1182  1028   549   191    27
   b:  1018   137   363   311   314   198   117   105     5     0
   c:  1649   302   779   758   549   497   325   252   168    79
   [etc.]

Then I started messing around with building a book that better fit
these frequencies, but set it aside when I realized how easily they
would fit into the build() logic: instead of iterating from A..Z,
iterate in order of letter frequency *for that position*.  I took the
above table, sorted it on a particular column, then went in by hand
and concatenated all of the letters.  Because I wasn't really
thinking, the sort ended up running from least frequent to most
frequent, which is why the frequency table code in my finished program
is still "backwards" as well.

This changed buzzard6's stats, so it was time to give it a new name
for the page of results I was maintaining for my friends; I called it
"deepmoo", drawing a parallel with chess-playing programs: an opening
book plus deep tree search, combined with an inhuman ability to draw
conclusions from all the previous guesses.

Finally, I undertook tree pruning.
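(Before getting to the pruning: the per-position ordering just
described amounts to replacing build()'s A..Z loop with a lookup into
a precomputed order table.  A minimal sketch, with names of my own and
sorted most-frequent-first--the table in my real program ended up
backwards, as noted:)

   #include <stdlib.h>

   #define MAXLEN 9                 /* MOO words are 4-9 letters */

   static int  freq[MAXLEN][26];    /* letter counts by position, as above */
   static char order[MAXLEN][26];   /* letters to try at each position */

   static int cmp_pos;              /* which position the comparator uses */
   static int cmp_letters(const void *a, const void *b)
   {
      int la = *(const char *)a - 'A', lb = *(const char *)b - 'A';
      return freq[cmp_pos][lb] - freq[cmp_pos][la];  /* most frequent first */
   }

   static void build_orders(void)
   {
      int pos, i;
      for (pos=0; pos < MAXLEN; ++pos) {
         for (i=0; i < 26; ++i)
            order[pos][i] = 'A' + i;
         cmp_pos = pos;
         qsort(order[pos], 26, 1, cmp_letters);
      }
   }

   /* build() then does:
         for (k=0; k < 26; ++k) { i = order[dist][k]; ... }
      instead of:
         for (i='A'; i <= 'Z'; ++i) { ... }                      */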
The crucial problem with the search as I saw it at this point was that
it would merrily descend into the tree even though a given prefix was
already doomed to violate some basic property required by the known
guess information.  For example, if a prior nine-letter guess got 7
hits, then a tentative solution with no hits in its first three
letters can't be a real solution.  So I wrote a function called
'score_partial_guess', and then added logic to decide whether a prefix
could be pruned out.

Nonetheless, all the pruning logic written at this point was based on
individual past guesses considered apart from the others; the deep
tree search produced inferences across multiple guesses, but none of
the pruning did.  That was somewhat unsatisfying, but this pruning
(and then replacing score_partial_guess with a very convoluted
incremental system, which kept running hit/miss counts for each old
guess against the current partial word and updated them in O(1) as
each new letter was added) gave me enough of a speedup.  (I don't have
the original deepmoo scores, but they're pretty similar to the ones
below.)

---------------------------------------------------

Also around this time (maybe before it), I decided to try to put in
trigraph support.  I had collected trigraph stats earlier (even gave
them away to my friends), but hadn't tried to use them.  So I hooked
them up.  I basically just used them as a binary test: if the current
letter to be added made a trigraph that ever appeared, ok; if not,
save it for last at this branch of the tree.

The problem with this approach is that it made the trigraphs
meaningful near the top of the tree (the left of the word), but not to
the right; it would happily explore words with total gobbledygook on
the right before fully exploring the non-gobbledygook.  Also, it
worsened my average guess score (I had 40 7-letter words I used as my
testbed) by about 0.4, so I didn't use them.  Later it turned out this
was due to a bug; I never actually bothered studying the code or the
output carefully to make sure it was doing the right thing, since that
would have required me to consult an actual triplet table.

---------------------------------------------------

I was studying the output of deepmoo's guesses on a particular word,
and was a little surprised at an inference it had made.  A little
thought led me to a wondrous revelation about something you can do
with hits and misses, as opposed to bulls/cows (i.e. mastermind-style
black/white).  Consider a guess like:

   "EEEEAAO"

You can read the number of misses as a binary number that tells you
exactly which of E, A, and O appear in the word, with no ambiguity!
(E.g. 5 misses = 4 misses for the 'E' and 1 miss for the 'O'.)

Double whammy: (a) I didn't have to lengthen my opening book much,
because my 7+ letter book words already had only 4 unique letters each
anyway.  (b) deepmoo didn't require any special "binary counting"
logic; I just updated the opening book, and the perfect-deep-search
technology did its magic!  This significantly improved my results on
7+ letter words.

---------------------------------------------------

Time for better pruning.  Now I could go through and accurately
determine that certain letters were definitely not present, or
definitely were present.  (This was all in the pruning logic and
didn't affect the search results.)  Rather than use binary math, I
just wrote some code to solve a general integer packing problem, and
used that to evaluate whether a given letter was or was not present.
(Actually, this isn't quite true; as I recall, I only tested one of
these initially.)
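Here is a sketch of the kind of partial-word test score_partial_guess
made possible, using just the hit-count bound from the example above;
the struct and names are illustrative, and as described, the real code
eventually kept these counts incrementally instead of recomputing them
for every prefix.

   /* One recorded guess and its referee feedback (illustrative). */
   struct old_guess {
      char text[16];
      int  hits;          /* positional matches reported */
      int  misses;        /* guess letters reported absent from the word */
   };

   /* Can a word starting with str[0..dist-1] still satisfy the recorded
      hit count for this guess?  Returns 0 if the prefix is doomed. */
   static int prefix_feasible(const char *str, int dist, int length,
                              const struct old_guess *g)
   {
      int i, hits_so_far = 0;
      for (i=0; i < dist; ++i)
         if (str[i] == g->text[i])
            ++hits_so_far;
      if (hits_so_far > g->hits)                    /* too many hits already */
         return 0;
      if (hits_so_far + (length - dist) < g->hits)  /* can't reach the total */
         return 0;
      return 1;
   }

A similar bound works for misses, since a guess letter that already
appears in the prefix can no longer be a miss.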
At this point, deepmoo was fast enough (some rare worst-case 9-letter
words had one guess that took ~2 seconds on my P2-400).  I had more
pruning up my sleeve, but that would come later.

---------------------------------------------------

Now I was ready to improve the guess count.  For a while it had bugged
me that the full opening book was always pursued, even when it was
irrelevant.  For a word with no repeats, once deepmoo had identified
all the required letters, it could stop the opening book and just move
on to guessing.  Sure, it's useless if the word has repeated letters
or uses one of the last letters in the book, but what the heck, it
couldn't make things worse.

So I wrote special logic to count up 'partially determined' letters in
the opening book (if a guess "ETAO" gets 2 misses, we've found 2
letters, although we don't know which they are); each guess is still
handled independently, so I had to require the opening book to have no
letters in common (except the last word, which would never affect this
decision making).  Woo hoo!  Big speedup for a lot of 4 and 5 letter
words.

   // opening book
   char *tables[][20] =
   {
      { "ETAO", "INSH", "RDLU", "MPCG", "WYBF", "JKQV", "XZZZ" },
      { "ETAOI", "NSHRD", "LUMPC", "GWYBF", "JKQVX", "ZTNDG" },
      { "ETAOIX", "NSHRDZ", "LUMPCQ", "GWYBFV", "JKXZQV", },
      { "ETTAAAA", "OIINNNN", "SHHRRRR", "DLLUUUU", "MPPCCCC", "GWYBFJK", "VQXXZZZ" },
      { "ETTAAAAA", "NOIINNNN", "SHHRRRRR", "DLLUUUUU", "MPPCCCCC", "GWWYYYBB", "FFJJKQXZ" },
      { "ETTAAAAAA", "OIINNNNNN", "SHHRRRRRR", "DLLUUUUUU", "MPPCCCCCC", "GWWYYYYBB", "FJJKKKQXZ" },
   };

(You can see I deviate from binary after five guesses--I lose on words
which use the later letters, but save on words-with-repeats that
don't, because I get to "early out" on detecting that none of them
appear.  deepmoo is smart enough that that's not that big a deal; it
can pick up the unresolved letters at the same time that it's
"anagramming".)

At this point (it was never important before) it's important that the
opening book be in letter frequency order.  Although I never got
around to generating the frequencies past "ETAOINSHRDLU", and just
faked the rest.  In fact, none of the above had really been tuned,
especially the 5 and 6 letter words.

Results:

               buzzard6   deepmoo
   BABY           11         11
   STATE          11          9
   CHORAL         11          9
   ENVIOUS        13         13
   ACQUIRES                  13
   HIERARCHY                 19
   FLOOD          12         12
   RHYTHM         13         13
   STEROID        14         10
   QUIZZES                   12
   DANGEROUS                 12
   SYZYGY         12         10

Here's the run of deepmoo on HIERARCHY.  The opening book is 7
guesses.  As you can see, it knows exactly which letters are in the
word at guess 8.  You can also see that deepmoo pretty much sucks at
this sort of pure endgame--with repeated letters, it takes a long time
to collect the information.

   Guessing 'HIERARCHY'
    1 ETTAAAAAA 1 2
    2 OIINNNNNN 1 7
    3 SHHRRRRRR 2 1
    4 DLLUUUUUU 0 9
    5 MPPCCCCCC 1 3
    6 GWWYYYYBB 0 5
    7 FJJKKKQXZ 0 9
    8 CEIARRCYH 2 0
    9 CIEAHCRRY 3 0
   10 CIERCAHYR 3 0
   11 CIYEARHRC 3 0
   12 CIYRRCEHA 3 0
   13 CIHERAICY 2 0
   14 CHIRACHEY 3 0
   15 CYIRHAERC 1 0
   16 AIERARCHY 8 0
   17 RIERARCHY 8 0
   18 IIERARCHY 8 0
   19 HIERARCHY 9 0

You can also see the stupidity at the end, but I honestly have *no*
clue of any strategy that can better resolve this situation if you're
guessing purely abstract strings.  (This seems to be the big
limitation of hits/misses instead of bulls/cows.)  Clearly, it could
do better if it made use of knowledge of the English language.  Hey,
and AIE and IIE are pretty unlikely trigraphs...
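To make the binary-counting trick concrete against the game above,
here's a toy reconstruction (not the program's actual code, which as
noted gets this for free from the deep search) of how a miss count
pins letters down when a guess's letter multiplicities are
superincreasing--each multiplicity larger than all the smaller ones
combined, e.g. E x1, T x2, A x6 in "ETTAAAAAA".

   #include <stdio.h>

   /* Decode which letters of 'guess' are absent from the hidden word,
      given only the miss count, assuming each distinct letter's
      multiplicity exceeds the sum of all smaller multiplicities. */
   static void decode_misses(const char *guess, int misses)
   {
      int count[26] = {0}, i, c;

      for (i=0; guess[i]; ++i)
         ++count[guess[i] - 'A'];

      for (;;) {                       /* largest multiplicity first */
         int best = -1;
         for (c=0; c < 26; ++c)
            if (count[c] && (best < 0 || count[c] > count[best]))
               best = c;
         if (best < 0) break;
         if (count[best] <= misses) {
            misses -= count[best];
            printf("%c is not in the word\n", 'A' + best);
         } else {
            printf("%c is in the word\n", 'A' + best);
         }
         count[best] = 0;
      }
   }

   int main(void)
   {
      /* Guess 1 above, "ETTAAAAAA", scored 2 misses against HIERARCHY:
         A (x6) and E (x1) are present, T (x2) is not. */
      decode_misses("ETTAAAAAA", 2);
      return 0;
   }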
---------------------------------------------------

I went back to my trigraph code, and found a bug!  Most of my array
accesses are wrapped in macros (e.g. REJECT, DISALLOW) which
automatically convert from uppercase letters to array indices.  The
trigraph tables didn't have any such macros, and I was accessing them
without the necessary subtract, so the code was indexing randomly into
the table, effectively randomizing the search order (and hence
worsening the results).

Fixing the bug gave me an average savings of 1.0 guesses per word on
my 7-letter test words!  But I had to revert to an old version of the
program, from before I added the early short-circuiting of the opening
book, because I had already deleted the trigraph code from the current
one...

               buzzard6   deepmoo   trigraph improvements
   BABY           11         11         9
   STATE          11          9         8
   CHORAL         11          9
   ENVIOUS        13         13        11
   ACQUIRES                  13        11
   HIERARCHY                 19        12
   FLOOD          12         12         9
   RHYTHM         13         13        11
   STEROID        14         10
   QUIZZES                   12        11
   DANGEROUS                 12
   SYZYGY         12         10         7

The omitted entries performed worse with the revived trigraph program:
those words have no repeats, so they benefit from the early-out in
deepmoo, which the trigraph-enabled program doesn't have.

At the moment, my intention is to just *only* guess trigraph-allowed
words, rather than having the top-of-the-tree vs. bottom-of-the-tree
problem.  If I want completeness (e.g. guessing strings that weren't
in my training dictionary), I can do a full traversal of all strings
after exhausting all valid combinations of trigraphs.  I'm going to
upgrade to quadgraphs with an expectation of achieving inhuman
performance in return.  I may even be able to shorten the opening book
that way.

---------------------------------------------------

Feb 15, 1999

I spent a long time exploring the notion of "wordspace".

It's illegal for us to use a dictionary.  It's not illegal for us to
use digraphs or trigraphs or quadgraphs, though, is it?  (Well,
quadgraphs for 4-letter words would be a dictionary!)  I don't know
how to reconcile this, since any restriction on the "set of words the
program might generate" could be interpreted as a dictionary.  And the
workaround (explore the "dictionary" first, then fall back on the full
set of strings) prevents that definition from literally applying.
Even applying just a little bit of smarts (you can't have the same
three letters appear in a row) restricts the wordspace, so it doesn't
seem fair to define a dictionary as any restriction of a wordspace.

But what if I lossily compress a dictionary so that the wordspace I
pursue is the original dictionary plus another 10% of garbage words or
so?  What if it's 100% bigger?  What if it's 10 times as big?  What if
it's 100 times as big?

Here's information about some "restricted wordspaces" or "synthetic
dictionaries" for 9-letter words:

                   N
   5,429,503,678,976   Strings from AAAAAAAAA...ZZZZZZZZZ
     184,511,547,557   Three adjacent trigs (using 5693 found in the dictionary)
      62,902,335,833   Three adjacent trigs (using 3977 found in 9-letter words)
       1,832,210,380   Three adjacent trigs (1289 1st, 1988 2nd, 715 3rd)
       1,191,748,033   All 7 trigraphs valid (of 3977)
         100,129,697   All 7 trigs valid for that position
          65,278,757   All 6 quadgraphs valid (of 12580)
           3,813,306   All 6 quads positionally valid, from the full dict
             449,505   All 6 quads positionally valid, from the 9-letter-word dict
               6,475   9-letter words in the computer dictionary I've constructed

Consider now that deepmoo was originally searching the full space, but
with letter-frequency biases.  My recent trigraph change moved the
search space down to the second line.  The second-to-last line is less
than 100x as big as the original dictionary.
It is 40,000x smaller than the current space explored by deepmoo.
Unfortunately, it requires quadgraph stats unique to each position in
the word, for each length of word.  That's 21*26*26*26*26 bits (over
1M of data).  There's a lot of redundancy in that data if you know how
it's substructured; by encoding the quadgraphs relative to trigraphs,
and encoding the trigraphs relative to digraphs, I've managed to pack
all of that into about 80K.  Of course, the original dictionary is
about 300K and compresses with PKZIP to 100K, so we've lost a lot of
information; clearly simply compressing the dictionary would be
simpler.

I would create a fictional smaller dictionary which produced identical
(or slightly lossy) stats, if that were feasible in the space
required, but it wouldn't actually be small enough, and it might be
interpreted by the contest moderator as a violation of the rules (even
though it would really just be a way of encoding the stats).

I can probably manage to encode the line one up from that (the ~4M
"wordspace").  It saddens me not to get the brutally tight 449,505
"words", but oh well.  I printed those 449,505 "words" to a file;
here's a sample:

   abandabad abaterful abhoriton abomissen absordine
   aestinels amortiest appartina assurrier awartency
   bariality beholatti betranges blateware boycotted
   bufferser bumberade bungenity burgative burrifier
   buttentic bystament cabitiary calipheme calloidal
   camburson camperity candacist candwagor canterina
   capricial carcasist carditory carennell

Sure, they all look pretty silly, but imagine a moo-playing program
all of whose guesses looked like that.  You would have to grant that
while it didn't really know English (the way a human actually knows a
vocabulary), it sure knows how to construct words that fit the
patterns of English.

In fact, consider this: once you know that the only letters in the
word are the letters from 'HIERARCHY' (which it did after 7 guesses),
there are only 76 "words" in this wordspace which fit that constraint
and the 449,505-word unique-position quadgraph technique:

   accracier arachaire archarier archerary archerche
   archerchy archerich archerier carcerich carcerier
   carchaire carracier carriarch carricacy carricher
   characier cherarche cherarchy cherarier chericacy
   chericher chicherer chicheria chicheric chicherie
   crachaire creachere creachery creachier hairchere
   hairchery hairchier harachere harachery harachier
   harracier harriarch harricacy harricher hearchere
   hearchery hearchier herachere herachery herachier
   hercerich hercerier herchaire herichier hierarche
   hierarchy hierarier reachaire rearchere rearchery
   rearchier recharier recherary recherche recherchy
   recherich recherier rehearier richerary richerche
   richerchy richerich richerier yacherary yacherche
   yacherchy yacherich yacherier yearchere yearchery
   yearchier

I have a feeling deepmoo would be able to solve this in about two or
three guesses, tops (after the opening book).  Sadly, as I said, I
can't find any way to fit the necessary positional quadgraph
information into the allowed space.  Oh well.

---------------------------------------------------

Feb 19, 1999

Positional quadgraphs that aren't unique to each length (shared
between differing lengths, but unique to each position) totally rock.
deepmoo can now process about 100 words per second, so it'll take five
or six minutes to process my entire dictionary.  The speedup comes
from the quadgraphs trimming the search space so enormously that the
search goes much faster.  With this much speed, I can tune the opening
book against the entire dictionary!
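Here is a sketch of what the positional quadgraph filter amounts to
inside the search; the table name and layout are assumptions, and the
real tables are shared across word lengths but specific to position,
as just described.

   #define QPOS   6                  /* a 9-letter word has 6 quadgraphs */
   #define IDX(c) ((c) - 'A')

   extern unsigned char quad_ok[QPOS][26][26][26][26];

   /* A candidate survives only if every overlapping 4-letter window is
      a quadgraph that is valid for its position. */
   static int quads_allow(const char *w, int len)
   {
      int p;
      for (p=0; p+3 < len; ++p)
         if (!quad_ok[p][IDX(w[p])][IDX(w[p+1])][IDX(w[p+2])][IDX(w[p+3])])
            return 0;
      return 1;
   }

   /* Inside build(), only the newest window needs checking as each
      letter i is tried at position dist:
         if (dist >= 3 &&
             !quad_ok[dist-3][IDX(str[dist-3])][IDX(str[dist-2])]
                             [IDX(str[dist-1])][IDX(i)])
            continue;                                              */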
In fact, deepmoo now goes so fast that I added a (noticeably slower)
mode which, during the opening book, goes ahead and scans the search
space to see if there's only one constructible word that matches the
opening book feedback so far; if so, it aborts the book even earlier
than normal.  This is only a small savings over the basic positional
quadgraph code, though; the quadgraph code by itself already saved
around 30% of guesses on average.

Unfortunately, I only get 25K of storage, and I can't get the
positional quadgraphs down below 40-60K.

Current results; improvements are marked with a *.  I expect 4-letter
words will do better with custom trigraph support.  (Need to avoid
using a list of '4-letter strings' for 4-letter words.)

               older version without short-circuiting opening book
               |     previous (non-trigraph) version
               |     |     quadgraph version
               |     |     |
   BABY        11    11    10*
   STATE        9     9     7*
   CHORAL      11     9*    7*
   ENVIOUS     13    13     8*
   ACQUIRES    13    13     9*
   HIERARCHY   19    19     8**
   FLOOD       12    12     8*
   RHYTHM      13    13     8*
   STEROID     13    10*    6*
   QUIZZES     12    12     8*
   DANGEROUS   13    12*    7*
   SYZYGY      10    10     6*

The aforementioned "abort the book early if only one word fits" seems
to fire on about 1 out of 20 words, but it didn't fire on any of
these.

Next up is figuring out how close to that I can get within the space
limitations, and then tuning the opening book.

Here are some actual games it plays.  Note that on many of them it
simply guesses correctly on the first guess after it stops the opening
book (which suggests I may want to shorten the opening book); I've
marked the first non-opening-book guess with a '*'.

   Guessing 'STATE'            Guessing 'CHORAL'
    1 ETAOI 2 2                 1 ETAOIX 0 4
    2 NSHRD 0 4                 2 NSHRDZ 1 4
    3 LUMPC 0 5                 3 LUMPCQ 0 4
    4 GWYBF 0 5                *4 SECROP 1 3
    5 JKQVX 0 5                 5 SOULDE 0 4
    6 ZTNDG 1 4                 6 CAIRLS 2 2
   *7 STATE 5 0                 7 CHORAL 6 0

   Guessing 'ACQUIRES'         Guessing 'HIERARCHY'
    1 ETTAAAAA 0 2              1 ETTAAAAAA 1 2
    2 NOIINNNN 0 6              2 OIINNNNNN 1 7
    3 SHHRRRRR 1 2              3 SHHRRRRRR 2 1
    4 DLLUUUUU 1 3              4 DLLUUUUUU 0 9
    5 MPPCCCCC 0 3              5 MPPCCCCCC 1 3
    6 GWWYYYBB 0 8              6 GWWYYYYBB 0 5
    7 FFJJKQXZ 0 7              7 FJJKKKQXZ 0 9
   *8 CASKERIU 1 1             *8 HIERARCHY 9 0
    9 ACQUIRES 8 0

   Guessing 'QUIZZES'          Guessing 'SYZYGY'
    1 ETTAAAA 0 6               1 ETAOIX 0 6
    2 OIINNNN 1 5               2 NSHRDZ 0 4
    3 SHHRRRR 0 6               3 LUMPCQ 0 6
    4 DLLUUUU 0 3               4 GWYBFV 0 4
    5 MPPCCCC 0 7               5 JKXZQV 0 5
    6 GWYBFJK 0 7              *6 SYZYGY 6 0
    7 VQXXZZZ 1 3
   *8 QUIZZES 7 0

---------------------------------------------------

Feb 23, 1999

No positive data to report.  The smallest I've managed to compress the
data down to is about 40K, which zip manages to reduce to something
like 24782 bytes, but this isn't really meaningful since those are
8-bit bytes, and I have to encode into 6-bit characters.  (The 40K is
already using 6-bit characters; zip's compression scheme is good
enough to compress random 6-bit data down by 1/4.)

I came up with the original scheme for compressing quadgraphs way back
when I was originally playing with trigraphs.  Let's start with
trigraphs: the naive approach is to notice that some rows of the table
(a giant 3-dimensional 26x26x26 table) are all 0s, and omit them.
Interestingly, a row (say the 'kc' row) is non-empty exactly when
there is some valid trigraph 'kc*'; 'kc' is *only* going to be a valid
digraph if that row is non-empty, so it's an if-and-only-if situation.
Now, for non-empty rows, you still have to encode all 26 entries,
which is kind of wasteful.  But, hey, there's symmetry here.  'kca' is
only a valid trigraph if 'kc' is a valid digraph and 'ca' is a valid
digraph!  Moreover, this works in a cool positional way.
'kca' can only be a valid trigraph in the fourth position if 'kc' is a
valid digraph in the fourth position, and 'ca' is a valid digraph in
the fifth position.  So, if we know the valid digraphs, we can easily
cull out a bunch of trigraphs.  Now, any time we get a combination
that looks valid from the digraphs, e.g. 'kca', we still have to store
a bit which says whether it is *truly* a valid trigraph.  (It is,
because of 'bookcase', although it kinda sucks when you generate a
five-letter word ending in 'okc', or a six-letter word ending in
'kca', which is why ideally you'd separate out this data for each word
length, but that's just WAY TOO MUCH data.)

Quadgraph compression can naturally build on trigraph compression in
the same way:

   BOOK = BOO + OOK
   OOKC = OOK + OKC
   etc.

   BOOKCASE    BOOKCASE    BOOKCASE
   BO          BOO         BOOK
   OO          OOK         OOKC
   OK          OKC         OKCA
   KC          KCA         KCAS
   CA          CAS         CASE
   AS          ASE
   SE

Now, compressing things this way, I'm culling out a lot of 0's, that
is, a lot of non-quadgraphs.  But I'm not culling out any 1's; I still
have at least 1 bit per original 1 bit.  Is this problematic?  The
data I'm storing (6*26*26*26*26 quadgraph bits) is about 2.4M bits.
Of those bits, about 41K are non-zero.  I get 6 bits per character in
source code, so just counting the 1 bits, I'm allowed about 3 stored
bits per 1 bit.  Tight, but not infeasible.

The above encoding, however, is not tight enough.  It averages around
4 bits per 1 bit, *plus* I have to store all the trigraph and digraph
tables, which add up.

So, here are some clever schemes.  Up until this point, I was encoding
the quadgraphs based on trigraphs generated from the original data,
not on trigraphs derived from the quadgraphs themselves.  (They would
be basically the same.)  If I want to merge some quadgraph data
together, e.g. use the same quadgraph info for two positions, I'll
need to similarly merge the trigraph info.  It might be better to
simply generate, independently, the trigraph info needed for each
quadgraph table.

Anyway, the quadgraphs were still just too sparse.  Then I had this
clever idea.  Up to now, I was storing a bit for 'OKCA' iff 'OKC' was
a trigraph in the first position and 'KCA' was a trigraph in the
second position.  Let's write these trigraphs relative to the
quadgraph for a moment: I had a trigraph 'OKC*' and a trigraph '*KCA';
therefore a quadgraph 'OKCA' was also plausible.  Here's the sneaky
trick: maybe I could store "trigraphs" of the form 'OK*A' and 'O*CA'
as well, and only encode a bit if ALL FOUR are valid.  This lets me
squeeze out even more 0 bits, though of course every 1 bit will still
need a 1 bit.

Guess what?  It worked!  I got the fully-positional quadgraph data
down to about 18K of source!  zip compresses it to 13K!

Except...  All of those new pseudo-trigraphs aren't shared between
multiple quads, so there are now 3 times as many trigraphs.
Additionally, the new trigraphs, which have "holes" in them, no longer
represent sound linguistic patterns.  "SYZYGY" generates 'S*ZY' and
'Y*YG' and all sorts of bizarre "trigraphs" that themselves don't
compress very well.  The best results I've gotten come from using one
or the other (i.e. 'O*CA' or 'OK*A', but not both); this squeezes the
quadgraphs down without adding *too* much trigraph data.  This
produces the aforementioned 40K, 25K zipped.
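To pin the scheme down, here's a rough sketch of the decoding side in
its basic (pre-holey-trigraph) form; a quadgraph bit exists in the
packed stream only for combinations that both overlapping positional
trigraphs allow, so the decoder walks the candidates in the same order
the encoder did.  The names, shapes, and next_bit() are illustrative,
not the exact code.

   #define QPOS 6      /* quadgraph positions in a 9-letter word */

   extern unsigned char tri_ok[QPOS+1][26][26][26]; /* positional trigraphs */
   extern int next_bit(void);          /* pulls one bit from the packed data */

   /* Rebuild the quadgraph table for position p.  A bit was stored for
      'abcd' only when 'abc' is a valid trigraph at p AND 'bcd' is a
      valid trigraph at p+1; everything else is known to be 0 and costs
      nothing to encode. */
   static void decode_quads_at(int p, unsigned char out[26][26][26][26])
   {
      int a, b, c, d;
      for (a=0; a < 26; ++a)
         for (b=0; b < 26; ++b)
            for (c=0; c < 26; ++c)
               for (d=0; d < 26; ++d)
                  out[a][b][c][d] =
                     (tri_ok[p][a][b][c] && tri_ok[p+1][b][c][d])
                        ? next_bit() : 0;   /* bit stored only if plausible */
   }

The holey-trigraph refinement just adds more conditions of the same
shape to that test (and, correspondingly, more tables to store).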