Synthetic Creations

Synthetic Shakespeare
Some years back I saw an article in Scientific American that triggered some thoughts about algorithms. The article was about an experiment that somebody performed with the works of William Shakespeare. They assembled a great deal of Shakespeare’s text in a huge file. Then they built frequency tables for each of the characters in the alphabet. Thus, they found that the letter "e" occurred x% of the time, and the letter "z" occurred y% of the time, and so on for each letter. Since they counted spaces and common grammatical symbols, they ended up with, say, 32 entries in their frequency table.

The next step was to prepare a frequency table for each pair of entries in the first table. Thus, they counted the number of times that the pair "aa" occurred (very rare, that pair), how many times "ab" occurred, and so on through "zz" and the various grammatical symbols. This table would have been 1024 entries long.

Next they prepared a frequency table for triplets, 32K entries long. And then one for quadruplets, one million entries long. Then followed even bigger frequency tables -- I think that they went as high as octuplets, but this would have required a staggering amount of storage space -- one terabyte!

Once all these frequency tables had been built, the researchers were in a position to turn them around. A sequence of random numbers could be used to select letters at random, but the choice of letters would be weighted by the frequency tables. For example, suppose that we have already built the string "bal". We must now choose the next character. Since our string is only three characters long so far, we consult the table of frequencies of quadruplets. We find absolutely zero instances of the strings "balq", "balx", "balz"... but we do find instances of "bala", "balb", "balc", "bald", "bale", "balf", and so on. Thus, we might choose any letter from a through f, but we will surely not choose a q, x, or z.

Note that this technique is not restricted to single words; entire sentences can be built using the method. After all, the end-of-word character (a space) has its own frequency tables. Thus, the substrings "ble", "tion", and "ing" will have a higher frequency of being followed by a space.

The researchers found that strings built by frequency tables using these methods were remarkably "Shakespeare-like". They didn’t make much sense, but they sure had a lot of Elizabethan panache. It reminds me of that hilarious Danny Kaye routine in which he stands up to make a speech at a diplomatic banquet and speaks "synthetic French", stringing together nonsense French-sounding syllables with excited gestures, strong intonation and deep feeling. Remember that the researchers had to use a terabyte of memory to build their frequency tables. Danny Kaye starts to look pretty impressive, doesn’t he?

Synthetic Province Names
So what does this have to do with games? There are a number of applications. I used the method to create random province names for Guns & Butter. Often computer games need to generate lots of names for islands, countries, planets, ships, people, or whatever. Most of the time we simply store a few score names in a big table and then randomly choose names from the table. This works, I suppose, but the procedure described above makes possible better results.

In the case of
Guns & Butter, I started off with a list of Mongol and Turkish names. It was easy to get; I hired somebody to type in all the names from a huge index in a scholarly book about Central Asia. At runtime, I read in all the names and build my frequency tables in RAM. I only went as far as letter triplets. This was an easy decision. Quadruplets would have required two megs of RAM for the frequency table. Triplets required only 64K (and this can be reduced if you restrict yourself to just the 26 letters of the alphabet and a space character then the total drops to under 40K.)

With my frequency tables all set up, it was an easy matter to spew out names. The routine for the whole job (reading the file, building the frequency tables, and generating the names) required just 150 lines of Pascal. Piece o’ cake.

How good are the results? All in all, passable. By resorting to an exotic language phoneme, I was able to get away with minor flaws. If real Mongol-speakers were to look at my province names, they would surely gag. But to an English speaker, they have a consistent flavor that seems believable. And in looking at thousands of generated province names, I have noticed only a few duplications. Here are some of the names generated by my algorithm:

Kabinongot, Yushin, Kamangri, Bakirksh, Ogalymamus, Malda, Tuna, Turte, Bogdigizer, Jakset, Murbihuzde, Imuji, Aniboorkha, Vigizirgin, Drot, Shaisakhat, Kerigin, Toqusecbad, Yumyshuksa, Yatoquc, Botoqtambu, Mohi, Tushai, Vogarnis.

This list may seem unnoteworthy. But look closely; it has some impressive traits. First, all the words work; they are all pronounceable. They all have reasonable ratios of consonants to verbs, in roughly the correct relationships. They are all of roughly the correct length. Lastly, they all share a Mongol-Turkish flavor.

I now realize that the algorithm could have been improved by adding reverse frequency calculations. My original algorithm calculated only forward frequencies. That is, if I had the string "bal", I would determine the next character by consulting the frequency table for the string "al?", where "?" denotes one of the 27 characters that I might append to the string. This created a problem with excessively long words; occasionally the algorithm would generate words of unacceptable length. To counter this, I simply cut the algorithm off at ten characters.

A better way would have been to look ahead once the string reached eight characters in length. The improved algorithm would ask, "What three-letter string ending in a space can be matched to the end of the word we are buidling?" In other words, instead of looking for frequency table entries of the form XX?, we look for frequency table entries of the form ?XX. It’s a reverse match instead of a forward match.

An even better algorithm would have given equal weight to the reverse algorithm, with the forward algorithm dominating the first part of the word, the reverse algorithm dominating the end of the word, and the two algorithms gray-merging in the middle.

A small problem arose with the possibility of vulgar, obscene, or racist terms being generated. I had to insert a routine that checked the output against the inclusion of substrings that might be offensive to some of my players. Unfortunately, the list of potentially sensitive substrings is long, and I checked only the most serious. Someday I’ll get an outraged letter...

The use of an exotic language phoneme covered many flaws. I tried to use English countryside place names, and the results were dismal. When "Warwickshire" comes out as "Sharewarckwir", well, it just doesn’t sound right. I didn’t try to do anything with French or German names, but it should have worked.

This scheme need not be restricted to the creation of names. It can be applied to any set of sequences of items chosen from a small set. Music springs to mind. If you treat music as sequences of notes, then you could build frequency tables for all the symphonies of, say, Beethoven, and then spit out "synthesized Beethoven". Exciting as this may sound, it probably won’t work. I spoke with Tod Rundgren about it and he pointed out that the notes are a small part of the total information content of a musical score. Moreover, with multiple notes playing simultaneously, the problem very quickly becomes very complex. According to Tod, such schemes have been tried with little success.

There are other possibilities. Here are some silly ones. The object code of a computer program might provide us with the raw material for "synthetic software" wouldn’t that be fun? Or perhaps we could create "synthetic intelligence" by recording all the moves of an expert player of a game over many playings, building frequency tables, and then synthesizing move sequences.

The concept can be extended into two dimensions to build two-dimensional frequency tables for certain types of images. Perhaps we could synthesize images of a given class from a frequency table. I suspect that for this to work, the images will have to be represented as sequences of constructs at a higher level than mere pixels. A binary frequency table offers no real information content. Visual structures with repetitive elements (maps, for example) can be synthesized if we choose the right kind of "visual atoms" for which to build our frequency tables.