I define interactivity as “A cyclic process in which two agents alternately listen, think, and speak.” This definition uses the terminology of a conversation, which is no accident: conversation is far and away the most common form of interaction. Language is the most common medium of interaction. It is pernicious that we use language so rarely in interacting with computers. Fortunately, this is a temporary problem; I have no doubt that human-computer interaction will eventually take place primarily via language. Here’s the proof:
For now, though, we have a ridiculous system of buttons, menus, sliders, and other paraphernalia of the graphical user interface. At least it’s better than the more ridiculous command line interfaces of ancient times (still used by many programmers) and the even more ridiculous binary switch inputs of the early microcomputer years.
But the graphical user input systems used in games are incompetent for use in interactive storytelling. I did some research a few decades ago suggesting that graphical user interfaces start losing their effectiveness when the application needs more than about a hundred verbs. I have also estimated that an interactive storyworld will need at least 100 verbs to be effective. In other words, GUIs are inherently incompetent for interactive storytelling. We require the use of language for interactive storytelling.
Of course, we’ll never get genuine language comprehension from a computer, because language mirrors reality. The language we use reflects the reality in which we live.
This is easiest to see in the vocabulary of peoples living in different environments. For example, desert dwellers have few words for frozen water. The Inuit people of Arctic regions can assemble a great many words to describe frozen water in its many different manifestations.
Language is always in flux. Nowadays, not many people say, “keen-o!”, “far out!”, or “rad!” to express appreciation, nor do many people use “bread”, “moolah”, or “green” to describe money. The news spawns all manner of short-lived terms reflecting the changing world. No computer can keep up with all that change unless it hangs around the water cooler with everybody else. (Do you think that a computer could ever understand the previous sentence?) Or here’s a sentence that is far beyond the ken of any computer for the foreseeable future:
“When I finished a business trip prematurely and got home at 3:00 AM, and saw my best friend’s car parked in my driveway, I resolved to get my gun on the way to the bedroom.”
Think about the information about human behavior that a computer would need to comprehend that sentence.
Still, we can get away with using a simple version of natural language with our computers. We can ask our smartphones for the nearest Thai restaurant, but not why politics is such a mess. The problem is especially difficult with interactive storytelling because drama requires emotional and social interactions, which are particularly complex.
Fortunately, the solution has been with us from the beginning of computers. We don’t try to build reality into our computers; we instead build a toy version of reality. In games, people don’t pause to go to the bathroom, or trip over junk on the floor, or stop to tie their shoes. We strip away all the complexities of reality and zero in on a simplified reality that focuses the player on the gameplay. We must do the same thing with interactive storytelling. We must build simplified languages that address only the interactions we want our players to experience.
Deikto
My general solution to the problem of language is Deikto, a kind of template for language. It strips away much of the complexity of natural language but retains the fundamental expressive capabilities. The basic design principles for Deikto are:
Vocabulary size
In the early years, computers used command line interfaces. The user would type in a cryptic set of letters that would mean something to the computer. The computer would parse the command and figure out what to do. All too often, a misplaced letter or punctuation mark would generate a syntax error. Worse, the user had to memorize all the commands as well as their syntax structures. If you were a professional programmer, memorizing all that crap was part of developing your expertise. If you just wanted to use the damn computer, it was a pain in the butt. In general, command line interfaces had only a few dozen verbs; it was just too difficult to memorize many more.
The graphical user interface invented at Xerox PARC and popularized by the Macintosh opened up new possibilities. Its graphical structure made syntax errors impossible, and it could handle up to about a hundred verbs fairly easily. Graphical user interfaces reached their upper limits at around 300 verbs.
Interactive storytelling will require us to move to larger vocabularies, for which only a linguistic user interface can suffice.
Deikto is a metalanguage, not a language
Deikto is a system for creating a language. Deikto itself contains no verbs, no nouns, nothing; those are created by the storyworld author inside Deikto. Deikto defines a very simple form of language in which the verb is the central component. Every verb can take up to fifteen additional words. These could be direct objects, indirect objects, prepositional phrases, adverbial phrases, and so forth. However, all of these components must fit into a few narrowly defined types of words: Actors, Verbs, Stages, Props, Actor Traits, Verb Traits, Prop Traits, Stage Traits, and Quantifiers. Much of what natural language does with prepositional and adverbial phrases can be handled with this system.
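The structure described above can be sketched as a small data model. This is only an illustration, not Deikto's actual implementation: the nine word types and the fifteen-word limit come from the text, while the names WordType, Socket, VerbDefinition, and the socket labels for the example verb are assumptions made up for the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class WordType(Enum):
    """The nine word types named in the text."""
    ACTOR = "Actor"
    VERB = "Verb"
    STAGE = "Stage"
    PROP = "Prop"
    ACTOR_TRAIT = "Actor Trait"
    VERB_TRAIT = "Verb Trait"
    PROP_TRAIT = "Prop Trait"
    STAGE_TRAIT = "Stage Trait"
    QUANTIFIER = "Quantifier"


MAX_EXTRA_WORDS = 15  # "every verb can take up to fifteen additional words"


@dataclass(frozen=True)
class Socket:
    """One slot in a sentence (hypothetical name and layout)."""
    label: str            # display label, e.g. "Direct Object"
    word_type: WordType   # the only kind of word allowed in this slot


@dataclass
class VerbDefinition:
    """A verb created by the storyworld author, plus its word slots."""
    name: str
    sockets: List[Socket] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.sockets) > MAX_EXTRA_WORDS:
            raise ValueError(
                f"{self.name}: at most {MAX_EXTRA_WORDS} additional words allowed"
            )


# Hypothetical author-defined verb; the socket labels are illustrative only.
gossip = VerbDefinition("gossip", [
    Socket("Direct Object", WordType.ACTOR),
    Socket("About", WordType.ACTOR),
    Socket("Trait", WordType.ACTOR_TRAIT),
    Socket("Magnitude", WordType.QUANTIFIER),
    Socket("Confidence", WordType.QUANTIFIER),
])
```

The point of the sketch is that the metalanguage supplies only the types and the slot structure; every actual verb, actor, and trait is data supplied by the author.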
Symmetric expression
Our interaction with computers is asymmetric: we speak to the computer through keyboard and mouse, and it speaks back to us with images and sounds. This is both unnatural and wasteful. It’s rather like having a conversation with somebody in which you speak English and they respond in Spanish. You must each learn two languages. We’ve been stuck with this dumb arrangement because our computers are too stupid to handle linguistic user interfaces. Nowadays, though, computers are getting to be smart enough to handle simple linguistic processing. It’s time to make the move from graphical user interfaces to linguistic user interfaces.
You might wonder how we can obtain symmetry when the computer and the human are so different. We use eyes and ears; the computer uses keyboards and mice. The intersection of these two sensory systems is the image on the screen. The computer displays images and we click on them with our mice. We can use menu systems to permit fairly rich interactions. Here’s a screen shot of the Siboot storyworld, which uses an iconic language:
In the upper panel, the actor Zubi has just greeted the player (the lower character) with much sincerity (the green icon means honesty or sincerity, and the three-lobed icon is a quantifier, in this case having a value of +2 on a scale from -3 to +3).
In the lower panel, the player has begun her response by offering gossip to Zubi about somebody. Here we see the menu of possible people that the player could offer gossip about. Each of the actors about whom the player could gossip is represented by an icon. The player need only click on an icon to select the actor about whom to gossip. As the player builds the sentence, additional menus appear automatically. Here’s a complete sequence of menus showing how a long sentence is built:
Camiggdo must decide what to do.
Camiggdo has six verbs from which to choose.
Camiggdo decides to gossip to Zubi. Note how the language automatically fills in the Direct Object, Zubi. That’s because there’s nobody else to gossip to, so the language doesn’t bother giving the player a menu with just one item; it automatically selects that one item for the player. Now Camiggdo must decide whom to gossip about.
Camiggdo must choose from one of these three actors. The language does not permit Camiggdo to gossip about either herself or Zubi, because that would make no sense in this situation. Only these three actors are “Acceptable”.
Camiggdo decides to gossip about Skordokott. But now she must decide what personality trait she’ll gossip about.
There are three personality traits to gossip about: Timid_Dominant (red), Faithless_Honest (green), or Bad_Good (blue).
Camiggdo decides to gossip about Skordokott’s Faithless_Honest trait. Now she must declare its magnitude.
Here Camiggdo gets to choose among seven quantifiers, ranging from -3 through 0 to +3. They really should be thought of not as numbers but as phrases: “very negative”, “negative”, “slightly negative”, “zero”, “slightly positive”, “positive”, “very positive”.
She decides to say that Skordokott’s Faithless_Honest value is “negative”. But there’s still one more word she must add to complete the sentence:
This word represents how confident she is of her assessment. Perhaps she doesn’t want to make too firm a statement in case it comes back to her later. These five words represent the five levels of confidence she can declare in her statement about Skordokott.
This is the complete sentence. It says “Camiggdo tells Zubi that Skordokott has negative Faithless_Honest, and Camiggdo is fairly certain of this.”
This demonstrates a number of features of the Deikto system: the symmetry of expression (the computer speaks to the user in the same language that the user uses to speak to the computer); the use of menus; and the smartness of those menus, which automatically prune out entries that are not acceptable under the immediate circumstances. This pruning is done using the “Acceptable” feature of the Sappho scripting language. Only words that are calculated in a Sappho script to be Acceptable show up in the menus.
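The menu behavior in the walkthrough (prune entries that are not Acceptable, and skip the menu entirely when only one entry survives) can be sketched roughly as follows. This is Python standing in for a Sappho script, whose real syntax is not shown here. The function names, the rule that the listener must be on the same stage, and the placeholder actors "ActorD" and "ActorE" are all assumptions for illustration.

```python
from typing import Callable, List

Actor = str
# Stand-in for an "Acceptable" script: (speaker, listener, candidate) -> bool
AcceptableFn = Callable[[Actor, Actor, Actor], bool]


def build_menu(candidates: List[Actor], speaker: Actor, listener: Actor,
               acceptable: AcceptableFn) -> List[Actor]:
    """Keep only the candidates that the Acceptable test lets through."""
    return [c for c in candidates if acceptable(speaker, listener, c)]


def choose(menu: List[Actor], pick: Callable[[List[Actor]], Actor]) -> Actor:
    """Auto-select a lone entry; otherwise let the player pick from the menu."""
    if not menu:
        raise ValueError("no acceptable words for this socket")
    if len(menu) == 1:
        return menu[0]          # no point showing a one-item menu
    return pick(menu)


def gossip_target_ok(speaker: Actor, listener: Actor, candidate: Actor) -> bool:
    """Gossip may not be about the speaker or the listener."""
    return candidate not in (speaker, listener)


cast = ["Camiggdo", "Zubi", "Skordokott", "ActorD", "ActorE"]  # last two are placeholders
present = ["Camiggdo", "Zubi"]       # assumed: only these two share the stage
speaker = "Camiggdo"

# Direct Object: only Zubi is available, so it is filled in automatically.
listener_menu = [a for a in present if a != speaker]
listener = choose(listener_menu, pick=lambda menu: menu[0])

# Gossip target: three actors survive the pruning, so a real menu is shown.
target_menu = build_menu(cast, speaker, listener, gossip_target_ok)
target = choose(target_menu, pick=lambda menu: menu[0])  # simulated click on first icon
```

Here `pick` plays the role of the player's click; in a real interface it would display the pruned menu and wait for input.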
Display options
The Deikto system does not presume the graphical system used to present the language. The above example used icons. Any system of icons could be used to represent words. Deikto can just as easily use text for its sentences. I developed a nice technology for this that I call “tinkertoy text”. I have written about it here. Sappho has a bunch of nice operators for customizing tinkertoy text.
Shortcomings of Deikto
Deikto is meant to provide a simplified version of natural language; therefore, it cannot do everything that natural language can do. Its greatest shortcoming is its inability to handle recursion, as in this sentence:
The butcher shouted at the dog who stole the steak that the baker had just gotten from the delivery boy who was late.
In general, Deikto cannot handle multiple clauses. {Correction, July 11th, 2020: I am developing a system that will permit Deikto sentences to handle multiple clauses. See “Nesting Clauses”} However, it can nest clauses that are explicitly specified by the main verb. For example, the verb ‘promise’ could be set up with word sockets to include a second verb and an object, like so:
{Subject} {promises that he will do} {verb} {using prop}.
However, it usually makes more sense to conflate the second verb with the first, like so:
{Subject} {promises to give} {Direct Object} {prop}.
{Subject} {promises to go to} {stage}.
{Subject} {promises to tell} {Direct Object} {quantifier} {Actor Trait} of {Actor}
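As a rough illustration of how such socket templates might be filled in, here is a minimal sketch. It assumes that braces enclose either a socket name or fixed verb text, and that quantifiers are stored as integers from -3 to +3 and displayed with the phrases given earlier. The render function and the example word assignments are hypothetical, not part of Deikto itself.

```python
import re

# Quantifier phrases as listed in the walkthrough above.
QUANTIFIER_PHRASES = {
    -3: "very negative", -2: "negative", -1: "slightly negative",
    0: "zero",
    1: "slightly positive", 2: "positive", 3: "very positive",
}

BRACED = re.compile(r"\{([^{}]+)\}")


def render(template: str, words: dict) -> str:
    """Fill each braced socket from `words`; braced text that is not a
    socket name (such as the verb phrase) is emitted as-is."""
    def fill(match: re.Match) -> str:
        name = match.group(1)
        if name not in words:
            return name                      # fixed text, e.g. "promises to tell"
        value = words[name]
        if isinstance(value, int):           # quantifiers are shown as phrases
            return QUANTIFIER_PHRASES[value]
        return str(value)
    return BRACED.sub(fill, template)


template = ("{Subject} {promises to tell} {Direct Object} "
            "{quantifier} {Actor Trait} of {Actor}")
sentence = render(template, {
    "Subject": "Camiggdo",
    "Direct Object": "Zubi",
    "quantifier": -2,
    "Actor Trait": "Faithless_Honest",
    "Actor": "Skordokott",
})
print(sentence)
# Camiggdo promises to tell Zubi negative Faithless_Honest of Skordokott
```

Because the second verb is folded into the first verb's fixed text, the renderer never has to deal with a nested clause; it only fills flat sockets.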
Deikto also cannot handle adjectival or adverbial modification in any manner other than intensification through quantifiers. For example, Deikto can adverbially modify the verb ‘throw’ with quantifiers denoting how fast the projectile is thrown. It can adjectivally modify traits (actor, stage, or prop) that are scalar values, such as weight, size, height, value, and so forth. Again, quantifiers permit such denotations. Examples:
{Subject} {gives} {Direct Object} {prop} {prop trait} {magnitude}.
Joe gives Mary the book expensive slightly.
{Subject} {punches} {Direct Object} {magnitude}.
Joe punches Fred hard.
However, it cannot modify anything by style or type or any other non-scalar dimension. Examples:
John kisses Mary {passionately|on the cheek|quickly}
John questions Mary {curiously | suspiciously | lazily}
Mary tosses the ball to John {angrily | insouciantly | laughingly}
Natural Language and the Future
There’s no question as to the future of language interaction in interactive storytelling: natural language is the way to go. Natural language processing is already impressive; products like Siri, Alexa, and Cortana demonstrate what it can already do. In the field of interactive storytelling, Spirit AI has a strong natural language system built into its Character Engine. There’s no question that someday these efforts will be usable for interactive storytelling, and when that day comes, hacks like Deikto will be obsolete.
The problem is that “someday” isn’t today. These products aren’t yet good enough. They are certainly good enough to recognize words enunciated by different speakers, and they also do a good job of making sense of many sentences. Spirit AI can even recognize emotional intonation, a crucial step forward. The killer problem lies in the verbs.
It’s not too difficult to equip a natural language system with a fully operational dictionary of nouns, pronouns, adjectives, and prepositions. These words are all readily definable in computational terms because they are ultimately about Objects. But verbs require Process-definitions. A verb must have algorithms defined for it. Verbs describing the manipulation of Objects are easy to handle: you simply define what attributes of the Object are changed by the execution of the verb. But when we get down to the core problem—dramatic interaction between actors—the Processes that we must define are not easily reducible to algorithms. Certainly no standard algorithms spanning all possible situations can exist. Therefore, those algorithms will have to be created individually, not stamped out. That’s the nature of art.
I expect that the future of interactive storytelling will indeed utilize natural language systems, with two constraints:
1. The verbs available to the user will be constrained; the user will be prompted as to verbs that are appropriate.
2. The behavior of each verb will have to be defined algorithmically by the artist.
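The two constraints can be illustrated with a small sketch. Everything in it is hypothetical: the Verb record, the state keys, and the two example verbs are invented for illustration. "give" is an object-manipulation verb whose effect is just an attribute change; "promise" stands in for a dramatic verb whose effect is a hand-written rule chosen by the artist.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class Verb:
    name: str
    available: Callable[[State], bool]   # constraint 1: is the verb appropriate now?
    execute: Callable[[State], None]     # constraint 2: artist-written algorithm


def give_execute(state: State) -> None:
    """Object manipulation: simply change an attribute of the Object."""
    state["book_owner"] = "Mary"


def promise_execute(state: State) -> None:
    """Dramatic interaction: an arbitrary rule the artist decided on.
    Here a promise raises trust by one step, capped at +3."""
    state["mary_trusts_joe"] = min(3, state["mary_trusts_joe"] + 1)


VERBS: List[Verb] = [
    Verb("give", lambda s: s["book_owner"] == "Joe", give_execute),
    Verb("promise", lambda s: True, promise_execute),
]


def prompt(verbs: List[Verb], state: State) -> List[str]:
    """Offer the user only the verbs that are appropriate in this state."""
    return [v.name for v in verbs if v.available(state)]


state: State = {"book_owner": "Joe", "mary_trusts_joe": 0}
print(prompt(VERBS, state))   # ['give', 'promise']
VERBS[0].execute(state)       # Joe gives Mary the book
print(prompt(VERBS, state))   # ['promise']
```

The rule inside `promise_execute` is arbitrary, which is the point: no standard algorithm decides what a promise does to a relationship, so each storyworld author has to write one.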