Fundamental to the interactive cycle is the notion of segmentation. The interaction is broken down into a sequence of discrete steps. During each step, the user listens to what the machine has to say, thinks about it, and expresses his reaction back to the machine. We can refer to these steps as "segments" or, to use game terminology, "turns." The purpose of this essay is to focus on the problems of connecting the steps. This problem raises all sorts of secondary issues, so let me define it carefully.
At any point in the experience, we find ourselves at a node in the gametree. A number of options lie before us. We must choose an option and jump to the node indicated by that option. There are two fundamental issues associated with this process that have profound implications for the design of interactive entertainment: time cost and segment joining.
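To make the node-and-option picture concrete, here is a minimal sketch in Python. It is only an illustration: the class name Node, the function choose, and the placeholder node names are assumptions of mine, not part of any existing system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node in the gametree: a name plus the options leading onward."""
    name: str
    options: dict = field(default_factory=dict)  # option label -> next Node


def choose(node, option):
    """Jump to the node indicated by the chosen option."""
    return node.options[option]


# Hypothetical two-level tree; the names are placeholders for illustration.
end_a = Node("result of A")
end_b = Node("result of B")
start = Node("X", {"A": end_a, "B": end_b})

current = choose(start, "B")
print(current.name)
```

Both issues discussed below live in the call to choose: the time the audience spends before making it, and how well the node it returns joins onto the node it left.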
The time cost of choice
The first of these arises from the act of choice. Some portion of the decisions (presumably half) will be made by the audience. The critical observation is that this process takes time. The audience needs to think over its options before committing to a course of action.
We could, of course, refuse to allow the audience any time to think things over. After all, this makes our job easier in so many ways. We don’t have to worry about how long the audience takes; the product just marches forward without them. This is the standard approach used in most skill & action games. The action keeps on going with or without the player.
This design approach has a serious drawback: it limits the challenge to problems that the audience can solve in a fraction of a second. Such problems tend to be simple matters of geometric relationships. Can I make it out the door before the bomb explodes? If I take a running start, can I jump over the chasm? Such decisions fall low on the hierarchy of human thought. They do not challenge the most interesting parts of our minds. Most adults find such problems uninteresting -- which is why skill & action games are appreciated primarily by children.
It is possible to soften the edge of this problem by granting the audience a bit more time. We slow the clock down, put in a delay, or otherwise extend the amount of time permitted the player. But this is only a feeble amelioration, not a true solution, for the amount of time necessary to truly engage our most human mental faculties is variable and can easily extend up to several minutes per decision. Granting a few extra seconds still confines our designs to relatively simple challenges.
The only way to assure that the audience has all the time it needs to tackle problems that truly challenge the full range of faculties is to give the audience as much time as it needs. This is the classic turn-sequenced approach to game design. The player of such a game takes as long as he wishes, then announces the completion of his turn. This solution, however, introduces a new problem. It destroys any hope of a smooth or steady evolution in the progress of the experience. Instead, the entertainment proceeds in a jerky fashion, moving swiftly while the ball is in the computer’s court, then grinding to a complete halt when the ball moves to the human’s court.
Thus, we have a fundamental issue here: realtime action is inimical to full human engagement. If we want to give the player challenges that require his fullest and most human involvement, we have to give him the time to bring his full character and personality to bear.
The problem is exacerbated by the clumsy nature of our input structures. It is entirely possible that a system with voice recognition and natural language processing could permit the player to work in realtime with emotionally challenging material, but when we force the player to manipulate a clumsy and indirect user interface to get anything done, any thought of maintaining realtime action goes out the window. We can’t expect smooth realtime play until we have a smooth realtime user interface.
Mating Segments
The second nasty problem that arises with segmentation is the mating of the result with the previous node. Here’s the audience, having arrived at node X, and it must now choose between options A, B, C, and D. It settles on option C. Somehow, the designer must ensure that the presentation of option C flows smoothly from the presentation of node X. This requirement (and the previous one) will make the much-anticipated use of video in interactive entertainment far more difficult than many people imagine.
Examples
Because this problem is so esoteric, I will use an example to demonstrate the nature of the problems. Suppose that we have an interactive entertainment in which the audience is playing "Gustav", a proud and handsome aristocrat, who comes home from the wars early, bounds into the bedroom, and discovers his wife in bed with his best friend. Does Gustav A) slump his shoulders and retreat in tears; B) coldly dismiss the other man before confronting his wife; or C) fly into a towering rage and shoot both of them?
Suppose now that we were to present this situation, the options, and results in pure text. The audience sees nothing but words on the screen. How utterly boring! How technologically backward! How videoly poverty-stricken! But note this: there’s no problem with the timing or with the mating. The delay while the audience thinks over the situation causes no embarrassment or clumsiness. Whatever option the audience chooses, it mates perfectly smoothly with the precipitating situation. The use of text causes no segmentation problems whatever.
We could also use a "comic strip" approach to solve the problem. Note that comics share with text a well-segmented structure. The story proceeds in a step-by-step fashion that lends itself to the kind of segmentation that an interactive application needs. Thus, we could present static images mixed with text and the result would still work quite well.
Now let’s try the same thing with video. The audience sees Gustav riding up to the manor on his white horse as stirring music plays. He leaps off the horse and bounds up the stairs, his spurs jangling. The music swells to a crescendo as he arrives at the bedroom entry and throws open the ornate double doors, then crashes with doom as he beholds his surprised wife in the arms of his best friend.
Wow, that was impressive. Now what do we do? How do we present Gustav’s options? I suppose that we could use text in exactly the same way that we do with the pure text version. The problem with this is, it creates a disjunction with the video presentation. We saw all that exciting imagery, heard all that great music, and suddenly everything stops and we’re staring at this stupid text display. The text only serves to take us out of the fantasy, to remind us that we are not really Gustav. "We now interrupt this fantasy to give you an opportunity to make a decision." The interactive side of the entertainment does as much for the fantasy as a commercial does for a TV show.
At this point one might argue that the use of text here is exactly the same as if we were to use a pure text system throughout. Thus, the argument goes, if it works in a pure text environment, it should work just as well in a mixed environment with both video and text.
I reject this reasoning. The mixed environment suffers from inconsistency. A pure text environment sets up audience expectations and a particular fantasy environment, and it maintains those expectations and that environment with consistency. A mixed video/text environment such as that I have described jerks the audience back and forth between you-are-there imagery and you-are-not-there text. The disjunction is too jarring. Much of the power and immediacy of the video will be contradicted by the indirectness of the text. It can work, but it’ll be clumsy.
OK, if mixed video/text has problems, why not use pure video? We play out the three options in sequence. We see Gustav slump his shoulders and wander off, a broken and betrayed man. We see Gustav’s lips curl with anger as he coldly orders his former best friend out of the room. We see his face contort with rage as he pulls out his gun and shoots the screaming lovers. Then we somehow indicate which of these three scenes we wish to select. Again, this is very impressive. But do we play all three in sequence, in some sort of video menu, allowing the audience to choose the one that seems best? What would happen if there were eight choices? I could scan through eight entries on a text menu in just a few seconds, but a video menu of eight sequences could easily take a full minute to examine. This would be tedious. Moreover, the extended time necessary to review the options would destroy the momentum of the video presentation.
But there’s an even more serious problem with the pure video approach: the cost of creating all that video. In a normal movie, the scriptwriter would have chosen one of the three options in advance, and that option would be the only one filmed. But in this interactive entertainment, we would have to film all three options. The cost for this fragment would be three times the cost of the regular movie. In fact, this problem is even worse; in any real-world interactive entertainment, we would have to film thousands of such sequences, at a cost hundreds of times greater than the cost of a simple movie. An interactive movie with the production values of Jurassic Park would cost billions of dollars to produce. We can’t afford that.
What we need is some scheme for re-using sequences, some clever system that allows us to use the same sequence over and over in different settings. This re-usability of segments is the key to solving a whole range of problems with interactive entertainments. It is used in the lowliest, simplest videogames by the expedient of defining simple fundamental actions that take place over and over. But video can’t be made re-usable. Imagine the following fragment from an interactive entertainment:
"Gustav’s lips curled in anger. His eyelids narrowed. ’Get out of here!’ he commanded coldly."
This is just option B from our previous set, right? It’s one of his responses to discovering his best friend in the arms of his wife. But -- and this is a critical observation -- it could also be his reaction to a messenger who has brought him word that the army has surrendered. It could be his reaction to his subordinate telling him that the peasants on his manor want better pay. It could fit into any of a number of scenarios.
But not if it’s done with video. Video is too literal, too precise, too specific to one situation. The backgrounds would be different in the different situations I just described. In one of the situations, his anger is directed at the person he’s speaking to; in another, the person he’s speaking to is completely innocent. Gustav might be wearing a military uniform in one scenario, bedclothes in another, and working clothes in the third. He could be inside a house in one case, in a tent in another, and outside in the third. You can’t do that with video, not without shooting each scene separately. There goes re-usability.
Thus, the visual detail, the sensory precision that video offers makes it impossible to apply it to a variety of circumstances, yet re-usability is the only economically feasible way to provide interactivity.
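The difference in re-usability can be put in a few lines of Python. This is only a sketch: the dictionary keys and the function play are names I have made up for illustration, and the one-shot-per-situation count for video is the assumption argued above, not a measured figure.

```python
# The text fragment is written once and kept generic.
SEGMENT = ("Gustav's lips curled in anger. His eyelids narrowed. "
           "'Get out of here!' he commanded coldly.")

# Three situations from the discussion above that the fragment could follow.
SITUATIONS = {
    "bedroom": "Gustav found his best friend in the arms of his wife.",
    "surrender": "A messenger brought word that the army had surrendered.",
    "peasants": "His subordinate reported that the peasants wanted better pay.",
}


def play(situation):
    """Join the same text segment onto whichever situation preceded it."""
    return SITUATIONS[situation] + " " + SEGMENT


for key in SITUATIONS:
    print(play(key))

# Text needs one authored segment; video, with its differing backgrounds
# and costumes, is assumed to need one separately shot clip per situation.
text_segments_needed = 1
video_clips_needed = len(SITUATIONS)
```

The text segment mates with all three situations unchanged, while the clip count for video grows with every new situation in which we want Gustav to lose his temper.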
Incompatibility between video and interactivity
What we see here is an architectural incompatibility between video and interactivity. Video is realtime and continuous; interactivity works best with turn-sequencing and interruptions. Video is specific to the situation; interactivity works best with a generality of expression that can be adapted to the particular situation that the audience has created. Video is the most powerful means by which an artist can express his vision to an audience, but its power is derived from traits that collide with the needs of the interactive environment.
This is truly an ironic conclusion. Most people in the interactive entertainment business have a well-established system of aesthetic priorities that places video in the highest position. Video has earned this position with its vast expressive power. But the irony comes when we mindlessly assume that video must automatically provide the highest and most powerful form of interactivity. In truth, the very traits that give it power and immediacy stand in the way of powerful interaction.
There are, of course, a variety of schemes that address some of these problems. It is important to realize that all of these schemes introduce major constraints on the design. For example, it is possible to organize the video clips in a structure that makes continuous viewing possible -- if the game’s architecture is properly contrived. Sewer Shark uses such a scheme. The player moves through a maze, and video for each segment of the maze plays continuously. By constraining the player’s options to simple left/right decisions at maze junctions, the game can maintain the continuity of the video.
Sewer Shark uses another stunt for extending the utility of video: computer-generated overlays mixed with background video. This is a "hamburger helper" approach, not a genuine solution. It stretches out the utility of the video, but it doesn’t solve the basic problem that video is expensive and difficult to re-use. It doesn’t make the video any more re-usable. Moreover, overlays tend to shift the interactivity from the video to the computer-generated graphics. The video becomes little more than an ornate background for an otherwise conventional videogame.
Note that my arguments do not absolutely preclude the use of video in interactive entertainment. Perhaps "incompatible" is too strong a term to describe the problems involved in mixing video with interactivity. I am arguing that video introduces clumsy elements that detract from the power of the interaction. The first problem -- the interruption and disjunction associated with the audience choice -- seems insoluble to me, but I think we can live with it. The second problem -- the over-specificity of video -- can be partially addressed (but not solved) with a variety of limited solutions for the short term. It is imperative that we keep in mind the realization that use of video will impose severe constraints on the interactivity of a product.
For the long term, this second problem is technically solvable. All we need is the ability to generate video from components, on the fly. In other words, the software calls up a scene to order:
"We need Gustav wearing his riding clothes, mounted on the chestnut charger. Put his manservant and messenger on stage left. The background is a crossroads in forested country, on a day in the early fall, during the afternoon. Gustav is tired and dirty from a long ride. He is in an irritable mood. Now, draw and run that."
When we have computers powerful enough to pull that off, the second problem will vanish. For now, however, the canned video that we use gets in the way of the interaction.
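In software terms, such a request is a bundle of components rather than a finished clip. The following Python sketch shows only the request side; the class SceneSpec and its field names are my own invention, and no renderer capable of honoring such a request is assumed to exist.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneSpec:
    """A hypothetical scene-to-order request, one field per component."""
    character: str
    costume: str
    mount: str
    extras: tuple
    background: str
    season: str
    time_of_day: str
    condition: str
    mood: str


request = SceneSpec(
    character="Gustav",
    costume="riding clothes",
    mount="chestnut charger",
    extras=("manservant", "messenger"),
    background="crossroads in forested country",
    season="early fall",
    time_of_day="afternoon",
    condition="tired and dirty from a long ride",
    mood="irritable",
)

# Re-use: change only the components that differ and keep the rest.
in_tent = replace(request, costume="military uniform", mount="none",
                  background="inside a tent")
print(in_tent.mood)
```

The point of the sketch is that the irritable Gustav is specified once and then re-dressed and re-staged by changing a few fields, which is exactly what canned video cannot do.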
Interestingly, we are able to do precisely this with interactive entertainments that have narrow visual definitions. Flight simulators provide a good example of this phenomenon. The visual field of a flight simulator is simple in the extreme: blue sky. A small portion of the field of view might be taken up by the ground, but even this is a simple image that is easily constructed from a map. The only other visual element is other aircraft, which again require only a limited set of visual images. Thus, flight simulators solve the problem by confining their video to a narrow range of imagery that is calculable on the fly.
But this approach is nothing more than a ploy; it can only be used in the most contrived of situations. Most importantly, it cannot be used with the most important range of images: human beings, and especially the human face. If we wish to present interactive entertainment with any broad appeal, it is imperative that we be able to present the images of our characters’ faces, and canned video will never give us the broad range of images that we need to cover with a truly interactive application.
Conclusions
The current crop of computers does not provide enough power to allow trouble-free use of video in interactive entertainment. The limitations of these machines force us to make unpleasant trade-offs between video and interactivity. Consumer pressures will push us towards an unbalanced compromise favoring video, because consumers can more readily appreciate what they already understand (video) than what they don’t (interactivity). In the long run, however, such compromises will communicate to the consumer the message that "interactive entertainment" offers few advantages over existing video services.
The use of text or "comic book" segmented graphics eliminates these ugly trade-offs. The machines that we now have pack enough power to handle a broad range of textual approaches to interactive entertainment.
I am not arguing that the use of text will sell more product than the use of video. Instead, I am arguing that greater reliance on text will permit higher-powered interactivity. Whether such interactivity will sell is a matter for the sales executives -- and ultimately, the customers -- to answer. But as designers, we have to ask ourselves where our own responsibilities lie. The customers may not be able to appreciate what they’ve never been exposed to. They don’t know any better. Do we?