March 16th, 2011
I have long complained that few designers have a good grip on the concept of interactivity. That situation has been slowly improving over the years, but we still have no means for quantifying interactivity. I here propose a unit of interactivity and describe how it can be measured.
First, some background considerations. It is meaningless to contemplate the interactivity of a piece of software at a single instant in time; interactivity takes place over a period of time. Indeed, in some products, the interactivity is distributed over a long period of time. Consider, for example, that even as I write these words just now, I am considering their effects on what I shall be writing later. It is impossible to disentangle the interactivity of this moment from that an hour or two hence when I have completed this essay. We see the same thing in games; ofttimes the decisions that a player makes early in the game are based on anticipation of the situation much later in the game. Therefore, interactivity must be measured over the entire period of interaction.
At its broadest, we would consider interactivity over the entire period of use of a piece of software. For example, as I write this single essay, I may well try out a few experiments to improve my understanding of the editor I am using; those experiments might not have any effect on this essay, but they could well change the way I use this editor when I write later essays. Similarly, a game is meant to be played many times; the player’s frequent failures in early playings serve to improve his performance later. Strictly speaking, then, a full and proper measure of the interactivity of any product would require continuous measurements of the interactivity in each of the sessions of use during the entire life of that application. But this is too onerous a task to be of any utility. While it is theoretically sound, if we are ever to apply this measure in a fashion that helps us learn more, we’ll have to cut a few corners.
I therefore conclude that the best period of measurement of interactivity is the single session of use: the period from when the user opens the application or game until the user closes it. Typically such periods will range from a few minutes to a few hours.
Measuring the interactivity within such periods requires us to fall back on my old definition of interactivity: a cyclic process in which the user and computer alternately listen, think, and speak to each other. It is impossible for us to measure the user’s thought processes, but we can measure the information content of what the two parties speak to each other. That information content is, after all, the direct result of the thinking process of each actor and therefore is vaguely commensurate with it. Here, then, is my measure of interactivity:
The interactivity of any application is measured by the square root of the product of the total information communicated from computer to user and the total information communicated from user to computer, divided by the length of the session in seconds. The units of measurement are therefore bits/second. In mathematical form:
Interactivity = SQRT[(bits from computer to user) * (bits from user to computer)] / (session length in seconds)
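The measure just defined can be written as a small function. This is a minimal sketch; the function and parameter names are illustrative assumptions, and the session figures in the usage line are made up purely to exercise the arithmetic.

```python
import math

def interactivity(bits_to_user: float, bits_to_computer: float, session_seconds: float) -> float:
    """Square root of the product of the two information totals, per second of session."""
    if session_seconds <= 0:
        raise ValueError("session length must be positive")
    return math.sqrt(bits_to_user * bits_to_computer) / session_seconds

# Hypothetical session: 1,000,000 bits out, 10,000 bits in, over 100 seconds.
print(interactivity(1_000_000, 10_000, 100))  # 1000.0 bits/second
```

Because the two totals are multiplied, a session in which the user sends nothing scores zero however much the computer displays.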
The First Term
So, how do we measure these “total information” values? In the case of the computer, that’s actually rather easy to do. We capture the screen display as a movie lasting the entire session, and then store that movie in a lossless format. Unfortunately, there are no appropriate lossless movie formats that do not oversample the information content of the movie, so we will have to rely on a low-loss format such as MPG. This raises some minor problems. Suppose, for example, that an application presents the player with an absolutely static screen that never changes. (This is, of course, an absurd case because if nothing happens, there can be no interactivity.) Even so, the actual information content of the movie should be no more than the information content of a losslessly compressed static image, yet the MPG file will surely be much larger than the static image file. I think that we can live with this difficulty because any real imagery will surely show some activity, which will show up as an increase in size of the MPG file. The distinction may introduce some bias between highly animated images such as we see in 3D first-person games and lower-activity images such as we see elsewhere. An alternative is to measure the amount of graphical change arising from each action on the part of the user and add up all those graphical changes. However, it is unlikely that the two methods will yield similar values even when applied to the same program; it is likely that a scaling factor will be required to make the results of the two methods directly comparable. Further study is required on this point.
One could argue that this measure will overstate the true information content communicated from computer to user, because there will be some redundant or unnecessary information (“eye candy”). Ultimately, however, I maintain that such technically unnecessary information is nevertheless sensed and appreciated by the user, and therefore should be included in the overall calculation. For example, suppose that the output of an application includes a simple warning sound. That warning sound communicates just a few bits of information, yet it could be stored as a digitized frog croak or siren sound that takes up a considerable amount of memory to store. Do we measure the information transmitted by the computer, the information sensed by the player, or the amount of information communicated by the message? A hard-nosed definition would use the last of these three possibilities, but this would imply that a game using simple beeps is every bit as interactive as another game accomplishing the same results by providing a human voice message. I don’t think it correct to equate simple beeps and buzzes with the human voice.
The decision is forced upon us by the fact that we cannot measure what the user actually senses; we can only measure what the computer transmits. I think it fair to assume that everything transmitted by the computer is in some fashion intended to be acquired by the user. There are a few counterexamples, such as a display of complex noise. This is not a completely absurd case; consider the screen display of the Atari 2600 game Yars’ Revenge:
These random bits require a great deal of information to transmit (actually, they are simply a graphic representation of the object code), yet they truly do not represent a great deal of information actually received by the player. Nevertheless, I am willing to accept this weakness in the approximation on the grounds that such displays are rare; few games or applications rely on randomly generated information displays.
So we have our means of measuring the amount of information communicated from computer to user: simply capture the screen display as a movie covering the entire period of play, save that movie in a compressed form, and use the resultant file size.
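The alternative mentioned earlier, adding up the graphical change produced by each user action, could be sketched as follows. It assumes that one frame is captured after each action and that frames are raw RGB byte strings of equal size; the function names are illustrative and do not belong to any capture tool.

```python
def changed_bytes(prev: bytes, curr: bytes, bytes_per_pixel: int = 3) -> int:
    """Bytes belonging to pixels that differ between two equal-sized frames."""
    if len(prev) != len(curr) or len(prev) % bytes_per_pixel:
        raise ValueError("frames must be the same size and hold whole pixels")
    changed_pixels = sum(
        prev[i:i + bytes_per_pixel] != curr[i:i + bytes_per_pixel]
        for i in range(0, len(prev), bytes_per_pixel)
    )
    return changed_pixels * bytes_per_pixel

def total_change_bits(frames: list[bytes], bytes_per_pixel: int = 3) -> int:
    """Sum the graphical change across consecutive frames, in bits."""
    return 8 * sum(
        changed_bytes(a, b, bytes_per_pixel) for a, b in zip(frames, frames[1:])
    )

# Toy 4-pixel frames: one pixel changes, then two more.
f0 = bytes(12)
f1 = bytes([255, 0, 0]) + bytes(9)
f2 = bytes([255, 0, 0]) + bytes([0, 255, 0]) * 2 + bytes(3)
print(total_change_bits([f0, f1, f2]))  # 3 changed pixels * 3 bytes * 8 bits = 72
```

Under this method a static screen contributes nothing, which sidesteps the oversized-movie problem described for the static case.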
The Second Term
Next we turn to the problem of measuring the information communicated from the user to the computer. This is, theoretically, another simple task: simply count the number of verbs available to the user at any given time, then use Shannon’s standard definition of information content:
Information content of a message = -log2(probability of that message)
For example, suppose that I am going to flip four coins and tell you the outcome in terms of how many coins came up heads. The probability of them coming up as four heads is only 1/16, so the information content would be 4 bits. If instead they came up with two heads, this is a more likely result (probability = 6/16), so the information content would be only 1.4 bits.
Suppose then that our user is playing a simple game in which he has only 4 choices at any given point, and each choice is binary in nature, and he can select only one choice at a time. In other words, the player is confronted with a situation akin to one in which he faces four doors and must choose which door to go through. The probability of each choice is 1/4, so the information content of each decision the user makes is 2 bits; if the user faces 100 sets of doors, then the information content that he communicates to the computer is only 200 bits.
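The coin and door figures follow directly from Shannon's definition, as this short sketch shows. The helper name `information_bits` is an illustrative assumption.

```python
import math

def information_bits(probability: float) -> float:
    """Shannon information content of a message with the given probability."""
    return -math.log2(probability)

# Four coin flips, reported as the number of heads.
for heads in (4, 2):
    p = math.comb(4, heads) / 2**4
    print(heads, "heads:", round(information_bits(p), 2), "bits")

# Four equally likely doors, chosen 100 times.
print(100 * information_bits(1 / 4), "bits")
```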
In practice, this calculation is never so simple. The player will have many more choices; some of those choices will have differing probabilities; and in real-time operation, the speed with which those choices are made can be very high. Suppose, for example, that a player in a first-person shooter opts to quietly wait for some period of time. He is choosing the verb “do nothing”. But if he chooses that verb for two seconds, does that constitute less or more information than if he chooses that verb for five minutes? If the duration matters, how rapidly should we sample his actions? Typically a player’s actions are polled every 60th of a second; does that mean that the player makes 60 decisions each second? I maintain that “do nothing” is not a verb at all; it simply means “don’t interact” and thus does not constitute any kind of message from user to computer. It is true that in some circumstances, such as a player lying in wait for prey, “do nothing” takes on interactive significance, but such cases are rare enough to be safely ignored. After all, how many players in first-person shooters sit around doing nothing much of the time? If I stop to think while writing this essay, is that part of my interaction with the computer or does it fall outside the purview of the interaction? I claim the latter.
How do we determine the probability of each verb? On the one hand, we could simply define the probability of each verb to be the fraction of its usage over the entire session. This is mathematically pure, but it has one weakness. Imagine an absurdly simple game in which the player has two choices: “shoot the monkey” or “shoot the baboon”. Suppose that one player opts to shoot the monkey every single time and never shoots the baboon. Then the probability of the player using the verb “shoot the monkey” is exactly 1 and the total information content of his session is exactly 0. This is obviously an absurd result.
The alternative is to treat each verb as equally probable. This accurately represents the free will that the user exercises during the session. However, it has a more serious flaw: consider, for example, that a session with a word processor will include the verb “Save” only a few times, and will include the verb “keystroke: e” many times. Because the first verb is treated as equally probable as the second verb, the probability of the second verb is measured to be lower than its actual use, which will in turn incorrectly elevate the calculated information content of the session.
I conclude that the absurd result of the first solution (measuring probability by usage fraction) is due only to the absurdity of the contrived situation. In any real session, the probabilities are best represented by the actual fraction of usage.
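The two ways of assigning probabilities can be compared side by side. The sketch below is illustrative only: the function names are assumptions, and the second session is a made-up two-verb word processor in which “keystroke: e” is used 98 times and “Save” twice.

```python
import math
from collections import Counter

def bits_by_usage(messages: list[str]) -> float:
    """Total information when each verb's probability is its share of the session."""
    counts = Counter(messages)
    total = len(messages)
    return sum(n * math.log2(total / n) for n in counts.values())

def bits_equiprobable(messages: list[str], verbs_available: int) -> float:
    """Total information when every available verb is treated as equally likely."""
    return len(messages) * math.log2(verbs_available)

monkey_only = ["shoot the monkey"] * 10
print(bits_by_usage(monkey_only))         # probability 1 for every message, so 0 bits
print(bits_equiprobable(monkey_only, 2))  # one bit per shot

typing = ["keystroke: e"] * 98 + ["Save"] * 2
print(bits_by_usage(typing), bits_equiprobable(typing, 2))
```

In the typing session the equal-probability figure comes out several times larger than the usage-based one, which is the elevation described above.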
One last quibble must be addressed: what about logical linkages between verbs? That is, not all verbs are completely independent of each other: in some cases, use of one verb might often be followed by use of another verb. For example, in a word processor, the verb “keypress: q” is often immediately followed by the verb “keypress: u”. In a first-person shooter, the verb “aim gun” is often followed by the verb “shoot gun”. If verbs are not independent, then the probabilities of their use are not correctly measured by the fraction of their use. I believe that this consideration is small enough to be ignorable for the time being, but further research is necessary to determine just how serious this problem is.
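One possible way to gauge how serious the linkage problem is, offered here only as an assumption and not as part of the procedure in this essay, is to score each message by how often it follows its predecessor and compare that total with the independent count. The function names are hypothetical.

```python
import math
from collections import Counter

def bits_independent(messages: list[str]) -> float:
    """Each message scored by its overall usage fraction."""
    counts = Counter(messages)
    total = len(messages)
    return sum(n * math.log2(total / n) for n in counts.values())

def bits_given_previous(messages: list[str]) -> float:
    """First message scored by usage fraction; every later one by how often it follows its predecessor."""
    if not messages:
        return 0.0
    pair_counts = Counter(zip(messages, messages[1:]))
    prev_counts = Counter(messages[:-1])
    first = math.log2(len(messages) / messages.count(messages[0]))
    rest = sum(
        n * math.log2(prev_counts[prev] / n) for (prev, _), n in pair_counts.items()
    )
    return first + rest

# An extreme case: "q" is always followed by "u" and vice versa.
keys = ["keypress: q", "keypress: u"] * 10
print(bits_independent(keys))     # 20.0
print(bits_given_previous(keys))  # 1.0
```

The gap between the two totals indicates how much the independence assumption overstates the user's contribution in a given session.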
We now have the procedure for measuring the interactivity of any application or game:
1. Capture the session as a movie, compress it, and measure the resulting file’s size. Alternatively, measure the size of the graphical change resulting from each verb and add up all those graphical change sizes.
2. Count up all the messages sent by the user to the computer. A message consists of a single keystroke (including prefix keys), a single mouseclick, or a click-drag-release event. In programming terms, a message is simply an input event. Determine the probability of each message by dividing the number of times the message was sent by the total number of messages. Then go through the entire sequence of messages, adding up the log (base 2) of the probability of each message in one grand sum. The negative of this value is the information content in bits sent by the user to the computer.
3. Measure the length of the session.
4. Multiply the first two numbers together, take the square root, and divide the result by the length of the session. The result is the total interactivity that the user experienced during that session.
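The four steps can be strung together in a few lines. This is a sketch under stated assumptions: the output size is supplied in bytes (from either the compressed capture or the summed graphical changes), the messages arrive as a list of event labels, and the function name and sample session are hypothetical.

```python
import math
from collections import Counter

def session_interactivity(output_bytes: int, messages: list[str], session_seconds: float) -> float:
    # Step 1: information from computer to user, converted to bits.
    bits_out = 8 * output_bytes
    # Step 2: probability of each message is its share of all messages sent;
    # sum log2(probability) over the whole sequence and negate.
    counts = Counter(messages)
    total = len(messages)
    bits_in = -sum(math.log2(counts[m] / total) for m in messages)
    # Steps 3 and 4: multiply, take the square root, divide by the session length.
    return math.sqrt(bits_out * bits_in) / session_seconds

# Hypothetical session: 1,250 bytes of output, 100 inputs spread evenly
# over four verbs (2 bits each), lasting 10 seconds.
events = ["a", "b", "c", "d"] * 25
print(round(session_interactivity(1250, events, 10), 2))
```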
Some simple examples
I will now carry out this calculation for three simple cases: this webpage editor that I am using, a web browser, and a simple solitaire game. Because these are all graphically simple programs, I shall not need to capture movies of these sessions. Instead, I will use the alternative method of summing all the graphical changes in the imagery. Moreover, I’ll not go through the trouble of carefully counting up each and every verb: instead I shall apply some simplifying assumptions to make the calculation easier.
The information communicated from the webpage editor to me consists of a series of still images, each one containing a small change in the image representing the consequence of the message I just sent to the computer. Usually, that change is the appearance on the screen of the letter whose key I just pressed. Each keystroke, on average, changes about 150 pixels on the screen, each of which contains 3 bytes of information. Thus, the computer communicates about 450 bytes of information each time I enter a keystroke. For simplicity, and to take into account margins, leading, and so forth, I’ll bump that number up to 500 bytes/keystroke. This essay includes about 20,000 keystrokes, if we include the editing changes I made. Thus, the computer communicated a total of about 10 MB of information to me during this session.
Now let’s calculate my information transmission to it: 20,000 keystrokes. Again, I’m going to cut corners and use one number for all those keystrokes. The lower bound is the size of the ASCII character set I’m using: about 100 characters, yielding a probability of 0.01 per keystroke. However, the vast majority of my keypresses are lower case alphabetic characters, of which I typically use only about 20 characters. That implies a probability of 0.05 per keystroke. I’ll compromise on a value of 0.04 as the probability of a single keystroke. This means that each keystroke contains roughly 5 bits of information. Multiply that by 20,000 keystrokes and we get 100,000 bits going from me to the computer. This all took about 100 minutes to write, or about 6000 seconds. Here’s the calculation:
Interactivity = SQRT[(10 MB * 8 bits/byte) * (100,000 bits)] / 6000 seconds
= SQRT[(8 x 10**7 bits) * (1 x 10**5 bits)] / 6 x 10**3 seconds
= SQRT[8 x 10**12 bits**2] / 6 x 10**3 seconds
= ~3 x 10**6 bits / 6 x 10**3 seconds
= ~5 x 10**2 bits / second
so the final result is about 500 bits per second.
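Since the per-keystroke probability was a compromise between 0.01 and 0.05, it is worth checking how much the choice matters. The sketch below reruns the editor calculation for each candidate value, using the unrounded information content per keystroke rather than the rounded 5 bits; the variable names are illustrative.

```python
import math

bytes_per_keystroke = 500
keystrokes = 20_000
session_seconds = 100 * 60

bits_out = bytes_per_keystroke * keystrokes * 8  # 8 x 10**7 bits

results = {}
for probability in (0.01, 0.04, 0.05):
    bits_per_keystroke = -math.log2(probability)
    bits_in = bits_per_keystroke * keystrokes
    results[probability] = math.sqrt(bits_out * bits_in) / session_seconds
    print(f"p = {probability}: {bits_per_keystroke:.1f} bits/keystroke, "
          f"{results[probability]:.0f} bits/second")
```

Because the user's information enters under a square root, the result moves only modestly across the whole range of plausible probabilities.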
Second Example: Web Browser
My web browser launches into Google News; I captured the screen and stored it in PNG format, a lossless compression format. The file size was 365 KB. I counted links and menus and came up with about 100 options available to me on that screen. I click on one link and it takes me to a news story whose screen image weighs in at 300 KB. Let’s suppose that I continue this process for one hour, taking about 3 minutes to read each news story. That means that I read 20 news stories, each presenting me with 300 KB of information, for a total of 6 MB of information. Each of the 20 pages I read presents me, let’s say, with another 100 links. That means that I communicated only 20 messages to the computer, each containing about 6 or 7 bits of information, for a total of about 130 bits. Thus, the total interactivity of this experience would come out to be:
Interactivity = SQRT[6 x 10**6 bytes * 8 bits/byte * 130 bits] / 3.6 x 10**3 seconds
= SQRT[6.24 x 10**9 bits**2] / 3.6 x 10**3 seconds
= ~7.9 x 10**4 bits / 3.6 x 10**3 seconds
= ~22 bits/second
The browser provides only about 4% of the interactivity of my webpage editor session! Doesn’t that make sense? Isn’t web browsing a less intense experience than writing a long essay?
Third Example: Mah-Jong Game
For my third example, I played a solitaire game that is a variation on the Mah-jong tile game. This game presents the following screen, which itself uses 180 KB of information content:
The game is played by clicking on one tile and dragging it to a tile of the same type. When this is done, both tiles disappear, which changes about 10 KB worth of imagery. There are 144 tiles total; thus, the total amount of image change during the course of the game is 1.44 MB. There’s also a nice sound effect associated with the elimination of each pair of tiles, so I’ll bump the total information output of the game up to 1.5 MB. My own input consisted of 72 moves, each one constituting a click-and-drag operation. To measure this activity, I count the number of legal moves at any given point in the game and use that number to determine the probability of the move that I made. Unfortunately, this is a complicated calculation, because the number of legal moves drops rapidly through the course of the game. At the beginning of the game, there are 34 accessible tiles, making for about 500 possible moves; this rapidly falls so that, by midgame, the player has only about ten free tiles, making possible only 50 moves to choose from. At the end of the game, there might be only four or six tiles free, making possible maybe a dozen moves. I’m going to apply the midgame situation to the entire game: 50 possible moves over 72 turns. That’s a probability per move of 0.02, meaning that each move communicates about 5.6 bits, for a total of 403 bits communicated from me to the computer. All this took place in 5 minutes, so the total interactivity of the session calculates out as follows:
Interactivity = SQRT[1.5 x 10**6 bytes * 8 bits/byte * 4 x 10**2 bits] / 3 x 10**2 seconds
= SQRT[~5 x 10**9 bits**2] / 3 x 10**2 seconds
= ~7 x 10**4 bits / 3 x 10**2 seconds
= ~2 x 10**2 bits/second
so the final result is about 200 bits per second.
Consider the results for these three examples:
Webpage editor: 500 bits per second
Web browser: 22 bits per second
Solitaire game: 200 bits per second
Do you find these results surprising? I don’t. Still, I’d like to do a better job with the calculations and extend the measure to other kinds of software.
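As a step toward doing a better job with the calculations, the three results can be rechecked from the totals quoted in each example. This is a minimal sketch; the dictionary layout and function name are illustrative assumptions, and the inputs are the rounded figures given in the text.

```python
import math

# (bytes from computer, bits from user, session seconds), as quoted in each example
sessions = {
    "Webpage editor": (10_000_000, 100_000, 100 * 60),
    "Web browser": (6_000_000, 130, 60 * 60),
    "Solitaire game": (1_500_000, 400, 5 * 60),
}

def interactivity(bytes_out: int, bits_in: float, seconds: float) -> float:
    """Square root of (output bits * input bits), divided by the session length."""
    return math.sqrt(bytes_out * 8 * bits_in) / seconds

results = {name: interactivity(*totals) for name, totals in sessions.items()}
for name, value in results.items():
    print(f"{name}: {value:.0f} bits per second")
```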
Another Useful Measure
This definition also provides us with another useful number: the ratio of information communicated by the computer to the information communicated by the user. For our three programs, that number, which I will call the “interactivity ratio”, is as follows:
Webpage editor: 800
Web browser: ~400,000
Solitaire game: ~30,000
The interactivity ratio tells us the degree to which the computer talks relative to how much the user talks. We might call it the “hog-the-conversationness” of the program, although in the case of the web browser, the user does not desire to speak; the sole purpose is to listen.
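The ratio is a single division of the same two totals used in the interactivity calculations. In this sketch the function name is an illustrative assumption and the inputs are the rounded totals from the three examples.

```python
def interactivity_ratio(bits_to_user: float, bits_to_computer: float) -> float:
    """How many bits the computer sends for each bit the user sends."""
    return bits_to_user / bits_to_computer

# (bits from computer to user, bits from user to computer)
totals = {
    "Webpage editor": (10_000_000 * 8, 100_000),
    "Web browser": (6_000_000 * 8, 130),
    "Solitaire game": (1_500_000 * 8, 400),
}

ratios = {name: interactivity_ratio(*pair) for name, pair in totals.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.0f}")
```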
A Useful Conclusion
This definition of interactivity is useful in that it immediately suggests that the best way to increase the interactivity of any program is to increase the user input: to listen more. Most software suffers from too much talking and not enough listening. That’s one reason why users can become so frustrated with their software: it just won’t listen to what they’re trying to say.