From Data to Information
Numbers and meaning
If you could look into the heart of a computer, you would find no spreadsheets, no programs, no words to process, no aliens to blast. All you would find are numbers, thousands and thousands of numbers. The fundamental measurement of a computer’s power is its storage capacity for numbers, typically 512 thousand numbers on a personal computer. With these numbers, the computer is capable of only a very small repertoire of manipulations. It can move them, add them, subtract them, compare them, and perform the simple logical operations known as Boolean operations. Where in this mass of numbers and simple manipulations is meaning? How can the computer transform all these numbers into words to process, alien invaders, or programs?
Consider atoms. Simple things, atoms. They can interact with each other according to the laws of chemistry. There are lots of combinations there, but little in the way of meaningful interaction. Yet, put enough atoms together and you get a human being, a person with character, feelings, and ideas. If you look deep inside a human being, all you will find are lots and lots of chemical reactions. Meaning does not come from the smallest components, but from the way that they are organized and the context in which they are used.
Data is what the computer stores, but information is what we seek to manipulate when we use the computer. The key word in understanding the difference between data and information is context. Data plus context gives information. This is a fundamental aspect of all communication systems, but it is most clearly present in the computer. The computer stores only numbers, but those numbers can represent many things, depending on the context.
They can, of course, represent numbers with values, things like a bank balance, or a score on a test, or somebody’s weight. Even then, these numbers are not without a context of their own. First, they have dimensions, the units with which they are measured. We don’t say only that my weight is 110; it is 110 pounds. The number 110 all by itself doesn’t mean anything; you have to include the unit of measure to give it a context that makes it meaningful. Similarly, my bank balance of 27 makes no sense until I specify whether it is 27 dollars, 27 cents, 27 pesos, or whatever it is.
There is another context to consider when using the computer. It recognizes only one kind of number: the 16-bit integer. This is a number ranging from 0 to 32,767, with no fractions or decimal points. (Sixteen bits could count as high as 65,535, but one bit is reserved for the number’s sign, which is why the top is 32,767.) In other words, the computer can count like so: 0, 1, 2, 3, 4, . . . 32,765, 32,766, 32,767. It cannot recognize a number bigger than 32,767. When it reaches 32,767, the next number is just 0; it starts all over again. Now, you might wonder what use there is in a computer that can only recognize the first 32,768 numbers in the whole universe. Well, there’s a trick that programmers learned long ago. You can combine little numbers to make big numbers. Actually, we do it all the time. If you think about it, you only know ten numerals yourself. Those ten numerals are 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. You think you know more? Look closely at the next number, 10. It’s nothing but a 1 followed by a 0. There’s nothing new or different about the number 10; it’s just two old numerals stuck together!
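If you have a modern machine handy, the wraparound described above is easy to model. Here is a minimal sketch in Python (the book’s own examples are in BASIC, so treat this as a translation), assuming the simple 0-to-32,767 counting range described above:

```python
# A sketch of the wraparound described above: counting past 32,767
# starts over at 0. This models the text's 0..32767 range as
# arithmetic modulo 32,768; real machines varied in the details.
LIMIT = 32768  # number of distinct values: 0 through 32767

def bump(n):
    """Add 1 the way the text's computer counts."""
    return (n + 1) % LIMIT

print(bump(5))      # 6
print(bump(32767))  # wraps around to 0
```

Notice that nothing "breaks" at the top; the counter simply rolls over, exactly as the odometer of a car does.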
Of course, you know perfectly well that what makes 10 different from 1 or 0 is manner in which you interpret it. The number 10 has a context of its own. We think in terms of "the tens place" and "the ones place", and so we interpret 10 as "1 in the tens place plus 0 in the ones place." Using this system, we can build any number we want. The only price we pay is that we have to write lots of digits to express big numbers.
The programmer’s trick is to do the same thing with the computer. If you stick together 16-bit words, you can get bigger numbers. With the computer, you pay two prices to get bigger numbers: first, it takes one 16-bit word for each part of the number that you add, and second, it takes more computer time to manipulate these bigger numbers. There is also the restriction of context: you have to remember that the numbers in such a compound number belong together and must be taken as a group, rather than individually.
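The trick of sticking small numbers together can be sketched like this (Python again; the function names are mine, not any standard):

```python
# Combine two 16-bit words into one big value, the way the digits
# "1" and "0" combine to make "10". The high word counts units of
# 65,536, just as the tens place counts units of ten.
def combine(high, low):
    return high * 65536 + low

def split(value):
    return value // 65536, value % 65536

big = combine(7, 1000)   # 7 * 65536 + 1000
print(big)               # 459752
print(split(big))        # (7, 1000)
```

The two words 7 and 1000 mean nothing individually; only taken together, in the right order, do they represent 459,752. That is the "restriction of context" at work.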
It is even possible to group these 16-bit numbers together in such a way as to interpret the group as what is called a floating-point number. This has nothing to do with water or boats; it is a number whose decimal point is free to move around ("float", get it?) within the number. The idea sounds weird until you see some examples:
Floating point numbers     Integers
3.1416                     3
-27.63                     -27
100.0                      100
As you may have guessed, all floating point numbers have a decimal point. The big question about any floating point number is: how many significant figures does it have? Let me show you an example, using the value of π:

3.14159265358979324
3.1415926536
3.1416
3.14
3
The first value gives π to 18 significant figures. The second value gives π to 11 significant figures. The third gives it to only 5; the fourth gives it to 3; and the fifth gives only one. Each number is correct to within its number of significant figures; each is rounded off from the previous one. A lot of people make the mistake of assuming inappropriate zeros. For example, take that last value of π, 3: is it a 3 or a 3.0 or a 3.000 or a 3.0000000000? Many people think that 3 is the same thing as 3.000000000, but it isn’t. The next digit of π after the 3 should be a 1, but we rounded it off when we went down to only one significant figure. So, if you were trying to reconstruct the value of π after I gave you only a 3, you would be wrong to put a 0 after the 3 to make it a 3.0. In other words, 3 is not the same as 3.0. If you want to say 3.0, say it; if I say 3, don’t read it as 3.0, because it isn’t. It could be 3.1, or 2.9, or anything between 2.5000000 and 3.4999999.
Significant figures matter because they show us the limitations of computer arithmetic. Remember, each significant figure costs you some RAM space and some execution time. For this reason, some computers use only 4 bytes to store a floating-point number; others may use 8 or even more bytes. A floating-point number expressed with 4 bytes has about 7 significant figures; thus, you could express π this accurately with such a computer:
π = 3.141593
This is fairly accurate for most purposes. But now we come to a nasty trick that trips up lots and lots of people. Suppose I divide 1 by 3. That should yield the fraction 1/3rd, whose decimal value is .3333333 . . . , with the 3’s repeating forever. Now, when I do this division on my computer equipped with 4-byte floating point arithmetic, it will report the result as .3333333, with 7 significant figures of 3’s, but not an infinite number of them. The difference between the computer’s answer (.3333333) and the correct answer (.3333333 . . .) is small (about one part in ten million), but the fact remains that the computer is wrong. Now, this discovery tends to upset some people. They think that computers are always right, that they can make no mistakes, especially with arithmetic, yet here is incontrovertible proof that the computer is wrong. This really rattles their cage.
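You can reproduce this for yourself on a modern machine by forcing a value into 4-byte (single-precision) floating point. Here is one way to do it in Python, using the standard struct module to do the rounding; this is a sketch of the effect, not of how any particular BASIC stored its numbers:

```python
import struct

def to_float32(x):
    """Round x to the nearest 4-byte floating point value."""
    return struct.unpack('f', struct.pack('f', x))[0]

third = to_float32(1.0 / 3.0)
print(third)            # close to 1/3, but not equal to it
print(third == 1 / 3)   # False: round-off error
print(abs(third - 1 / 3))  # the tiny difference
```

The computer’s answer is wrong by a tiny amount, exactly as described above, because 1/3rd simply cannot be written exactly in a finite number of significant figures.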
The problem is not that the computer is mistaken, or that it is stupid and cannot perform arithmetic. The problem is that there is no mathematical way to correctly express the value of 1/3rd with a finite number of significant figures. There isn’t enough room to be accurate in so small a space. Suppose, for example, that you had a brilliant plan to solve, say, the problem of the American budget deficit. You had figured out a detailed plan that included all the critical factors for eliminating the budget deficit without wiping out the economy. I then gave you one piece of paper and a crayon and told you, "You think you’re so smart, put your plan on that paper with that crayon." You may have the answer, but if you don’t have enough room to say it, you come out looking pretty stupid. The same goes for the computer: with anything less than an infinite number of significant digits, the computer will sometimes be wrong by a tiny amount.
This problem is so common that it has a name: round-off error. We call it that because the computer rounds off numbers to make them fit into its floating-point format, and in the process, it can round off some of the accuracy of the number. In some cases, it can completely wipe out your number. For example, suppose as part of your plan to solve the deficit, you had developed a computer program to figure out how much money to allocate to each part of the Federal budget. Let’s say that you had even figured the amount of money to go for buying file folders at the White House, and that you figured $23.57 a year would be a good figure. Now suppose you have a "bottom line" routine that adds up all the expenditures of the budget to see what the grand total is. Remember, we’re talking hundreds of billions of dollars here. Let’s say that the grand total is about $300 billion by the time the program gets around to adding in your figure for file folders, and that the program statement looks like this:

3070 TOTAL = TOTAL + 23.57
Now, the computer will add the numbers like this:

    300,000,000,000.00
  +             23.57
  --------------------
    300,000,000,000.00
If you count digits, you will see that the computer’s seven significant digits are used up on the high part of the number; the 2 in 23.57 is in the eighth significant digit place, and so it is rounded off, right out of existence! It’s as if the $23.57 never existed. Your program would produce unreliable results, and you would think that it had a very mysterious bug. In truth, this is one of the natural limitations of the computer. The moral of this story is, if you want the computer to use great big numbers next to little bitty numbers, you need lots of significant digits, which will take more space and run more slowly. Accuracy truly does have its price.
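The file-folder disaster is easy to reproduce with 4-byte floating point. Here is a sketch in Python, again using the struct module to force single precision; the budget figures are the hypothetical ones from the story above:

```python
import struct

def to_float32(x):
    """Round x to the nearest 4-byte floating point value."""
    return struct.unpack('f', struct.pack('f', x))[0]

budget = to_float32(300e9)        # $300 billion, in 4-byte floating point
folders = 23.57                   # the White House file-folder line item
new_total = to_float32(budget + folders)
print(new_total == budget)        # True: the $23.57 vanished entirely
```

The $23.57 is smaller than the gap between adjacent 4-byte floating point numbers at the $300 billion scale, so adding it changes nothing at all.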
Numbers can mean more than just values. They can also be used to mean alphanumeric characters. These are just letters and symbols like "a", "(", or "%". The system for using them is very simple; it uses a code called ASCII (pronounced "ass-key"), an acronym meaning "American Standard Code for Information Interchange." This code assigns a number to every character. Perhaps you used a code like that when you were a kid. A 1 stood for the letter A, a 5 stood for the letter E, and so forth. This code is similar, but its purpose is not to hide messages but to make them understandable to the computer, which, after all, only understands numbers. Another difference is that the letter A does not get a 1, but a 65, while B gets 66, C gets 67, and so forth. Every letter and symbol gets its own number. The reason why A starts at 65 is a bit of technical trivia with which I won’t waste your time.
With this one code you can store text messages inside the computer. To use it, you convert a character to a number using the ASCII code and store the number in the computer. To read it out, just convert back. Lo and behold, almost all versions of BASIC will do this automatically for you with a facility called "string data". A string is a collection of numbers that are always treated in the context of ASCII code conversion. You can always treat a string as a collection of characters, even though it’s really a collection of numbers. Using strings from BASIC is very simple. Here’s a simple example:
50 NAME$="FRED"
60 PRINT NAME$
There are only two syntax rules to note about this construction. First, a string is always indicated by a "$" symbol at the end of the variable name. That tips off the computer that you want this data treated as a string. Second, the string data should be placed inside a pair of double quotation marks.
I cannot tell you much more about string handling because different computers handle strings differently. Some allow you extensive facilities for manipulating strings, allowing you to join strings, extract a portion of a string, insert and delete sections of a string, and much more. Two fairly common facilities, though, are the ASC function and the CHR$ function. These two functions allow you to see the code conversion process. Try this little example out on your computer:
80 PRINT ASC("C")
90 PRINT CHR$(67)
The first line will print the ASCII value of C, which should be 67. The second line will print the character corresponding to 67, which is C. Thus, you can take strings apart, find their numeric equivalents, and manipulate them with arithmetic, although that is certainly the hard way to do it.
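If your machine has no BASIC handy, the same code conversion is visible in most modern languages. In Python, the built-in functions ord and chr play the roles of ASC and CHR$:

```python
# ord() and chr() do what BASIC's ASC and CHR$ do: convert between
# characters and their ASCII code numbers.
print(ord('C'))   # 67
print(chr(67))    # C

# Taking a whole string apart into its numeric equivalents:
codes = [ord(c) for c in "ABCD"]
print(codes)      # [65, 66, 67, 68]
print(''.join(chr(n) for n in codes))  # ABCD
```

Underneath, the string really is nothing but those numbers; the string context is what turns them back into letters.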
Another kind of data is Boolean data, named after George Boole, who founded the mathematics of formal logic. Boolean data is very simple: it takes one of only two values, true or false. Most BASIC languages store a zero to represent a value of false, and something else to indicate a value of true. Quite often, computer programs allow the user to set a particular choice, a choice that is either taken or not taken. For example, a program might ask you if you want some data sent out to the printer. You can answer yes (true) or no (false). The program can then keep track of your answer as a variable called, say, CHOOSEPRINTER. Then, whenever it is about to send something out, it might have a statement like this:
1120 IF CHOOSEPRINTER THEN GOTO 2000
This statement would treat the value CHOOSEPRINTER the same way it would treat a logical expression. If the result were true, it would GOTO 2000; otherwise it would continue on. Thus, the Boolean variable is a good way to keep track of such true/false conditions. Remember, though, that it really is a number, just interpreted differently.
The numbers in a computer can be interpreted in a completely different manner. They can be treated as instructions to the computer. Even then, there are two variations on this.
Your BASIC program is stored in RAM as a set of instructions for the computer. Each instruction has a code number, called a token, associated with it. For example, the token for the command PRINT might be 27. If this were the case, then the command PRINT "ABCD" would be stored in RAM as 27, 65, 66, 67, 68. The 27 stands for PRINT and the 65, 66, 67, and 68 are the ASCII codes for "ABCD". To RUN a BASIC program, the computer would scan through RAM, looking at each instruction code and translating it into action.
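The scanning-and-translating process can be sketched with a toy decoder. This Python sketch uses the hypothetical token value 27 for PRINT from the paragraph above; real BASICs used different token values:

```python
# A toy decoder for tokenized BASIC: the first number is a token
# standing for a keyword, and the numbers after it are ASCII codes.
# (Token 27 for PRINT is the text's hypothetical example, not a
# real BASIC's token table.)
TOKENS = {27: 'PRINT'}

def detokenize(codes):
    keyword = TOKENS[codes[0]]
    text = ''.join(chr(n) for n in codes[1:])
    return '{} "{}"'.format(keyword, text)

print(detokenize([27, 65, 66, 67, 68]))   # PRINT "ABCD"
```

The same five numbers that earlier meant nothing in particular here mean a complete BASIC statement, purely because of the context in which they are read.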
The second form of computer instructions is what is called native code. These are instructions that the computer itself recognizes as instructions to directly execute. The difference between BASIC instructions and native code is that the BASIC instructions are foreign to the computer. That is, the computer does not really know what the BASIC instructions mean for it to do; after it reads a BASIC instruction, it must look up the meaning of the instruction in a "book of commands" called an interpreter. The interpreter allows the computer to figure out what it is supposed to do. As you might imagine, a BASIC program is slowed down quite a bit by having to go through this interpreter. What is worse, the computer must interpret each instruction each and every time it encounters the instruction, even if it has executed that instruction thousands of times previously.
Native code is much faster than interpreted code. Native code is program instructions that are couched in the natural language of the computer. This language, called machine language, is built deep into the innards of the computer and cannot be changed. It is the fundamental language that the computer uses for all its work. A BASIC interpreter translates your BASIC commands into the computer’s machine language.
What, you might wonder, does machine language look or sound like? Perhaps you imagine some weird language of beeps and buzzes. But no, machine language is nothing more than numbers. For example, a 96 will tell some computers to return from a subroutine; it is exactly the same as the RETURN statement in BASIC. Other commands, however, are nothing at all like BASIC. There is more information on machine language in the appendix.
Data inside the computer can also be interpreted as pixel data. This is data to be displayed on the screen. To understand how this is done, you must first learn something about number systems. There are three commonly used number systems to master: decimal, hexadecimal, and binary. Decimal is the first. You already know about decimal; it is the number system that you normally use.
Hexadecimal is the second system. It sounds like a number system that witches might use to cast hexes, but actually, "hexa" in this case means 6, and "deci" means 10, so hexadecimal refers to a base-16 (6 plus 10) numbering system. That is, we count by 16’s in a hexadecimal system. The idea to master here is the idea of counting up until we reach the top of the number system and start over. In decimal, we do it like this: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Now, cast aside your natural familiarity with that 10 and look at it closely. What happened was this: we reached 9, the last numeral in our possession. To go to the next higher number, we started over again with 0, but we put a 1 in the 10’s place. When we reach 99 and add 1, the final 9 goes over the top, so we replace it with 0 and carry a 1; that carry throws the other 9 over the top too, so we carry again and end up with a 1 in the hundreds place: 100. The rule is simple: when you reach the highest numeral in the system and go up, replace it with a 0 and add 1 to the next place. That place is a 1’s place, or a 10’s place, or a 100’s place, or so on in the decimal system.
In the hexadecimal system we count by 16’s. The next 6 numbers after 9 are A, B, C, D, E, and F. So we count like this: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, 10. Now, be careful about that last 10. It is not the same as the 10 you are used to seeing. It’s really the number after F, and F is 15, so 10 is 16. Does that confuse you?
As you might imagine, reading hexadecimal numbers can be quite confusing, so programmers have one little trick to help out. Whenever a programmer writes down a hexadecimal number, he puts a dollar sign in front of it so that you’ll know that it is special. Thus, $10 is hexadecimal 10, or 16, but 10 is decimal 10, or just plain old everyday 10.
Doing anything in hexadecimal is enough to drive almost anybody nuts. Arithmetic is really wild. Where else would 8+8=$10? Or try to figure this one out: $30/2=$18. This stuff gets real hairy real fast. To help out, most programmers use a little hexadecimal calculator that lets them figure these things out quickly and easily.
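These days, an ordinary programming language can stand in for that hexadecimal calculator. In Python, the prefix 0x plays the role of the programmer’s dollar sign:

```python
# Python as a hexadecimal calculator. The 0x prefix marks a
# hexadecimal number, the way the text's dollar sign does.
print(0x10)               # 16: hexadecimal 10 is decimal 16
print(8 + 8 == 0x10)      # True: 8+8 = $10
print(0x30 // 2 == 0x18)  # True: $30/2 = $18 (48/2 = 24)
print(hex(255))           # 0xff
```

Checking the wild-looking $30/2=$18 this way shows it is just ordinary arithmetic wearing an unfamiliar costume: $30 is 48, and half of 48 is 24, which is written $18.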
The third numbering system that programmers use is called binary. It is a very simple numbering system, so simple that it confuses lots of people. In binary, we only count up to 1 before starting over. Thus, while decimal has 10 numerals (0, 1, 2, 3, 4, 5, 6, 7, 8, and 9), and hexadecimal has 16 numerals, binary has only two: 0 and 1. So in binary, we count like this:
Binary: 0, 1, 10, 11, 100, 101, 110, 111, 1000
Decimal: 0, 1, 2, 3, 4, 5, 6, 7, 8
This means that in decimal, 10 is 10, in hexadecimal, $10 is 16, and in binary, 10 is 2. Are you getting confused yet?
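A few lines of Python make the three systems easy to compare, since its int function can read the same digits in any base:

```python
# The same digits "10" mean different values in different bases:
print(int('10', 10))  # 10 in decimal
print(int('10', 16))  # 16 in hexadecimal ($10)
print(int('10', 2))   # 2 in binary

# Counting 0 through 8 in binary, as in the two rows above:
print([bin(n)[2:] for n in range(9)])
```

The digits carry no meaning by themselves; the base, which is pure context, decides what they are worth.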
Binary numbers get very long very quickly. For example, the number 999 in binary is 1111100111. They are also very tedious to do arithmetic with. The one saving grace of binary numbers is that they directly show the status of the bits inside the computer. A bit is the fundamental unit of memory inside the computer. We normally talk in terms of bytes, because the computer is organized around bytes. But bytes are made up of bits; there are eight bits in one byte. We normally don’t worry about individual bits because one bit is too small to do much with. I mean, what can you do with something that is either 0 or 1? Not much. About all you can do is pack eight of them together into a byte, and then you’ve got a number between 0 and 255. But there is one situation in which it is handy to worry about individual bits, and that is when you are making a screen graphic. All computers draw images on the screen by breaking the screen up into little tiny cells called pixels. The word pixel is a contraction of "picture element". On a black and white display, a pixel is either black or white. A blow-up of the letter "A" makes the point better than words:

..XX..
.X..X.
X....X
XXXXXX
X....X
X....X
Those big black squares are the pixels that we use to draw the A on the screen. Now, notice that a pixel is either black or white. There are only two states possible for a pixel, no in between. Thus, a pixel’s state can be represented by a binary number, a 1 or a 0. We might say that a 0 means white and a 1 means black. If so, then our letter A can be represented by binary numbers, one for each row in the letter, like so:

001100
010010
100001
111111
100001
100001
What we have here is something very exciting and very important: the ability to express images as numbers. Now if we apply the powerful number-crunching capabilities that the computer gives us, we can process the images themselves, just by processing the numbers that represent the images. That’s how computer games are able to create those animated images. Behind every twisting, grimacing alien, there’s a microprocessor frantically shuttling numbers around.
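A short Python sketch shows the round trip from numbers back to a picture. The bitmap here is my own 6-pixel-wide letter "A", chosen for illustration; it is not any standard character font:

```python
# Images as numbers: each row of a tiny 6-pixel-wide letter "A" is
# one binary number. A 1 bit is a black pixel, a 0 bit is white.
ROWS = [0b001100,
        0b010010,
        0b100001,
        0b111111,
        0b100001,
        0b100001]

lines = []
for row in ROWS:
    # Test each of the 6 bits, left to right, and draw it.
    line = ''.join('X' if (row >> (5 - i)) & 1 else '.' for i in range(6))
    lines.append(line)
    print(line)
```

Change the numbers and the picture changes; shift them, mask them, or add to them and the image moves or transforms. That is all a game’s animation routine is really doing.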
Summary of number types
We have seen that a number can mean many different things. It can be your plain old, everyday number, like Joe’s bank balance or Fred’s weight. It can also be a character, like an "A" or a "%". It could also be a simple "true or false" indicator. It could also be an instruction for the computer to execute. Or it might be a part of an image. There are many other things that a number might mean; it all depends on the context in which the number is taken.
How is it that one number could mean so many different things? Because we can apply so many different contexts to that one number. This is nothing new; we do it all the time with words. Consider the word "dig". My Webster’s Unabridged lists fourteen different definitions for the word. A simple, everyday word like "dig" could be interpreted fourteen different ways. How could you tell which of the fourteen interpretations applied? Only from the context. If you were a foreigner first learning English, you might be angry at such a stupid language that cannot keep its words straight. Yet, as a fluent speaker of the language, you have no problem determining the exact shade of meaning of the word from the context in which it is used. So too it is with computers. They may use a number in many different ways, but the context is always clear. Thus is it possible to breathe meaning into something as meaningless as a number.
Data versus process
Let us look more closely at this concept of context. Exactly how is context established? As in so many things, the question bears the seeds of the answer. The key word to examine is "established". Context is not some static entity that lies on the page the way that data does. No, context must be established, created, or forged. Context is intrinsically part of a process; it is established or created by some activity. Here we encounter one of the most profound concepts of computing: the complementarity of data versus process in the computer.
Data are the numbers inside the computer; process is what the computer does with them. Data are passive, process is active. An idea or a message, though, is composed of both data and process, number and context. Both are necessary to create an idea or message. Oddly enough, the ratio of data to process is not fixed. Any message can be expressed with any combination of data and process. An admittedly contrived example may help make this point. Suppose that I wish to convey to you the scores of six students, and suppose that these scores just happen to be 2, 4, 6, 8, 10, and 12. I could send you the information in a data-intensive form:
2, 4, 6, 8, 10, 12
Or I could send the same information in a process-intensive form:
10 FOR X=1 TO 6
20 PRINT 2*X
30 NEXT X
Both messages convey the same information, but one uses primarily data and the other uses primarily process to convey the same information. Programmers are intensely aware of this process-data duality, and often use it in polishing their programs. If a program is too large and must be made smaller, translate data-intensive portions into more process-intensive forms. If a program runs too slowly, translate process-intensive sections into more data-intensive forms. This is because data consumes space while process consumes time. A sufficiently clever programmer can obtain almost any desired trade-off of space for time by finding the precise trade-off of data for process.
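The equivalence of the two forms is easy to check. Here is the comparison in Python (a translation of the BASIC loop above):

```python
# The same six scores, expressed two ways:
data_form = [2, 4, 6, 8, 10, 12]             # data-intensive: just the list
process_form = [2 * x for x in range(1, 7)]  # process-intensive: the FOR loop
print(data_form == process_form)             # True: identical information
```

The data form takes more space; the process form takes more time to evaluate. That is the space-for-time trade-off in miniature.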
But there is a point many programmers miss. Just because data and process are to a large degree interchangeable does not mean that we should use them without bias. If you regard the computer as a communications medium, then when using a computer, you must always bear in mind the possibility of using another medium to convey your message. Consider, for example, the printed page, one of our most heavily used media. Here is a medium ideally suited for conveying data and quite incapable of directly presenting process. Nevertheless, we are able to use the printed page to convey a great deal of information about the world. It is especially adept at presenting static data. If you want to find the atomic weight of beryllium, the population of Sierra Leone, or some other simple fact, a reference book is an ideal source to consult. On a per-idea basis, there is no medium cheaper, more convenient, and more effective.
But suppose we wish to convey information not about facts, but about events. Now we are getting a little more demanding of the medium, and it does not perform quite as satisfactorily. It manages, certainly, but somehow the description of a complicated sequence of events can get a little muddled and require perhaps a few re-readings before we can understand it.
Now let’s go to the extreme of the spectrum and consider the ability of the printed page to convey information about processes. We find that the medium is certainly capable of doing so, but not very well. How many textbooks have you dragged through, trying to divine the author’s explanation of some simple process, with little success? Look how much work I have had to go through to explain to you the small ideas presented in this book. Because the printed page is a data-intensive medium, it is strongest at presenting data and weakest at communicating processes.
The computer, though, is the only medium we have that can readily handle processes. That is because it is the only medium that is intrinsically interactive; all other media are expository. Indeed, the computer might well be said to be more process-intensive than data-intensive. The typical personal computer can store 512,000 bytes of data, but the same computer can perform approximately 300,000 operations per second. If you let it run for just four hours, it can perform over 4 billion operations, even though it holds the same measly 512,000 bytes. This is not a medium for storing data, it is a machine for processing it.
It follows, therefore, that the ideal application for the computer will stress its data-processing capabilities and minimize its data-storage capabilities. Indeed, if you list the most successful programs for computers, you will see that the key element in all of them has very little to do with data storage and very much to do with data processing. Spreadsheets are a good example; so are word processing programs. Both allow you to store lots of information, information that was once stored with paper and pencil. But the real appeal of these programs is not the way they allow you to store data but the way that they make it easy to manipulate data. Even the most data-intensive application on computers, the database manager, is really not a way to store data but a way to select data.
The moral of this chapter is that data is not information. Numbers without context are useless, meaningless piles of digits. The jerk who tries to intimidate you with lots of numbers is wasting your time unless he can orchestrate those numbers into a coherent line of reasoning. Numbers are only the junior partner in the partnership of information. The senior partner is context, which is derived from the processing to which the numbers are subjected. Concentrate your attention on the context behind the numbers, the reasoning that gives them meaning. Be the master of your own numbers.