by John MacCormick
Most of us spend so much time working on our specialties that we seldom get the opportunity to learn about some of the nifty-keen ideas in other fields. Such is often the case with algorithms. Yes, we all know the basic algorithms, but the world of computers has been profoundly changed by a number of crucial algorithms. This book explains those algorithms in non-technical language. Oddly, it presents only eight algorithms; perhaps it is the fate of computer scientists to sacrifice their ability to count in order to attain the high levels of mathematical sophistication they use.
The first algorithm that Mr. MacCormick tackles is the simplest: the algorithms used to (1) search the Web for pages that match some search criteria. This might seem simple enough, but for the fact that the Internet has billions and billions of pages and organizing all that information into a coherent database is a daunting task. The basic idea is that we have zillions of words splattered all over the Web and we need to make an index of all those words. We store the words in a huge database, and for each word, we record the page address and the location in the page. Take, for example, the word aardvark. Our index will have thousands of instances of that word, each instance recorded in the entry for aardvark, along with its address and location on the page. Then when somebody searches for aardvark, we need merely present the list of entries for that word.
This is the basic design for the first generation search engines, such as Alta Vista. But there’s a second problem: ranking the pages by likely value to the user. A long list of uses of a single word wasn’t a problem in the early days of the Web, when there weren’t many references to aardvarks. Nowadays, when the Web is full of references to Aardvark Pizza, Anthony A. Aardvark, Aardvark Software, and so on, we need more: we need the list of pages to be sorted by the likelihood that they’re what the user wants. This was partly addressed in the first-generation search engines by asking for additional words (e.g., aardvark, mammal) and looking for pages containing both words; a further improvement was to look for cases in which the two words were in proximity to each other.
But Google changed all that with a superior (2) page ranking algorithm. Their basic concept was to look for pages containing the search term that other pages referenced. In other words, in finding the best page for aardvark, they’d look for the page that had the most other pages linking to it. The idea was that a good page will probably be linked to by other pages. This insured that Google searches often put the most useful page at the top of the list. They have continued to improve the algorithm, of course, but this was the starting point.
The third crucial algorithm is (3) public key cryptography, which Mr. MacCormick aptly describes as “sending secret messages on a postcard”. How, for example, can you send your credit card number over the Internet, where it might be read by anybody? It must be encrypted: put into a secret code that bad people can’t read. But how would the recipient read it if it’s in a secret code? This seemed like an impossible task until three computer scientists came up with a truly brilliant solution, now called the RSA algorithm. Sadly, I think that Mr. MacCormack’s explanation of the concept is muddled. He approaches it in stages, bringing in one element at a time; the result is definitely confusing. The core issue is finding an irreversible computation: a calculation that is easy to do but extremely difficult to do in reverse. For example, addition is easily reversible. 4 plus 6 is 10. If you know the answer, 10, and also know one of the two addends (4 or 6), then it is trivial to figure out what the other one was. But what if there were a calculation that takes two numbers, x and y, and combines them to yield z, such that knowing y and z is insufficient to figure out x?
There is no algorithm that is theoretically irreversible; every such calculation can eventually be found, if only by simple brute force trial and error. For example, suppose that we have an algorithm called Encrypt, and if we put x and y into Encrypt, it will spit out z. We could then send y and z to somebody else, and if a bad guy wanted to know what x was, he’d have to try out all sorts of numbers: 0,1,2,3… and so on until he hit one that worked with y to produce z. Since computers can do this kind of thing in a flash, we need an algorithm that will take a long, long, time, much longer than, say, 100 years, even with much faster computers which we expect will be available in a few years.
The basic trick involves prime numbers. If you multiply one prime number (x) by another (y), you get a number (z). The important thing about z is that there is only one pair of prime numbers that, when multiplied together, yield z. But here’s where I get lost. A fourth number, I’ll call it w, is used in the calculation. The guy who’s supposed to receive the secret communication publishes, for all to see, the number y. When you want to send your credit card number to him, you take his number y, then use your private value of w, which you keep secret, along with x to produce z. You send z to the guy and he’s able to use his own secret value of w to run the calculation backwards to figure out the value of x.
OK, I admit that my explanation of the idea is flawed. So is Mr. MacCormick’s. After reading through it several times, I still couldn’t get it. Perhaps I’m stupid. Perhaps I’ve got a mental block. Whatever. It’s not important enough for me to slog through the logic to figure it out.
Moving on to the fourth algorithm: (4) error correction. Here’s the problem: computers are fallible. Hard disk drives sometimes get a bit wrong; sometimes your transmission over the Internet is slightly garbled. What happens when one of the zillions of bits moving around in your computer gets flipped? If there were no error correction, the computer would get screwed up and might even crash. So, how do we prevent errors? The simplest way uses a checksum. Suppose that I want to send you the following list of numbers: 1823, 27, 91, 4. Along with the list, I attach the sum of the numbers: 1945. When you receive the list, you add up all the numbers in the list and check your sum against the sum I sent. If an error was introduced along the way, one of the numbers will be different, and they won’t add up to 1945. If that happens, your computer can ask my computer to re-send the list.
Modern error correction is more sophisticated than this, but this version gives you an idea of the solution: attach something to the data that can be used to verify the correctness of the received data.
Next comes (5) pattern recognition. This is the process that allows a computer to read hand-printed letters and figure out what they mean. The most common practice involves a neural net, a program that simulates a highly simplified (and much reduced in size) version of the brain. The neural net, once completed, goes through a training phase before it can be used: thousands or millions of times it is presented with an input and told what the output should be. Each trial run enables it to hone its internal connections in such a way as to do a better job of recognizing the pattern. After enough trials, it can be presented with new patterns and it will correctly recognize them. Much has been accomplished with this field but it is still undergoing lots of development.
A side note: in 1971 I wrote a neural net program and ran it under many different configurations, coming up with some fascinating results. When I showed it to a computer scientist, though, he told me that it had been mathematically proven that this kind of thing couldn’t work. The disproof was published by Marvin Minsky and Seymour Papert in 1969. It turns out that this paper made some assumptions that aren’t necessarily true in the case of computer neural networks. My technology could indeed have done some wonderful pattern recognition had I developed it further, but my primary interest at the time was physics, so I let it lapse. I still have one idea, though, that I’d like to play with someday.
The sixth important algorithm is (6) data compression. Actually, this is a bunch of different algorithms. The basic idea here is that the raw data we use in our computers has all sorts of redundancy that takes up a lot more memory than is necessary. For example. I stored a photograph in two formats: TIFF and JPG. The first format is an exact, uncompressed rendition of the photograph; it takes up 4.9 MB of space. The same photograph stored in JPG format, a compressed format, requires only 625 KB of space. In other words, the JPG format compresses the image to about one-eighth of its original size – and the two images are indistinguishable to the eye!
Every kind of data has its own patterns and so every kind of data has its own optimal compression algorithms. Text, sound, images, and movies all have their own compression formats, and usually there are multiple formats available for each type, which can make working with them a headache.
Databases comprise the seventh important algorithm (or, in this case, system of algorithms). Databases must operate under stringent requirements. Suppose, for example, that you write a check moving $200 from your checking account to your savings account. The bank’s computer looks at the bank’s accounts database, finds your account, subtracts $200 from your checking account, then adds $200 to your savings account. Problem: what happens if the computer crashes just after subtracting the $200 from your checking account, but before it gets around to adding the $200 to your savings account? In such a case, your $200 simply disappears. Obviously, this is unacceptable, so there needs to be a better way. The basic solution is to set up transaction records. The transaction record for your action would read as follows:
1. change checking account from $1400 to $1200.
2. change savings account from $3000 to $3200.
This goes into a to-do list of transaction records. In normal operation, the computer gets the check, checks your account, and writes the transaction record. Then it executes the transaction record. When it has finished executing it, it deletes the transaction record. The value of this approach is that, if anything goes wrong at any point in the process, the computer can recover. No matter when the computer crashes, when it comes back up, it checks the list of transaction records and begins executing them, and every single one of those changes will be correctly made.
Lastly comes (8) authentication, the process by which your computer can verify that the website it’s talking to really is the website you think it is. What happens if you access your bank account online, but actually somebody has fiddled with the local Internet node and re-routed all bank account accesses to a fake bank website that can harvest all the names and passwords that users like you send it? The solution is something called a digital signature, and it’s nothing more than a certificate declaring that the website you’re talking to really is what it says it is. But what prevents the crook from faking the digital signature? This is really just a version of the encryption algorithm described above. The bank’s computer sends your computer an encrypted verification number that your computer can check with one of the established Internet authorities for its veracity. Because the crook can’t break the encryption, your computer can verify that things are as they should be, confident in the veracity of its connection.
All in all, this is an excellent book. Mr. MacCormick does an excellent job of explaining, in entirely nontechnical language, the basic ideas behind some of the most important algorithms that make the Internet work. His explanation of RSA falls short, I think. But RSA really is tricky business, and I certainly couldn’t do a better job than he did. Thumbs up.