prev : next : index SPEW

April 30, 1999: the cracks are already beginning to show

Friday

A practical problem: I want to make myself some rules about what a reader can expect regarding the malleability of these pages. I want to describe those rules to you so you know what to expect; but first I must make the rules and determine that I am comfortable with them.

I think, as a practical matter, I should guarantee that you never need to go back to an old page, at least not without being explicitly told in a new entry.

On a philosophical point, I'm tempted to leave old pages permanently frozen, but I can see two reasons why I will probably be willing to change things. The most important is to correct errors of fact. If I misreport the date of a historical event, or somebody's name, it seems better to correct it so future readers will not be misled. Also, if on rereading a past entry, I find that I used particularly poor wording--e.g. as written, the language is hard to parse--I may rewrite it for clarity but not change the content.

The deeper philosophical point is that I find the malleability of the electronic self-publishing world disturbing. Usenet was a little different--content wasn't malleable, but it was transient--generally unarchived. The web takes this further. Pages aren't permanent: they can be taken down, moved to a new address, or even simply changed, in part or in whole.

That's radically different from traditional publishing. Oh, sure, publishing companies can pull books from the shelves, but the people who've bought the book don't have to return their copies. Newspapers can print a retraction the day after, but they don't change the original. And I assume that newspapers archived onto microfilm don't get tweaked first. (Yeah, I can download and store locally a copy of the file, but this isn't the natural mechanism nor behavior pattern.)

Our culture and our legal system make assumptions about the nature of publishing. For example, the patent system's concept of prior art assumes that publishing is permanent. One disproves a patent by providing prior art, typically something with the same idea that was published before the patent application was filed. That there may have been prior art, but it's gone now, isn't something the patent system can cope with.

People think the web is so hip because it has hyperlinks. References from one document to another aren't particularly new; check out the bibliography in a non-fiction book for examples. The hip part is that they're "automatic"--they can take you from one place to the other with just a click.

Think again, buddy.

Bibliography entries don't just give the name of a work, and with good reason. They attempt to provide a unique identifier for the work. A crucial element of this is the inclusion of the publisher; it becomes the job of the publisher to keep the remainder of the entry unique.

Skilled computer practitioners may recognize this use of "subnamespaces", familiar from directory trees, GUIDs generated using Ethernet addresses, and most notably, URLs and URIs, which rely on domain names as a highest-level partitioning.

But that's just it. URLs don't uniquely call out a single fixed work. Sometimes their contents change.

In one sense, URLs are radically different because they don't just attempt to provide a unique identifier for a work--they provide a description of a mechanism that allows you to immediately access the document. But, in another sense, this is just a difference of time scale. Inclusion of the name of a publisher in a bibliography entry allows a reader to contact the publisher and acquire the work in question. A publisher might have gone out of business--but then, a domain name might no longer exist, either.

It's clearly worse on the web, though. Geocities recycles addresses when people "move out", as if the appropriate analogy for web pages were telephones or snail mail--not publishing. Different documents can appear at the exact same address at different times; and so hyperlinks fail to come close to providing unique identifiers. (In truth, a timestamped URL provides a unique identifier, but very few user agents expose the underlying timestamp--and the very thing that is hip about the web as it stands--URLs--definitely does not work this way. There's no way to retrieve an article from a timestamped URL. Imagine if I wrote in print somewhere, "as was discussed in Fred's column in the New Zork Times, if everyone in the northern hemisphere were to..."--wouldn't you wonder which of Fred's columns? Which issue of the New Zork Times?)
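
To make the problem concrete, here's a tiny sketch--the URL, the dates, and the documents below are all invented for illustration--of why the address alone doesn't pin down a work:

    # A hypothetical archive keyed by (URL, timestamp) pairs; everything
    # here is made up for illustration.
    archive = {
        ("http://www.example.com/fred/index.html", "1998-06-01"):
            "Fred's column about model trains",
        ("http://www.example.com/fred/index.html", "1999-04-30"):
            "a different tenant's page about gardening",
    }

    # Same URL, two different works: only the (URL, timestamp) pair picks
    # out one of them--and the web gives you no way to ask for the pair.
    for (url, when), contents in archive.items():
        print(when, "->", contents)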

Others might say that this is good. This is the way of electronic publishing. It's a new mechanism, a new metaphor, a new way of living. Roll with the punches.

Screw that, I say. I'm philosophically opposed to changing things, at least if they're the sort of material that traditionally gets published. A published work that cannot be critiqued is nearly worthless. A work cannot be meaningfully critiqued if it changes out from under the criticism, or if only the criticism, and not the original, is available. There are a whole slew of reasons why allowing change is bad.

If I change an old page, I will indicate it at the bottom of the page (after the "navigation toolbar") and note the date of the change.

Translation

It's one thing for data to change, or become inaccessible due to its original owner abandoning it. And yes, our community has yet to come to a general solution to coping with data published by someone who then leaves this world, since as it stands web publishing requires ongoing effort (typically of the financial variety).

But another problem that has gotten some geek media attention lately is that of data rot. The concept of data rot, theoretically unique to computerdom, is that the data is still available, but there is no longer an appropriate program that comprehends the data format. This is most noticeable for "data" that are actually programs--those programs cannot be run on modern machines, only older, soon-to-be-nonexistent machines.

Of course, it's also possible to have a bunch of 5 1/4" Atari 800 floppy disks. The data on them is even more inaccessible, since the physical mechanism required to retrieve the data may well be unavailable as well. But if you can get it off the physical medium--and if the data hasn't actually gone bad there--then it's always possible, at least in theory, to write software to handle the data.

Indeed, this works even if that data is a program.

The process of running a program written for another machine is referred to as "emulation"--the new machine emulates (pretends to be) the old machine. Moreover, once you can emulate a machine, you can run all of the old programs that ran on that machine, including programs that handle data formats specific to that machine or those programs. To avoid data rot, emulation is all you need.

There's a booming emulation subculture right now for playing old videogames on modern computers. This seems to have come about largely because modern desktop machines are finally powerful enough to pull off the emulation.

Why is that? As it turns out, computers have been designed to be programmed by being fed a sequence of instructions over time, each instruction telling the computer what to do. The computer carries out the instruction, and proceeds to the next one. The instruction sequence is written in a "language" called the "machine language", because it's the language the physical machine is able to directly execute. When a software program emulates such a machine, it normally works by examining each instruction in turn and then performing a number of operations which simulate its effect on a virtual representation of the emulated computer. This general process of examining instructions and doing processing for them is termed interpretation. Because many of the emulating machine's instructions must be executed to interpret a single instruction of the emulated machine, execution is much slower under emulation, unless the emulator is much faster than the emulatee.
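
Here's a rough sketch of that interpretation loop, for a made-up three-instruction machine--the instruction set below is invented for illustration and doesn't correspond to any real hardware:

    # Interpret a program for an imaginary machine with one register.
    def interpret(program):
        acc = 0                    # the emulated machine's only register
        pc = 0                     # program counter into the old instruction stream
        while pc < len(program):
            op, arg = program[pc]  # examine the next emulated instruction...
            if op == "LOAD":       # ...and simulate its effect:
                acc = arg          #   put a constant in the accumulator
            elif op == "ADD":
                acc += arg         #   add a constant to the accumulator
            elif op == "PRINT":
                print(acc)         #   write the accumulator to output
            pc += 1                # proceed to the next instruction

    interpret([("LOAD", 2), ("ADD", 3), ("PRINT", 0)])   # prints 5

Note how many host-machine steps--the loop test, the fetch, the comparisons--go by for every single emulated instruction; that's where the slowdown comes from.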

A few rare examples have come up in which the sequence of instructions for the original machine is not left intact. Instead, it is converted "once and for all" into a new stream of instructions for the new machine. This process typically results in much higher performance once the program is running, because the interpretive overhead is removed. This has most famously been done not in the emulation community, but for the DEC Alpha, which could run programs written for VAX and MIPS hardware by this conversion process, which was termed "binary translation".
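
For contrast, here's the same made-up instruction set converted up front into host code--a toy sketch of the idea, not of how the Alpha's translators actually worked:

    # Translate the whole program once into host (Python) code, then just
    # run the result; no per-instruction dispatch is left at run time.
    def translate(program):
        lines = ["def translated():", "    acc = 0"]
        for op, arg in program:                 # one pass over the old instructions
            if op == "LOAD":
                lines.append("    acc = %d" % arg)
            elif op == "ADD":
                lines.append("    acc += %d" % arg)
            elif op == "PRINT":
                lines.append("    print(acc)")
        namespace = {}
        exec("\n".join(lines), namespace)       # compile the generated code once
        return namespace["translated"]

    run = translate([("LOAD", 2), ("ADD", 3), ("PRINT", 0)])
    run()   # prints 5, with the interpretive overhead paid once, up front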

Unlike "interpretation", this is a natural borrowing from plain English; "translation" of programs is conceptually similar to "translation" of human languages. In fact, to use Hofstadterian terminology, my personal physical hardware isn't capable of understanding German, Japanese, or Latin, so I rely on somebody else doing the translation for me before I can "run the program".

The obvious difference is that there are native speakers of German and Japanese out there. Those aren't "old, soon to be non-existent machines".

But Latin is already dead.

We preserve writings from Latin in both their original form and in their translated form. Preserving them in their original form keeps us from "losing something in translation"--but our understanding of the language through anything other than those writings themselves dwindles over time regardless. Preserving them in their translated form makes them directly accessible to average humans--at least the ones who care.

We copy hieroglyphs off a column just as we scavenge some data off some soon-to-be-unusable floppies. We keep Latin-English dictionaries and grammars at the same time that we write Commodore 64 emulators.

Data rot isn't new.


prev : next : month : index : home
May 4, 1999: Added analogy about "Fred's column" and mentioned domain names explicitly
attribution dammit: Misplaced Rendezvous Marillion