SPEW

May 7, 1999

Friday

PROGRAMMING GEEKINESS TODAY! BEWARE!

SPEW is written using a bizarrely glorified macro language that I originally wrote to reduce the effort I have to invest in writing HTML documents, and then evolved to provide the technology needed to maintain a web journal.

I think you might find it interesting to learn how it works and why it works. If you don't care about writing HTML, and you don't care about how computer programmers think, then you might prefer to move along.

SPEW: The Design and Evolution of a Macro Language

The program I use is called spew, but it grew out of an even simpler program called macro, which I wrote more than two years ago to solve the problem of writing a technical paper in HTML.

I had always found HTML a little tedious for typing, but had never found that a WYSIWYG editor gave me everything I needed--they tended to be too bulky and cumbersome for the very minimal HTML I do.

So I decided to write my document in another language and then "render" that to HTML. After further thought, I decided that I would be best served by simply using a macro language. For each final file, I would have a single source file. The "compiler" to convert between them would simply go through the source file, looking for particular strings ("macros"), and mindlessly replacing them with other strings.

A macro language has the problem that it can never make major structural changes. HTML structures require start and end tags, and there's no way a macro language can get around this. Indeed, I chose to make my macro language parameterless, exacerbating that problem. With parameters, one can at least do simple block structures; to use C's macro preprocessor as an example, one can imagine representing boldfaced text by saying

   #define BOLD(x)  <b>x</b>
or representing hotlinks using
   #define LINK(x,y)  <a href=x>y</a>

On the other hand, using a macro language to reencode typesetting commands means the macro language is also available there for the "end user". When writing the aforementioned paper, I wanted to always consistently typeset a particular symbol in a particular way. Because I was using a macro language, I just defined a macro which was that symbol with that typesetting, and always used the macro.

The Clever Concept

What I decided I really wanted was to be able to write my HTML so it looked more readable in plain text as raw source code. And the most obvious idea I had for this was that if I wanted to write something boldfaced, I wanted to do it by using the classic textual device for emphasis:

if I wanted to write something _boldfaced_, I wanted to do

Note that I find this far easier to type than the equivalent HTML tags, and much much easier to read.

This led very naturally to a design: I needed arbitrary symbols like _ to be the name of a macro; they had to be able to be stuck anywhere in text (unlike, say, the C preprocessor where they follow identifier syntax rules), and they had to be able to have more than one meaning. The first _ above expands to "<b>" while the second expands to "</b>".

The only things which required some thought were how to allow for the definition of multiple meanings, how one of them is chosen at runtime, and how to cope with more than two.

There are drawbacks, of course. I might still want the "_" symbol to actually appear in my text, and it now becomes harder to type, but I use emphasis more often than I use underscores. I could always choose a different symbol if that assumption turned out to be wrong. (Similarly, I use pairs of @ for italics.)

The question is how I'm going to be able to type _. The classic solution is the use of an "escape" character. Some character is chosen as the special escape character. When used, the character immediately following it is "escaped": it is printed normally, instead of having its normal effect. In C strings and many unix shells, the \ symbol is the escape character. In HTML, & is used to escape character literals, resulting in the familiar constructions &amp; and &gt; since & and < have special meanings in HTML.

Abstraction

I did something tricky in the design of macro. I dodged the escape character question. In macro, no particular character is reserved as the escape character. Any particular user can devise a scheme for escaping things by hand, using macros.

The way this works is by allowing two classes of macros. In the first "normal" kind, macro substitution happens within the text of other macros. That means I can write a macro whose output includes a "_", and it'll be converted to a "<b>" or "</b>" when it is output. The second kind of macro is the "raw" macro. The output of a raw macro never has any further expansion.

That means that if I define a raw macro whose output is "_", I can use that macro anytime I want _ to appear in my output; it will never get further expanded, and I'll get a _.
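The difference between the two kinds of macro can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual source of macro: the function name `expand`, the macro names `WARN` and `USCORE`, and the dictionary layout are all made up for the example. A normal macro's output is put back in front of the remaining input and scanned again; a raw macro's output goes straight to the result.

```python
def expand(text, macros):
    """Expand macros in `text`, scanning left to right.

    `macros` maps a name to (is_raw, substitutions). Substitutions are used
    in turn, wrapping around. Raw output is emitted as-is; normal output is
    put back in front of the remaining input and scanned again.
    """
    uses = {name: 0 for name in macros}   # how often each macro has fired
    out = []
    while text:
        for name, (is_raw, subs) in macros.items():
            if text.startswith(name):
                sub = subs[uses[name] % len(subs)]
                uses[name] += 1
                rest = text[len(name):]
                if is_raw:
                    out.append(sub)        # never looked at again
                    text = rest
                else:
                    text = sub + rest      # rescan the substitution
                break
        else:
            out.append(text[0])            # ordinary character, copy it
            text = text[1:]
    return "".join(out)


macros = {
    "_": (True, ["<b>", "</b>"]),      # raw: emits the tags verbatim
    "USCORE": (True, ["_"]),           # raw: a literal underscore (hypothetical name)
    "WARN": (False, ["_Warning:_"]),   # normal: its underscores get expanded
}
print(expand("WARN file USCORE name", macros))
# prints: <b>Warning:</b> file _ name
```

Note that a normal macro whose output contains its own name would loop forever in this sketch; nothing here guards against that.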

Now let me talk about the actual syntax for the macro language. For reasons of implementation simplicity, the macro model resembles that of the C preprocessor. macro looks for lines which begin with a period followed by one of a small set of words. It then interprets the entire line as a command. Here is how boldfacing is handled.

   .raw  "_"   "<b>"  "</b>"

In this case, we have a ".raw" command followed by three "strings", whose syntax is directly cribbed from the language C. (Essentially, any text can appear within the quotation marks; a literal quotation mark is rendered by escaping it with a backslash, and a backslash is rendered by using two backslashes.)
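As a rough illustration of that command syntax, here is a small Python sketch that splits a command line into the command word and its quoted strings, handling backslash escapes inside the quotes. The function name `parse_command` is an assumption for the example, and the real program's parser may well behave differently on malformed input.

```python
def parse_command(line):
    """Split a command line into (command, [strings]).

    Each string is enclosed in double quotes; inside the quotes a backslash
    escapes the next character, with t, n and r standing for tab, newline
    and carriage return.
    """
    escapes = {"t": "\t", "n": "\n", "r": "\r", '"': '"', "\\": "\\"}
    parts = line.split(None, 1)
    if not parts:
        return "", []
    command = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    strings = []
    i = 0
    while i < len(rest):
        if rest[i] != '"':
            i += 1                  # skip whatever separates the strings
            continue
        i += 1                      # step past the opening quote
        chars = []
        while i < len(rest) and rest[i] != '"':
            if rest[i] == "\\" and i + 1 < len(rest):
                i += 1              # take the escaped character instead
                chars.append(escapes.get(rest[i], rest[i]))
            else:
                chars.append(rest[i])
            i += 1
        i += 1                      # step past the closing quote
        strings.append("".join(chars))
    return command, strings


print(parse_command('   .raw  "_"   "<b>"  "</b>"'))
# prints: ('.raw', ['_', '<b>', '</b>'])
```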

The command is ".raw", meaning no further macro substitution occurs. The macro that is defined is one for "_"; appearances of the symbol _ anywhere in the file (outside of commands to macro) result in a substitution. The remaining two strings define the substitution text. Each time an underscore is encountered, one of the two strings is substituted. The next time, the opposite string is substituted.

If there were three substitution strings, each one would be used in sequence, and then the sequence wraps around to the first. This turns out to be a general solution to the simple "multi-parameter" problem, for example to handle hotlinks:

   .raw   "|"    "<a href="    ">"    "</a>"
   |foo|bar|  |biz|baz|
-->
   <a href=foo>bar</a>  <a href=biz>baz</a>
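The cycling behaviour in the hotlink example can be reproduced with a short Python sketch. It assumes every macro is raw (no rescanning of the output), and the name `expand_cycling` is invented for the illustration rather than taken from the program.

```python
def expand_cycling(text, macros):
    """Replace each macro name with its next substitution, wrapping around.

    `macros` maps a macro name to a list of substitution strings.
    The output is never rescanned, i.e. every macro behaves like ".raw".
    """
    counters = {name: 0 for name in macros}  # next substitution per macro
    out = []
    i = 0
    while i < len(text):
        for name, subs in macros.items():
            if text.startswith(name, i):
                out.append(subs[counters[name] % len(subs)])
                counters[name] += 1
                i += len(name)
                break
        else:
            out.append(text[i])  # no macro starts here; copy the character
            i += 1
    return "".join(out)


links = {"|": ["<a href=", ">", "</a>"]}
print(expand_cycling("|foo|bar|  |biz|baz|", links))
# prints: <a href=foo>bar</a>  <a href=biz>baz</a>
```

One consequence of keeping a single counter per macro is that an odd number of `_` or a stray `|` shifts every later substitution, so an unbalanced source line corrupts everything after it.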

Note that because it's a parameterless macro language, the text has to include the elements in the same order as HTML. There's no way to define the macros so that you could give the link description and then the link destination, rather than vice versa.

The actual macro I use saves me from having to put " around the destination, but as a result is a little hard to read, due to \ escaping in the strings:

   .raw   "|"    "<a href=\""    "\">"    "</a>"

In general, though, using the " breakup to delimit the strings allows for spaces within macros and macro texts, and allows arbitrary whitespace between the strings, which makes blocks of macros a little easier to read.

I then allow escaping within the text by defining a collection of "pre-escaped" macros, which would be a lot easier to read if I hadn't chosen to use the same escape character as was used inside strings:

  .raw "\\\\"  "\\"
  .raw "\\_"   "_"
  .raw "\\@"   "@"
  .raw "\\|"   "|"
  .raw "\\."   "."

Allowing parameters would have complicated things too much, since there would have been a need for some way of delimiting macro parameters, and hence restricting the general use of symbols. So I avoided parameters. Instead, there is only a single special context in which explicit delimiters occur--during a macro definition.

A Complete Description

macro has only three commands. ".raw" is used to define a macro without internal substitutions. ".define" is used to define a macro which does allow internal substitutions. ".include" is used to tell macro to start reading from another file, and to come back when finished. The most important use for this is to allow the general macro definitions to be put in a separate file, where they can be shared by other documents. It is also sometimes nice to break up a large source file into multiple chunks. And if I have a standalone program which generates a table, I can ".include" its output into my file and not have to edit the source file when it changes. Finally, macro concatenates lines which end with a trailing \; since macro definitions must be on only one line, this provides a mechanism for allowing longer ones.
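The line-concatenation pass is simple enough to sketch on its own. The Python below is an assumption about how such a pass might look, with an invented function name; in particular it treats any trailing backslash as a continuation, which may or may not match what macro does with a line that ends in an escaped backslash.

```python
def join_continued_lines(lines):
    """Merge each line ending in a backslash with the line that follows it."""
    joined = []
    pending = ""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\\"):
            pending += line[:-1]   # drop the backslash, keep accumulating
        else:
            joined.append(pending + line)
            pending = ""
    if pending:
        joined.append(pending)     # input ended on a continuation line
    return joined


source = ['.define "x" \\\n', '"a" "b"\n', 'text\n']
print(join_continued_lines(source))
# prints: ['.define "x" "a" "b"', 'text']
```

Running this before any command parsing means a long definition can be spread over several physical lines and still reach the parser as one line.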

That's it. That's all there is to the program. On the other hand, my use of the program for HTML has a lot more to it.

Macros are only really useful if you only use a small subset of all of HTML. For example, my substitution for "|" lets you create hotlinks, but you can no longer provide other options for the HTML tags; for example, you can't set the name of an anchor. On the other hand, you can always do such things the hard way (typing the tags by hand); as long as you do them rarely, it's a net win.

Here is a complete list of what I do with macro for general HTML work:

On occasion, I've stolen | and used it for <td> instead of anchors. This allows tables to be presented in a very visually satisfying way in the source text file.

A Trick

Sometimes I wish macros weren't parameterless. In the cases where I don't use the macro very often, but it stands for a lot of typing so it's worth using it on those few times, there's a nice hack to get around it. Macros can be redefined--given entirely new definitions. Additionally, macros can be substituted within other macros.

If I want to provide a macro with the boilerplate for the header, but allow a title and a background color, I simply require the user to define the title and color as macros before invoking the boilerplate macro. The only thing is that it can't be a raw macro. To avoid having to use too many escape characters, I work around this by defining submacros.

In macros.txt:
   .define ".html"      "PRETITLE PAGETITLE POSTTITLE COLOR POSTCOLOR"
   .raw    "PRETITLE"   "<html><head><title>"
   .raw    "POSTTITLE"  "</title></head><body bgcolor=\""
   .raw    "POSTCOLOR"  "\">"
In file:
   .include "macros.txt"
   .define  "PAGETITLE" "My cool page"
   .define  "COLOR"     "ff8080"
   .html
This works, of course, because the macro substitution does not happen during a macro definition. PAGETITLE isn't substituted at the time the ".define" is processed; it's substituted at the time the macro is used. (This is sometimes taken to be the distinguishing characteristic between macros and other string substitution techniques.)
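The late substitution can be demonstrated with a small Python sketch. It is an illustration under assumptions, not the real program: the `expand` function and the two dictionaries are invented, each macro has a single substitution, and the template is written without separating spaces so that none leak into the output. The template uses the three submacros that macros.txt defines (PRETITLE, POSTTITLE, POSTCOLOR).

```python
def expand(text, normal, raw):
    """Expand `text` left to right.

    `normal` and `raw` map macro names to a single substitution string.
    Output of a normal macro is scanned again; output of a raw macro is not.
    """
    out = []
    i = 0
    while i < len(text):
        names = list(normal) + list(raw)
        name = next((n for n in names if text.startswith(n, i)), None)
        if name is None:
            out.append(text[i])
            i += 1
        elif name in raw:
            out.append(raw[name])
            i += len(name)
        else:
            # Splice the substitution into the input so it is scanned again.
            text = text[:i] + normal[name] + text[i + len(name):]
    return "".join(out)


raw = {
    "PRETITLE": "<html><head><title>",
    "POSTTITLE": '</title></head><body bgcolor="',
    "POSTCOLOR": '">',
}
# The template is stored as plain text; nothing inside it is looked up yet.
normal = {".html": "PRETITLEPAGETITLEPOSTTITLECOLORPOSTCOLOR"}

# The "parameters" are defined afterwards, as in the example file.
normal["PAGETITLE"] = "My cool page"
normal["COLOR"] = "ff8080"
print(expand(".html", normal, raw))
# prints: <html><head><title>My cool page</title></head><body bgcolor="ff8080">

# Redefining a "parameter" changes what the same template produces.
normal["PAGETITLE"] = "Another page"
print(expand(".html", normal, raw))
# prints: <html><head><title>Another page</title></head><body bgcolor="ff8080">
```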

This is clumsier than macros with parameters, but it's "free"--I didn't have to do any work to support it. And since the whole point of this system is to minimize my work, it doesn't make sense to invest more effort for cases that only happen infrequently.

Subtleties

As computer programmers might expect, macro processes text from beginning to end. That means that if I have a macro definition for "foo" and a macro definition for "oob", and I have a file with the text "foob", the "foo" will get substituted first. For example, .define "foo" "bar" will result in "foob" becoming "barb". "barb" does not contain "oob", so "oob" is never substituted.

You might wonder what happens if I have both the macro "fo" and the macro "foo" defined when the text has "foob". The specification for macro says that "foo" is substituted--if a match is found, the longest match at that location is used. I haven't gotten around to implementing this, though, because it's better to avoid using such definitions, as it becomes impossible to say "I want the macro 'fo' immediately followed by the letter 'o'", unless you escape the 'o'. However, you have to manually define all escapes, so you can't just write "fo\o"--you have to go define "\o" yourself. This doesn't actually come up much in practice, so it's not a big deal. My (technically buggy) program simply uses the most recently defined match that starts at a given location, although it would be straightforward to fix it.
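The two matching rules discussed above, the specified longest match and the implemented most-recently-defined match, differ by a single line in the following Python sketch. The function names and the `policy` argument are assumptions made for the illustration, and output is never rescanned here to keep the comparison short.

```python
def match_at(text, i, names, policy):
    """Return the macro name chosen at position i, or None.

    `names` is a list in definition order.
    policy "longest": the longest name that matches at i (the specification).
    policy "latest":  the most recently defined name that matches at i.
    """
    candidates = [n for n in names if text.startswith(n, i)]
    if not candidates:
        return None
    if policy == "longest":
        return max(candidates, key=len)
    return candidates[-1]


def expand(text, macros, policy="longest"):
    """Left-to-right expansion with one substitution per macro, no rescanning."""
    names = list(macros)   # dicts keep insertion (definition) order
    out = []
    i = 0
    while i < len(text):
        name = match_at(text, i, names, policy)
        if name is None:
            out.append(text[i])
            i += 1
        else:
            out.append(macros[name])
            i += len(name)
    return "".join(out)


# Scanning from the left, "foo" is consumed before "oob" can ever match.
print(expand("foob", {"foo": "bar", "oob": "X"}))
# prints: barb

# With "fo" defined after "foo", the two policies disagree.
overlapping = {"foo": "LONG", "fo": "SHORT"}
print(expand("foob", overlapping, "longest"))  # prints: LONGb
print(expand("foob", overlapping, "latest"))   # prints: SHORTob
```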

Minimality

In some sense, I think macro provides exactly the minimal set of pieces necessary to accomplish the goal. It also provides exactly the right pieces so that by providing a "hidden" intermediate layer (e.g. "macros.txt"), the end result is convenient and easy-to-use. I'm not saying that maintaining those macros would be easy for a casual user; nor am I saying a casual user would be wise to use the system without understanding the macros, since things might bite them once in a while. All I'm saying is that it does work for me, and yet it's an extremely simple system--at least on the programming side.

Macro thus reminds me of the computer programming language Forth--achieving minimality while providing a crucial hook for layering, that the system might be effectively leveraged. The maintenance nightmare of the real underlying code is minimized; the exposed "language" is at exactly the right level for the problem at hand, and allows new technologies to be synthesized on top of it. Yet it has nothing HTML-specific in it. The only hint of HTML-orientation in it is the multiple-meaning macros, but this could be of use for nearly any typesetting language.

In some ways, it's annoying. It's syntactically ad hoc--e.g. it's tuned for hand-parsing rather than using something like yacc. But it's very simple, and has no need to grow and get more complex, so that seems somehow appropriate.

But then came spew...

(more tomorrow)


Summary of macro syntax and behavior from the source code:

   .define "<macro name>" "<sub1>" "<sub2>" ...
         replaces <macro name> with <sub1>, <sub2>, etc. in
         turn, e.g.:   .define "foo" "bar" "baz"
        input  foo food fold foo
      becomes  bar bazd fold bar

   .raw "<macro name>" "<macro substitution>"
         like '.define', but expansion doesn't recurse

   .include "<file name>"
         includes the above file

 Macros can't cross lines.

 Lines with \ at the end are concatenated as a first pass (before
   parsing macros, so use \ to avoid the first limitation)

 Maximum line length, both after line concatenation and
   during macro substitution, is 32K.

 Macros are concatenated in place before checking again (e.g.
   a macro "zd" will be substituted in the example for .define)

 Strings have internal \ escaping a la C, but macro names
   can't have embedded newlines.  (\t, \n, \r, \", \\)
