The 3D Software Rendering Technology of 1998's Thief: The Dark Project

Sean Barrett

In 1998 Looking Glass Studios released the stealth game Thief: The Dark Project. This was just as 3D hardware acceleration was taking off, so due to the development cycle it didn't use hardware acceleration; it was a purely software-rendered game.

I was the primary author of the core rendering technology in Thief (although I didn't write the object or character renderers), as well as some related bits and pieces. The same rendering engine, modified by others to use 3D hardware acceleration, also did the rendering for System Shock 2 and Thief 2.

The engine was written somewhat contemporaneously with Quake (despite the game being released much later), and the basic appearance strongly resembles Quake. Many of its technologies were copied from or inspired by Quake, but in many cases the way it works is slightly or significantly different.

The Quake software rendering was thoroughly documented by Michael Abrash in a series of articles which were reprinted in his Graphics Programming Black Book. The techniques used in Thief were never written up and I thought it might be nice to write them down for once, even if they're now totally irrelevant. I'll try to describe them relative to the probably-more-familiar Quake approaches where possible.

Important contemporaneous games with similar rendering technology:

Looking Glass games prior to Thief had used grid-based worlds. System Shock 1 and the Ultima Underworlds computed visibility using a grid-based traversal. Thief, however, was totally free-form, so I had to start from scratch to design the Thief engine (which started long before Thief, and I imagine even before we shipped Terra Nova: Strike Force Centauri, which I worked on).

Thief was based on the portal-and-cell idea for visibility and occlusion published by Seth Teller in his 1992 PhD dissertation. In fact, the Thief world renderer was called "Portal", but since that name has a newer popular meaning I'll just call it "Thief" or "the Thief engine" (but there was far far more to the Thief engine than just the renderer, such as the object system, AI, physics, and I had nothing to do with them; I didn't write "the Thief engine", just the renderer).

I was originally exploring this idea as research for an imagined holy grail for Looking Glass: developing an "indoor-outdoor" CRPG, since the previous LGS CRPGs (Ultima Underworld 1 & 2, System Shock 1) had all been "indoor" dungeons, but we also had this outdoor tech for Terra Nova, and many of us felt we'd want to eventually make a hypothetical Underworld 3 or something which would have dungeons and exteriors and buildings and what not (this was before Daggerfall). In trying to imagine how we could put some kind of a hole in the Terra Nova terrain to incorporate dungeons, I realized that portals could be used to seamlessly integrate multiple independent renderers (like an indoor renderer and an outdoor renderer) by putting portals at the boundaries, so I wrote up a long document over one Christmas vacation documenting my thoughts along those lines. Then I guess I had all these ideas bouncing around in my head, and I knew we didn't have any idea how to do non-grid stuff, and it seemed viable to just try to make a full indoor renderer using portals... so I went ahead and tackled implementing one as research.

The portal-and-cell idea was used offline by Quake for computing its precomputed PVS ("potentially visible set")--also described in Seth Teller's dissertation--but Thief used it at runtime; Thief precomputed the portals and cells, but didn't precompute any further visibility information. (I believe Descent, which was released before I even started writing the Thief engine, had already done this, but I don't know the details.)

The Thief engine kept a representation of the level in which the open areas (the areas you could see and/or walk) were divided into convex polyhedra known as cells. Each cell contained zero or more visible "world polygons" along the boundaries of the cell that were the visible surfaces of the world. Cells which adjoined other cells had special boundary polygons called "portals" which marked the adjacency between the cells. If you could see into a given cell, then you could see into adjoining cells through the portals to them.

To determine what part of the level was currently visible and needed rendering, the engine conducted a breadth-first traversal of the portals and cells starting from the location of the viewpoint. As each portal was traversed, the cells reached were added to the visible list. Each considered portal was transformed in 3D, checked if it was back-facing, and if not it was projected to 2D. Then a "bounding octagon" was generated, which consisted of a regular 2D bounding box and a 45-degree rotated bounding box.
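The traversal skeleton can be sketched as follows. This is a minimal reachability walk with invented data-structure names; the real engine also accumulated per-cell 2D bounds during the walk, as described next, and would revisit a cell when a new path widened its visible region.

```python
from collections import deque

def visible_cells(start_cell, portals_of, dest_of, portal_visible):
    """Breadth-first traversal of cells through portals.

    portals_of[c]     -> portal ids leaving cell c      (invented names)
    dest_of[p]        -> the cell portal p leads into
    portal_visible(p) -> False if p is back-facing or its projected
                         2D bounds turn out to be empty
    """
    seen = {start_cell}
    order = [start_cell]            # cells in discovery order
    queue = deque([start_cell])
    while queue:
        cell = queue.popleft()
        for p in portals_of.get(cell, ()):
            if not portal_visible(p):
                continue
            nxt = dest_of[p]
            if nxt not in seen:     # the real engine also revisits cells
                seen.add(nxt)       # to widen their visible bounds
                order.append(nxt)
                queue.append(nxt)
    return order
```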

If a portal led from cell A to cell B, then we compared that portal's bounding octagon to the bounding octagon of the visibility of cell A. If the two didn't overlap, then the portal wasn't actually visible. If it was visible, then the intersection of cell A's octagon and the portal's octagon was the part of cell B that was visible through this portal, so the engine added that to the list. (It was possible to see into cell B through multiple paths, so the engine had to accumulate all the visible regions, which it did by simply keeping a conservative bounding octagon of all the incoming path octagons. If you had two small view paths into one cell, say at opposite ends of the cell, the bounding octagon would be much bigger, typically the size of the whole cell.) The cell containing the viewpoint was always fully visible.
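Since a bounding octagon is just an interval along each of four 2D axes (x, y, x+y, x-y), both the overlap test and the conservative merge are per-axis interval operations. A sketch, with invented names:

```python
AXES = ((1, 0), (0, 1), (1, 1), (1, -1))   # x, y, x+y, x-y

def octagon_of(points):
    """Bounding octagon of 2D points: one (lo, hi) interval per axis."""
    octa = []
    for ax, ay in AXES:
        vals = [ax * x + ay * y for (x, y) in points]
        octa.append((min(vals), max(vals)))
    return octa

def intersect(a, b):
    """Octagon bounding the overlap; None if they provably don't overlap."""
    out = []
    for (alo, ahi), (blo, bhi) in zip(a, b):
        lo, hi = max(alo, blo), min(ahi, bhi)
        if lo > hi:
            return None
        out.append((lo, hi))
    return out

def union(a, b):
    """Conservative bound of both octagons (used to merge view paths)."""
    return [(min(alo, blo), max(ahi, bhi))
            for (alo, ahi), (blo, bhi) in zip(a, b)]
```

Note that both results are conservative bounds, which is exactly what the engine wanted: a cell reached along two paths just gets the merged, looser octagon.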

Quake precomputed similar information: for each cell, Quake stored the list of all other cells that were visible from anywhere in that cell. This result could be less tight; for example, if you had a long corridor with many side rooms, and that long corridor was a single cell, then in Quake it would always try to render all the side rooms, whereas Thief would only try to render the entry cell to each of the side rooms (since those cells would always be adjacent to the corridor and visible), but Thief could cull the rooms themselves if they weren't currently visible.

Full cell-portal-analysis didn't scale well to higher polygon counts, so the engine was limping by the time the last game shipped with it (in other words, it had a lot of overhead when hardware accelerated).

Reducing overdraw

Quake reduced overdraw by using a span-buffering ("edge list") technique in which world surfaces were (conceptually) drawn front-to-back and each pixel was only drawn exactly once.

Thief drew polygons back-to-front, so they did overdraw each other. Thief's overdraw was already reduced compared to Quake's "naive" (pre-edge-list) overdraw because of the tighter bounds from traversing the portals dynamically, as described above.

Thief further reduced overdraw by clipping each rendered world polygon by the bounding octagon for the visibility of the cell containing that polygon. (This is part of why bounding octagons were used, as they significantly reduced overdraw in some cases compared to just using bounding boxes. In the limit if we had clipped to the exact portals leading to each cell, this would have resulted in drawing each pixel only once as well.) I don't have any recollection of what our typical overdraw was using this approach.

For this to work, Thief had to store world polygons in cells. This meant that world polygons that extended through multiple cells had to be split; Quake was often able to preserve those as single polygons since it stored the polygons in a BSP tree directly on split planes. I'm not sure how much difference this made, though, since both Thief and Quake also had to split polygons for surface caching.

The 2D clipping to the bounding octagon meant that many Thief polygons ended up being rendered as 8-sided polygons. This wasn't a problem. In fact, clipping was straightforward and efficient since it was purely 2D along simple axes, and there were no S,T (i.e. U,V) texture coordinates on polygon vertices, since Thief used a technique I'd previously published in the PC Game Programmer's Encyclopedia in which the texture coordinates were defined as a 3D vector basis independent of the vertices, and then texture coordinates for spans and pixels were computed directly from the basis vectors.
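The basis-vector idea can be illustrated like this. This is a simplified sketch with invented names: the real mapper derived per-span gradients from the transformed basis vectors rather than evaluating points one at a time, and I'm assuming orthogonal texture axes.

```python
def uv_from_basis(q, origin, m, n):
    """Texture coordinates of a world-space point q on a polygon whose
    mapping is defined by an origin plus two basis vectors m, n
    (assumed orthogonal here) -- no per-vertex S,T coordinates needed."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    d = tuple(qi - oi for qi, oi in zip(q, origin))
    return dot(d, m) / dot(m, m), dot(d, n) / dot(n, n)
```

Because (u, v) is a function of position alone, clipping a polygon never has to interpolate texture coordinates: any new vertex produced by the clipper just gets its coordinates from the plane mapping.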

The skybox was drawn just by tagging polygons with a special texture, then when drawing it, taking the transformed 2D polygon and clipping it to each skybox polygon and using the skybox texture mapping for each of them, or something like that. I don't remember 100%.

Object culling

As objects were moved around the level, the engine would keep track of which cells each object was in. It did this incrementally, using the cells the object was previously in to guide the current decision. This meant that this part of the engine could actually handle arbitrary unrealistic topologies (in which portals could lead to spaces in inconsistent ways, and fold space back on itself), since the "physics" only performed these local movement operations. But we never had a way to build fully 3D levels like that, later a BSP tree was used to sort the cells (which then required them to be consistently defined), and the actual Thief game engine never supported it anyway. (We did have a way to make such things in 2.5D: one thing I did prior to this engine was write a portal-based Doom-style engine, one that supported spaces-over-spaces by allowing multiple overlapping rooms, which could be at different heights (I believe Dark Forces used a similar technology) or the same height (since the height didn't actually affect how the Doom-style engine worked). With help from some coworkers I also had an editor for it, so I was later able to use those levels in my new fully-3D engine. But we never had a way to make fully 3D levels with this property.)

After discovering all the cells that needed rendering, Thief would decide which objects needed rendering. Only objects that were in visible cells might need rendering. However, Thief also computed a 2D bounding box (or probably octagon) of each potentially visible object and compared it against the bounding octagon for any cells containing the object. (Quake had nothing like this since it didn't process portals at runtime.)

Because the Thief engine was better about rejecting invisible cells entirely, and was able to reject objects that weren't currently visible even if the cells containing them were, Thief could generally handle more object-dense worlds than a renderer made with Quake technology could have. You couldn't necessarily have more visible objects at once, but you could put more objects in overall as long as you limited where you could see them from, and so Thief generally was able to have well-stocked kitchens and dining tables and sideboards and such.

But this was just dumb luck as I hadn't chosen these algorithms for that reason in particular, especially as the engine development long preceded the eventual morphing of the game design into Thief.

Object sorting

Quake rendered objects by generating a z-buffer from the world polygons (even though the world polygons themselves didn't z-test), and then testing and updating that z-buffer when rendering objects.

Thief didn't use a z-buffer. Instead, Thief drew the world back-to-front ("painter's algorithm"), and interleaved rendering of world polygons and objects to make them appear to occlude each other appropriately.

Many games in the software rendering era (which rarely used z-buffers) exhibited sorting issues where objects would become visible through walls or invisible behind walls they shouldn't be, and portals/cells didn't offer any special guarantees that would avoid this problem. (For example, Descent exhibited such problems.)

This really bothered me and I worked very hard to fix it. The sorting algorithm in Thief is the most complicated painter's algorithm sorter I've ever heard of; I remember having to write a little mathematical proof at one point for one part of it.

I don't remember the details, but I'll try and sketch out some of the issues and how it probably worked.

Originally the cells were sorted based on the portal-traversal-discovery order, which provided a topological front-to-back sort order that could be reversed for back-to-front rendering. However, there were some issues and so by the time Thief shipped the cells were actually being sorted by a BSP tree. This meant that the sort order was very far from a breadth-first search; if you were near the rootmost split plane, the draw order could feature very-close-to-the-viewer cells drawn before very-far-from-the-viewer cells if the cells and viewer happened to fall on the appropriate sides of some BSP split plane.

Because of the BSP tree there was no danger that the world polygons would render in the wrong order, but there was some danger that objects could sort wrong with respect to each other or the world polygons. To avoid this, the Thief engine (again, I'm kind of guessing here) conceptually numbered the cells in the order it would draw them. An object in cell N would normally render immediately after drawing the (inward-facing) world polygons in cell N and before drawing the walls of cell N+1. However, sometimes objects would lie in multiple cells. An object in cell M and N, with M < N, would need to be drawn after all the walls of cell M, but the parts of it in cell M might be obscured by walls in cell N or indeed in any of the cells between M and N. (A common case that came up: a corridor (cell) with a niche (cell) holding a torch, with the torch slightly protruding into the corridor. The niche is M, and the corridor is N--e.g. if the viewpoint is in the corridor, then the corridor is "closer" to the viewpoint in a back-to-front rendering sense. For example, one of the walls of the corridor will occlude parts of the niche.)

To deal with those complexities, Thief would consider whether it was acceptable to draw a given object in its frontmost cell N (or, indeed, at any point in the draw order between its rearmost and frontmost). To do this it would compute bounding octagons on the objects and on the world polygons and see whether the polygons occluded the objects. If a world polygon was supposed to be "in front" of an object and the bounding octagons intersected, then it wasn't safe to move the object later in the rendering order to a point later than that world polygon.

The Thief engine would resolve the range in which it was valid to draw each object, between the backmost and frontmost cells containing that object; an object that was only in one cell would always be considered as just being in that cell. Once this was determined, Thief then attempted to resolve object-vs-object sorting. Because an object might extend through multiple cells and then be behind or in front of an object that was in a single cell, objects that were only in one cell might need to be moved back or forward in the sort order anyway; in other words, the final range of cells after which an object might be considered for rendering could be larger than just the span between its backmost and frontmost containing cells.
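Since the author is explicitly reconstructing this from memory, here is one plausible formalization of the per-object constraint, before the object-vs-object pass. The representation and names are invented; indices are positions in the back-to-front cell draw order.

```python
def legal_range(containing_cells, occluder_cells):
    """Draw-slot range (lo, hi) for one object (invented formulation).

    containing_cells: positions, in the back-to-front cell draw order,
                      of the cells the object is in.
    occluder_cells:   positions of cells holding a world polygon that
                      must appear in front of the object and whose
                      bounding octagon overlaps the object's.

    'Drawn at slot k' means drawn just after cell k's world polygons,
    so the object must land at a slot before any such occluder."""
    lo, hi = min(containing_cells), max(containing_cells)
    for c in occluder_cells:
        hi = min(hi, c - 1)       # can't be drawn later than that wall
    return (lo, hi) if lo <= hi else None   # None: no legal slot
```

A `None` result corresponds to the unsortable cases discussed below, where the engine had to fall back to splitting the object.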

Thief attempted to use the above ranges to find some sort order for objects, holding the cell sort order fixed, so that objects and world polygons would all inter-sort correctly. (This was where that proof came in.) However, sometimes this was impossible. For example, in the torch-in-the-niche case described above, it's possible for one world polygon in cell N to be occluded by the torch (the wall with the niche on it extending behind the niche), but another world polygon in N to occlude the niche and part of the torch (the wall with the niche on it extending forward towards the viewer). In this case, there is no sort order of objects vs. cells that works, because parts of the torch occlude the cell while parts of the same cell occlude the torch. It would have been fixable by interleaving the world polygons from the cell with the torch, instead of always drawing all the world polygons from one cell as a unit, but not all cases could be fixed that way, so Thief actually never did that (it always drew all world polygons from one cell as a unit) -- unless I'm remembering wrong.

Instead Thief had an entirely different, extremely expensive mechanism that it fell back on to solve difficult sorting problems. Each object could be "split" into multiple pieces, one per cell. In each cell, only the part of the object which was contained in that cell was rendered. This wasn't done by actually splitting the object, but rather by rendering it multiple times, once for each "piece", and for each piece dynamically clipping the object with a "user clip plane", using similar technology to the frustum clipping that was also being done. (In complex cases, multiple clip planes would be needed; however, it was only ever necessary to clip to portals between the cells the object was in, and not literally clip to all the planes defining each cell.) If this technique was applied, then that part of that object could always be drawn in or after that cell. However, although it didn't increase the per-pixel costs, it required transforming and clipping the object multiple times, so it was rather expensive. (I encouraged designers to place their torches fully inside the niches instead.) This was particularly bad when rendering characters, which were skinned, since I think the skinning transformations would be performed multiple times.
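The per-piece clipping is ordinary convex-polygon-against-plane clipping, the same machinery as view-frustum clipping. A minimal Sutherland-Hodgman-style sketch of clipping one polygon to one plane (Thief's version of course ran over the object's real geometry, and to several portal planes in the complex cases):

```python
def clip_to_plane(poly, normal, dist):
    """Clip a convex polygon to the half-space dot(normal, p) <= dist.

    Keeps vertices on the inside and inserts an intersection vertex
    wherever an edge crosses the plane."""
    def side(p):
        return sum(n * c for n, c in zip(normal, p)) - dist
    out = []
    count = len(poly)
    for i in range(count):
        a, b = poly[i], poly[(i + 1) % count]
        da, db = side(a), side(b)
        if da <= 0:
            out.append(a)                       # a is inside: keep it
        if da * db < 0:                         # edge crosses the plane
            t = da / (da - db)
            out.append(tuple(pa + t * (pb - pa) for pa, pb in zip(a, b)))
    return out
```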

This also had another problem: if you had three cells arranged so that the portals between them formed a T, then the dynamic clipping could create t-junctions at the border. These would then cause cracks in the rendering. I don't remember how/if we fixed this.

During development I noticed that if you used all the walls of a cell for user clip planes--not just the portals--then if an object interpenetrated a wall, it would be clipped away, looking exactly like a z-buffer rendering. Normally these artifacts arose because the game physics allowed an object to interpenetrate briefly, so it was actually better not to clip those objects, and then they wouldn't look (from most angles) like they were going through the wall. But we could've used an effect like that to allow creatures that could move through walls even though we had no z-buffer.

After my positive experience with user clip planes I was frustrated over the next few years that GPUs didn't efficiently/easily support them. (Even in D3D9, they're still done in the pixel shader; although you can use z and stencil to cull more tightly and reduce pixel shading costs.) Of course, with the z-buffer, it's not clear there are lots of use cases, but there's a chicken-and-egg problem to that conclusion.

Light and shadow

The core technique for rendering texture mapped polygons with decent precomputed shadows is very similar (identical?) to Quake. Thief used light mapping and a surface cache to store the lit surfaces. I refer you to Mike Abrash's articles.

My recollection is that we added this to the engine after we (the Thief team) saw the Quake "QTest" release. However, I'm doubtful there really was a Thief team at that point. According to Wikipedia, QTest came out February 24, 1996. Terra Nova came out on March 5, 1996, so I guess we had built our final version of Terra Nova by the time QTest was released, but I can't imagine we'd geared up a whole team yet. In fact I don't know for sure when exactly I did the original research version of this engine.

At one point objects raycast to all (or the top N) light sources to determine whether each light could reach the object, and turned that light on/off over the entire object to simulate objects going in and out of shadow. But I don't know if we shipped with this or with the "check the lightmap value on the floor" approach, which I seem to recall we at least considered using for the player's visibility to guards (to avoid players being tricked/confused by places that appeared on casual inspection to be in shadow).

CSG model

Thief levels were created using Constructive Solid Geometry (CSG) methods, based again on the knowledge that that was how Quake worked. Technologically, though, Thief worked very differently from Quake.

The Thief CSG model was "temporal", which maybe I can better explain by analogy to Photoshop. In Photoshop, you have layers. Stuff placed in one layer obscures stuff in lower layers, except if you use more interesting blend modes, in which case stuff in a layer changes the stuff visible below it but has no effect on things above it.

Algorithmically, you can think of generating the final image by processing each layer sequentially, from back to front, compositing it with the "image so far". By the time each layer is processed, the image-so-far contains the accumulated effects of all the earlier (lower) layers; then the layer is composited, the image-so-far is changed, and that layer can have no further effect except through the data it left in the image-so-far.

Normally we think of the photoshop layer model as a stack of 2D layers, but you could instead think of the above algorithm as the model, and think of the layers not as a "vertical" stack but as an ordered sequence of operations to perform. This is what I mean by a temporal model, and this is the model that Thief used for its CSG. (If the vertical photoshop layer stack is a vertical third dimension of the 2D images, then the Thief layering model would be a fourth dimension on the 3D shapes, and thinking about this as four-dimensional space is not very effective.)

The Thief CSG took as input a series of "operations" arranged in an order over time. Those operations were "brush placements", where a brush was a 3D convex solid plus a characterization of how the area covered by that solid would be changed by the operation. The entire space started solid, so one brush operation was "carve out a hole in this area"--in other words, "change the area covered by this brush to open". For example, you would use this to carve out a room. Another placed solid matter; you could use this to create a pillar. Another placed water, and another lava. Because space could be of 4 types (solid, air, water, or lava -- oh hey, the 4 classical elements!), each operation could be characterized by which output type it produced. Moreover, we allowed these operations to be selective. For example, the "flood brush" turned air into water, but left all other types alone. This made it easier to fill an area with water--you could construct it entirely with air and then fill the lower half with water. Because of the temporal aspect, you could then go and change some of the water back to "air" if you needed to. It would have been possible to make brush types that were much more complicated (air turns to water, water turns to solid, solid turns to air, and lava unchanged), but this wasn't actually useful, so I think all the brush types were of the unconditional-single-type or conditional-single-type variety.
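The brush semantics described above can be sketched as a tiny interpreter over leaf cells. This is a toy model with invented names and representation; the real engine, of course, operated on convex polyhedra, not an abstract cell list.

```python
SOLID, AIR, WATER, LAVA = range(4)   # the four media

def apply_brush(media, cells_covered, op):
    """Apply one temporal brush operation to the leaf cells it covers.

    media: current medium of every leaf cell (the world starts all SOLID)
    op:    either an unconditional target medium, or a (src, dst) pair
           for a conditional brush like the flood brush (AIR -> WATER)."""
    for c in cells_covered:
        if isinstance(op, tuple):
            src, dst = op
            if media[c] == src:      # conditional: only convert src
                media[c] = dst
        else:
            media[c] = op            # unconditional: overwrite
```

For example: carve air out of a solid world, then flood the lower cells, and only the carved cells inside the flooded region become water.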

As with Photoshop layers, the temporal aspect was something under user control; you could move a brush forwards or backwards in time. Unfortunately, this was much harder to visualize (there was no "layers" window) and I'm not sure designers were really thrilled with this.

In comparison, Quake started with a level that was all open space and placed solids. It used a more-traditional-CSG explicit-subtraction operation. Water was handled by making all water brushes occur first temporally, so that the effective temporal sequence of operations was: all air => some air turns to water => all solids are placed. Because subtraction was explicit on the solid brushes before they were added to the world, the subtraction couldn't "accidentally" erase water, so there was no need for an explicit temporal model (the set of things you could do in Quake was more limited, but it supported almost everything that mattered in practice--although some designers came up with some strange sequences of operations in Thief to create more complicated shapes).

Because Quake levels started empty, Quake had invisible "exterior" surfaces that required a separate process to detect and eliminate. If the level wasn't watertight, then the exterior could be reachable and the automated tools couldn't eliminate it. In contrast, because the Thief level started solid, this was never necessary. (I think Unreal's CSG may have started as solid as well.)

CSG implementation

I had no idea what I was doing when I implemented the Thief CSG system (despite having the opportunity to pepper John Carmack with questions), so I made a terrible choice. The Thief system worked by dividing the world up into convex polyhedra and keeping track of each polyhedron's type (air, solid, water, or lava), which I call the "medium". To do this, I stored the world as a solid BSP tree; the BSP tree classifies all of space into convex polyhedra (except at the edges, where it may have unbounded shapes extending to infinity).

Using a BSP tree for this had a performance advantage, but I didn't do it for that reason; I did it because I literally couldn't think of any other way to compute the output. This way I was able to sequentially add each brush to the BSP tree and then apply the medium-transformation-operations to each BSP leaf that the brush contained.

But reading the Quake source, it turns out you can do it a different way: intersect every brush with every other brush directly, without a spatial data structure. By being careful, you can build a growing list of convex polyhedra, all of which are non-overlapping. You can then add a spatial data structure to accelerate determining which ones might overlap, but without affecting the computation itself.

The difference is that with the BSP-tree-based CSG, a brush early in the processing would be inserted in the tree early and would introduce a BSP split plane near the root which then extended across the whole level or a significant part of it. That plane might randomly come close to an edge or vertex of a totally unrelated brush, causing an extra split in that brush and introducing additional epsilon problems. As CSG already involves painful numeric geometric algorithms with epsilon problems, introducing this extra uncontrollable problem was terrible. The Thief editor was (is?) notorious for having weird problems where a change to one brush might trigger a failure in the CSG generator at an entirely unrelated place in the level--a failure which meant the level simply failed to "compile".

Partway through Thief development I switched everything in the CSG engine from floats to doubles and cranked down the epsilon, which made things better but didn't properly solve the problem; in hindsight I realize it would have been much, much better to have simply avoided the BSP tree entirely.

The epsilon problems were exacerbated by the crazy way I built the polygons and portals directly from the BSP tree in a t-junction-free way, by computing a shared mesh along each BSP split plane to guarantee that vertices on one side of the split plane would always have a matching vertex on the other side; this introduced a much more complicated set of invariants that had to be maintained, which also had epsilon problems. It meant I didn't have to write a post-processing t-junction eliminator the way Quake did, but in hindsight that would have been better.

Perspective texture mapping

Looking Glass games prior to Thief such as System Shock 1 and Terra Nova had used a "lines of constant-Z" mapper to perform perspective texture mapping. Those games typically used the perspective texture mapper for large, near polygons, but used an affine mapper for distant polygons.

Thief used a custom-written perspective mapper for all polygons. It used the trick used in Quake of issuing a floating-point divide for perspective correction for every N pixels which then executed in parallel with the texture mapping of the next N pixels; this worked because the Pentium would execute the floating-point divide "in the background" if you only used integer instructions after it (the Pentium was not an out-of-order processor in general).

Thief computed the perspective correction every 8 pixels (that is, N above was 8). Thief aligned the correction so it occurred on pixels whose x coordinates were multiples of 8. The beginning and ending of a span of pixels could be less than 8, and would call a general-purpose N-pixel mapper, but the 8-pixel spans called a routine that was hardcoded to compute exactly 8 pixels.
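Since u/z, v/z, and 1/z are all linear in screen space, the scheme amounts to dividing once per 8 pixels and interpolating affinely in between. A toy sketch of that arithmetic (names invented; the real inner loops were the assembly routines shown later, with the divide overlapped with the previous sub-span's pixels):

```python
def perspective_span(x0, x1, uz, vz, iz, duz, dvz, diz, N=8):
    """Perspective-correct (u, v) for pixels x0..x1-1, doing the true
    divide only every N pixels and stepping affinely in between.

    uz, vz, iz are u/z, v/z, 1/z at x0; duz, dvz, diz are their
    per-pixel steps (all of these are linear in screen space)."""
    out = []
    x = x0
    while x < x1:
        n = min(N, x1 - x)
        u0, v0 = uz / iz, vz / iz            # divide at sub-span start
        uz2, vz2, iz2 = uz + n * duz, vz + n * dvz, iz + n * diz
        u1, v1 = uz2 / iz2, vz2 / iz2        # divide at sub-span end
        for i in range(n):                   # affine in between
            t = i / n
            out.append((u0 + t * (u1 - u0), v0 + t * (v1 - v0)))
        uz, vz, iz = uz2, vz2, iz2
        x += n
    return out
```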

Because the engine changed over time and was initially used for research, it supported two different formats that textures could be stored in. One was restricted to 256x256 textures, or rather textures with powers-of-two sizes and in which the stride/pitch was always 256. This used an inner loop similar to the one I wrote up for the PCGPE. The second one supported arbitrary non-power-of-two textures but didn't support wrapping. This is the one we switched to when we added surface-cached lighting, because padding those textures' widths to multiples of 256 would have been far too wasteful. I believe the "step table" indexed-by-carry-flag trick used in this mapper came directly from Mike Abrash.

Details for x86 programmers:

Here is the code for the second pixel of the non-power-of-two texture-mapper-unrolled-8-times:

        adc    ebx,edi
        add    eax,ebp

        sbb    edi,edi   ; save v carry
        add    esi,edx   ; update u

        mov    byte ptr _pt_buffer+1,al
        mov    al,[ebx]

        mov    edi,_pt_step_table[edi*4+4]

Here is the code for the second pixel of the 256x256 texture-mapper-unrolled-8-times (not used in the shipping game, unless maybe it was used for water surface):

        add    esi,edx
        mov    ecx,ebx

        adc    bl,dl
        add    eax,ebp

        adc    bh,dh
        mov    al,[edi+ecx]

        mov    byte ptr _pt_buffer+1,al
        and    ebx,0babebeach

The spacing of both pieces of code shows that they were optimized for the Pentium; each one uses a nominal 4 cycles per pixel. (They would also both trigger "partial register stalls" on the Pentium Pro.) The Pentium had a 2-cycle "AGI" stall: you couldn't use a register as an address for 2 cycles after it was computed, so you can see the texture fetch (from ebx or edi+ecx) is designed to compute those registers two cycles before the fetch. The full 8-pixel routines use 32 and 33 Pentium pairs of instructions for 8 pixels (although there's also setup to get the right values into the right registers).

The 0babebeach at the end there is a 0xdeadbeef-style constant that relied on "self-modifying" code to update the constant, a common trick on the 486 and Pentium. Those processors had an "and ebx,memory_location" instruction, but it wasn't single-cycle, whereas "and ebx,#immediate" was single-cycle. This meant there were 9 places in the code that had to be modified (this is an 8-unrolled loop, plus the non-unrolled loop), but only when the texture was changed.

Unfortunately, I totally missed out on the possibility of reblocking the texture to improve cache usage, which might have allowed significant performance improvements (I don't know); I've heard that Build and other engines did do this, and since the surface cache in Thief was built in 8x8 blocks anyway it probably wouldn't have been hard to support on that end.
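To illustrate what reblocking means, here is the addressing for an 8x8-tiled layout (8x8 chosen to match the surface cache's blocks; this is illustrative, not Thief's code): nearby texels in both x and y land near each other in memory, so a small screen-space neighborhood touches only a few cache lines.

```python
TILE = 8   # matches the 8x8 blocks the surface cache was built in

def tiled_offset(x, y, width_in_tiles):
    """Linear memory offset of texel (x, y) when the texture is stored
    as a row-major grid of TILE x TILE tiles."""
    tx, ty = x // TILE, y // TILE               # which tile
    return ((ty * width_in_tiles + tx) * TILE * TILE
            + (y % TILE) * TILE + (x % TILE))   # offset within tile
```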

Texture-mapping effects

Looking Glass games generally demanded flexibility, and graphics engines faced with that need for flexibility often had a combinatorial explosion of texture mappers to handle all the different cases. The texture mappers I wrote for Thief sacrificed a little performance for the sake of simplicity, maintainability, and maximum flexibility; the texture mapping routines above ended with an indirect branch which would branch to a routine that would decide what to do with the 8 pixels that had just been written into a temporary buffer.

There were two kinds of functions that could be invoked at this point. The most common kind wrote the pixels to the "framebuffer" (actually just a block of RAM somewhere that would be blitted to the screen later) and returned from the original call that invoked the texture-mapper chain. The other kind read the pixels from the temporary buffer, modified them, wrote them back, and then jumped to yet another function. In theory you could chain arbitrarily many of these processing functions, though this never turned out to be necessary.
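A C approximation of that chaining might look like the following (the real code chained with indirect jumps in assembly, and these stage names are illustrative, not the engine's):

```c
#include <stdint.h>

/* Each stage gets the 8-pixel temp buffer; a processing stage modifies
   the pixels and invokes the next stage, while a terminal stage writes
   to the "framebuffer" and ends the chain. */
typedef struct Stage Stage;
struct Stage {
    void (*run)(const Stage *self, uint8_t *pixels, uint8_t *dest);
    const Stage *next;    /* NULL for terminal stages */
    const uint8_t *clut;  /* used by lookup stages */
};

/* Processing stage: remap pixels through a clut, then chain onward. */
static void stage_clut(const Stage *s, uint8_t *px, uint8_t *dest)
{
    for (int i = 0; i < 8; i++)
        px[i] = s->clut[px[i]];
    s->next->run(s->next, px, dest);
}

/* Terminal stage: copy the 8 pixels to the destination buffer. */
static void stage_store(const Stage *s, uint8_t *px, uint8_t *dest)
{
    (void)s;
    for (int i = 0; i < 8; i++)
        dest[i] = px[i];
}
```

The appeal of this structure is that the texture mappers themselves never need to know about lighting, translucency, or cluts; adding a new effect means adding one small stage rather than a new mapper for every combination.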

The codebase had a bunch of these functions:

and a few more pre-baked combinations of those, plus a few routines to generate Gouraud-shading values into the temporary lighting buffer. (And two versions of each of the above--one unrolled for 8 pixels, and one for an arbitrary N pixels, to handle the beginning and end of each span.)

Most of those weren't used, I think. Paletted lighting in particular required setting up the palette in a particular way; I used it in the original research engine to demo a Gouraud-shaded level running at 640x400 at 20-ish fps on a Pentium 90. The level was a retextured version of Doom 1 E1M1, using textures made by Kurt Bickenbach. He and I iterated a bit trying to figure out how to make textures that would look good and not too banded in 8-bit with paletted lighting, and we got some visually pleasing results, but in the end paletted lighting was too limited, so we abandoned it. (It was too limited artistically, and it couldn't do pitch black properly; once we ended up with a shadow-oriented game it was clearly wrong, but I think it had been abandoned long before due to the first issue.) Abandoning it meant abandoning that frame rate/resolution target as well, but in the long run this didn't matter much: LGS games were complicated enough that we never spent 95% of the frame time in the renderer anyway, and the game didn't ship until much later regardless.
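My reading of "paletted lighting" here is the classic ramp-palette trick: arrange the 256-color palette as brightness ramps so that shading a texel is pure arithmetic on its palette index, with no per-pixel table lookup. That reading, and the particular ramp layout below (8 ramps of 32 levels), are my assumptions for illustration; the real palette arrangement may well have differed.

```c
#include <stdint.h>

/* Assumed layout: palette index = ramp_base + brightness_level, with
   32 brightness levels per ramp. Shading scales the level by the light. */
enum { RAMP_LEVELS = 32 };

static uint8_t shade_texel(uint8_t texel, uint8_t light /* 0..31 */)
{
    uint8_t ramp_base = texel & (uint8_t)~(RAMP_LEVELS - 1); /* which color ramp   */
    uint8_t level     = texel & (RAMP_LEVELS - 1);           /* current brightness */
    level = (uint8_t)((level * (light + 1)) >> 5);           /* scale by the light */
    return (uint8_t)(ramp_base + level);
}
```

The banding problem mentioned above follows directly from this scheme: each ramp has only 32 shades, and every texture color must sit on one of the ramps.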

The engine supported specifying an arbitrary color look-up table ("clut") for each portal, coloring everything seen through that portal with that clut. This wasn't applied as a post-process on the portal surface, which would have cost extra "fill rate"; instead, cluts were stored (and concatenated into new cluts) and applied while rendering the surfaces seen through the portal. The effect could have been used to make force fields and such things, although it ended up underutilized. (Note, though, that if a single cell was visible through multiple paths, and the paths had different sequences of cluts, there was no correct way to render it; the Thief engine chose randomly. There was probably no good solution to this problem, but since it was never exercised, it didn't matter.)
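Concatenating cluts is just table composition. A minimal sketch, assuming 8-bit paletted pixels and that the clut nearer the camera is applied last (the function and argument names are mine, not the engine's):

```c
#include <stdint.h>

/* Compose two 256-entry cluts into one: looking through the near portal
   at something beyond the far portal applies far's clut first, then
   near's, so the combined table is one lookup through the other. */
static void concat_cluts(const uint8_t near_clut[256],
                         const uint8_t far_clut[256],
                         uint8_t result[256])
{
    for (int i = 0; i < 256; i++)
        result[i] = near_clut[far_clut[i]];
}
```

Because the composition is precomputed once per portal traversal, surfaces deep behind several tinted portals still cost only a single table lookup per pixel.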

During development, this was used to make water "foggy", where the more underwater portals away something was, the more opaque the water became. However, this looked pretty terrible, as it showed the portal boundaries clearly; although we had all gotten used to it since it had been in the engine for years, I eventually turned off the portal-varying underwater fog and replaced it with a fixed underwater fog.

It was also used to color the water surface: the water surface actually used a purely transparent texture, and surfaces underwater were colored while they were rendered, instead of having to use the "translucent" read-modify-write mapper. This meant the object renderers also needed to support rendering with a clut; I don't remember the details since I didn't work on those, but it was probably something we tended to support by default anyway, so I doubt it was problematic for those renderers.

I'm not sure what happened for objects that stuck through the water surface; I guess those were forced to use the dynamic user clip plane path. Because Quake used a z-buffer, it couldn't have both a distant underwater wall and the water surface visible at the same pixel (the z-buffer could only store one depth), so the Quake water surface was opaque, at least in the software renderer. I'm not sure what Quake-derived engines like Half-Life did; one expensive solution would be to render the world except the water, then render the objects, then finally render the transparent water tested against the z-buffer (much as we do with hardware). That z-buffered transparent water would have cost much more per pixel than Thief's approach, although Thief spent more effort per triangle/vertex by having to render objects multiple times.

In Thief, most world surfaces simply used the arbitrary-pitch no-wrapping sampler and the plain store-to-memory writer.

I imagine that the object renderers used the shared LGS software rendering libraries and so didn't use these mappers at all.

Not covered here

I didn't write all the graphics technology used in Thief; I'll call out the things I didn't write below. I associate each of them with someone at LGS who did the work, but I'm not sure the people I can think of were the only contributors to each effort, so rather than risk leaving important people out, I'm just not going to name names where multiple people might have been involved.

The 3D camera system, vertex transformation, and clipping were part of a shared technology library used by all LGS products. (Thief may have been the first customer of that tech--previous games having used fixed-point so they could run on x86 computers without floating-point acceleration.) Thief batched together all the vertices used in a single cell and transformed them all at once, whether they belonged to visible polygons or portals, which was possible since the 3D vertex API allowed separating things out like that. (Sort of like compiled vertex arrays.)
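The batching amounts to running every vertex in the cell through the camera transform in one pass, then letting polygons and portals index into the transformed array. A minimal sketch, with the matrix layout (3x4 row-major, rotation plus translation) assumed for illustration:

```c
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

/* Transform all of a cell's vertices (polygon and portal vertices alike)
   into camera space in one pass; m is a 3x4 row-major matrix whose last
   column is the translation. */
static void transform_cell_verts(const float m[12], const Vec3 *in,
                                 Vec3 *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        Vec3 v = in[i];
        out[i].x = m[0]*v.x + m[1]*v.y + m[2]*v.z  + m[3];
        out[i].y = m[4]*v.x + m[5]*v.y + m[6]*v.z  + m[7];
        out[i].z = m[8]*v.x + m[9]*v.y + m[10]*v.z + m[11];
    }
}
```

Shared vertices between adjacent polygons in the cell thus get transformed exactly once, which is the same win compiled vertex arrays later provided in hardware APIs.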

As noted before, general object rendering presumably also went through shared libraries. Object polygons were sorted using a BSP tree; James Fleming had written a very clever system that significantly reduced the number of BSP node decisions that needed to be made, based on the fact that single-sided polygons often couldn't occlude each other from any angle. (For example, I believe a polygonal torus, which is an object that self-occludes and needs sorting if the polygons are two-sided, can be statically sorted if it's made of single-sided polygons.)

Most importantly, Thief used character skinning with skeletal animation to render its characters, and I never touched a single line of code of that system.

I quit Looking Glass for a time; during my absence hardware acceleration support was added. (I came back and added support for colored lighting in the lighting generator and maybe the surface cache or something.)