Entries from August 2009 ↓
August 28th, 2009 — joaf
Here’s the pseudocode for the beginning of the pixel processing from last time:
- ; find the bits of S&T that index into the sprite map
mapx := span.S[19..12] & map.S_mask
mapy := span.T[19..12] & map.T_mask
- ; first half of computing the address of the sprite map index
address := map.base + mapx
address_row := (mapy << map.yshift)
- ; compute the sprite map index address and initiate a read of memory
MEMORY.address := address + address_row
MEMORY.read_8bits = 1
- [wait for read to complete; latch result into 'readresult']
- ; compute the address of the sprite data and the current pixel
mapx := span.S[11..6] >> map.S_shift
mapy := span.T[11.6] >> map.T_shift
address_row := readresult << map.tile_shift
- ; continue above
address := map.image + address_row
mapx := mapx << map.pixel_shift
address_row := mapy << map.pixel_rowshift
- ; continue above
address := address + (mapx | address_row)
Within each step, all of the operations occur in parallel; by chance none of them happen to read and write to the same register in a single cycle, so it’s non-obvious, but normally I could have a single item that said ‘x := y; y := x;’ as two statements, and the point is that x gets the y value from the previous clock, and y gets the x value from the previous clock.
Now, let’s take the above fragment of state machine logic and pipeline it. To pipeline it, each state in the above sequence becomes its own little section of the hardware with its own inputs and outputs. Each pipeline step needs to know about everything about the current pixel. The stuff in the current pixel is stored in the structure named ‘span’ in the pseudocode, so that stuff can no longer be “global”–each pipeline stage must access a different pixel’s data.
We’re going to assume all the pixels in the pipeline come from the same span; that means all the references to the ‘map’ structure above stay the same. Let’s look at the first couple steps, and rewrite things to ignore the memory access timing problem (assume we’re getting the data from a fast cache and ignore what happens if the cache misses):
- ; find the bits of S&T that index into the sprite map
mapx := span.S[19..12] & map.S_mask
mapy := span.T[19..12] & map.T_mask
- ; first half of computing the address of the sprite map index
address := map.base + mapx
address_row := (mapy << map.yshift)
- ; compute the sprite map index address and initiate a read of memory
map_cache.address := address + address_row
map_cache.read_8bits = 1
- ; compute the address of the sprite data and the current pixel
newmapx := span.S???[11..6] >> map.S_shift
newmapy := span.T???[11.6] >> map.T_shift
tile_address := map_cache.result << map.tile_shift
Much of this is exactly as it was before; however, the numbers are no longer meaningful. Before the number indicated separate stages controlled by a state machine; now every one of those operations is occuring at the same time, and the numbers are just to show the correspondence, and to show how long it takes the data to have flowed down to a given stage.
The main change in the code is that we don’t reuse any registers; we’re just going to make new registers at every stage. Remember how I said each pipeline stage is its own little box with inputs and outputs? The registers actually live at the boundary between the pipeline stages–the output of stage N becomes the input to stage N+1.
(You can’t reuse registers because everything happens at the same time; if two different stages wrote to the same register (trying to reuse it), they’d both be writing to the same register simultaneously, which you can’t do.)
Technically, as described so far, you just need an array of simple flip-flops, i.e. a buffer, to serve as these input/outputs, not a register. A register adds the ability to keep its old value instead of updating. In normal execution, we never need this ability, but if the pipeline needs to stall waiting for memory, then we need to freeze the state of the pipeline before the stage that’s blocked — which means latching the state in the registers.
But what about those span.S??? things? The problem here is that if we’re pipelined, then by the time the data flows to stage 4, the S and T variables will have been incremented 3 times (since it takes 3 cycles to flow to there). At stage 4, we need what are now old S & T values. The audio way to think of this is that we need a delay line. The simple way to think of this is that each pipeline stage must have all the information for the pixel available, and if that info varies per pixel, it needs to be a pipeline input–which means it needs to output by the previous register.
So naively we do this:
- ; find the bits of S&T that index into the sprite map
mapx := span.S[19..12] & map.S_mask
mapy := span.T[19..12] & map.T_mask
S1 := span.S
T1 := span.T
- ; first half of computing the address of the sprite map index
address := map.base + mapx
address_row := (mapy << map.yshift)
S2 := S1
T2 := T1
- ; compute the sprite map index address and initiate a read of memory
map_cache.address := address + address_row
map_cache.read_8bits = 1
S3 := S2
T3 := T2
- ; compute the address of the sprite data and the current pixel
newmapx := S3[11..6] >> map.S_shift
newmapy := T3[11.6] >> map.T_shift
tile_address := map_cache.result << map.tile_shift
Remember how last time I mentioned we wanted to minimize the number of registers and the total register bits in the system, and that’s why I reused variables? That’s still true here. We propogate S & T through the system with far too many bits given that all we’re going to do at the end is extract a few of those bits and pack them together.
So what we should do is extract the needed bits as soon as possible. And I’ll make the bit counts in the registers explicit, since we’re trying to minize them:
- ; find the bits of S&T that index into the sprite map
mapx[8] := span.S[19..12] & map.S_mask
mapy[8] := span.T[19..12] & map.T_mask
S1[6] := span.S[11..6] >> map.S_shift
T1[6] := span.T[11.6] >> map.T_shift
- ; first half of computing the address of the sprite map index
address[24] := map.base + mapx
address_row[24] := (mapy << map.yshift)
pixoff2[12] := (T1 << map.S_count) | S1
- ; compute the sprite map index address and initiate a read of memory
map_cache.address[24] := address + address_row
pixoff3[12] := pixoff2
- ; compute the address of the sprite data and the current pixel
tile_offset[24] := map_cache.result << map.tile_shift
pixoff4[12] := pixoff3
Eventually, further down the pipeline, we’ll have to write things out, so another thing we’ll need to propogate down the pipeline is the current pixel’s x location. On the other hand, every x location will be one more than the previous one (or one less if the sprite is reflected horizontally), so it would be more efficient to store the span’s x location into an x register deep in the pipeline, latch it until the pipeline flow reaches that stage, and then just increment or decrement it there.
On the other hand, because we have block RAM available, it’s also possible we could try to store some of our “registers” to block RAM–specifically, when we compute something (or access it at the beginning) but won’t use it for some number of stages, rather than cascading series of hand-coded registers (which will use a bunch of FPGA resources), we can store all the data to a circular buffer in block RAM, and just propogate along the appropriate index into the circular buffer from which the relevant data can be loaded.
Block RAM has 1 read and 1 write port, which means that on any given cycle, you can write to it once and read from it once. So the above technique can only be used for data that is only accessed in one phase of the pipeline; the 1 write/cycle is used when initializing the “register” (either loading it from the initial pixel state, or setting it on the fly from some computation), and the 1 read is the read from one phase of the pipeline. It can’t be read from two phases of the pipeline, because the pipeline is pipelined, so all the stage happen at the same time (just accessing different pixels). In fact, each conceptual register must be stored in a separate block RAM, because we can only read/write once per cycle from a given block RAM, so if we tried to store multiple registers in one block RAM they’d end up being accessed from separate stages which will need to index different spots in the circular buffer.
Since the total number of block RAMs is limited, and this would only use a small amount of the storage available in each block RAM (I think they’re 4Kbits, or maybe 18kbits?) thus wasting most of it, I’m going to proceed with the design using explicit (non-block-RAM) cascaded registers. (This is, as far as I know, how real hardware has to implement pipelining.)
August 25th, 2009 — joaf
Last time I described the basic abilities of the pixel pipeline in terms of how each pixel from a span is generated one after the other, and the various properties each pixel can have.
This time I want to start drilling down a little and talk about what this looks like “in hardware” (or at least in the FPGA).
The most important thing to understand about hardware is that the magic comes from doing more than one thing at a time. Where traditional programming is inherently sequential, clock-based hardware is inherently parallel — every operation on the chip will happen every clock cycle.
Most hardware algorithms need to use sequential algorithms to accomplish their tasks, so they’re divided into a sequence of steps. A piece of “control logic” can then enable and disable each task on every clock; the upshot is that one small part of the hardware might be set up to have three steps, and to execute them in order, 1, 2, 3, and then repeat starting at 1. Typically such a process would work as follows: step 1 accepts input from the outside and process it to some temporary values; step 2 operates on those temporary values and produces new temporary values; step 3 operates on those temporary values and then passes them along to some other unit.
These controlled sequences may have variations; they might do different things after step 2 depending on some control inputs from elsewhere in the system, or based on the values of the data. In general a piece of localized, locally controlled sequential logic like this is called a “state machine”; it’s a machine that goes through a sequence of states which determine what it does.
If we look at the pixel pipeline defined last post, we have the following main operations that might need to be done (not counting updating variables for the next pixel, just to simplify things):
- Use x coordinate to determine where in line buffer to write pixel
- Use S and T coordinates to determine which sprite map tile to read
- Read sprite map entry
- Use sprite map entry to determine which sprite tile to read
- Use S and T to and sprite tile to determine where to read pixel from
- Read pixel
- Use pixel and palette to find RGBA color in palette data
- Multiply by RGBA scale color
- Add RGB additive color
- Blend to linebuffer address
In converting this to a state machine, we may find some of these can be done, and some take multiple steps. The end result might be something like this (this isn’t a real FPGA programming language, just pseudocode):
(“span” is the current-to-this-pixel data from the current span; “map” is the map selected by the current span)
- ; find the bits of S&T that index into the sprite map
mapx := span.S[19..12] & map.S_mask
mapy := span.T[19..12] & map.T_mask
- ; first half of computing the address of the sprite map index
address := map.base + mapx
address_row := (mapy << map.yshift)
- ; compute the sprite map index address and initiate a read of memory
MEMORY.address := address + address_row
MEMORY.read_8bits = 1
- [wait for read to complete; latch result into 'readresult']
- ; compute the address of the sprite data and the current pixel
mapx := span.S[11..6] >> map.S_shift
mapy := span.T[11.6] >> map.T_shift
address_row := readresult << map.tile_shift
- ; continue above
address := map.image + address_row
mapx := mapx << map.pixel_shift
address_row := mapy << map.pixel_rowshift
- ; continue above
address := address + (mapx | address_row)
and etc.; the above is only the first half or third of the pixel pipeline. In this case, the state machine is simple; each state executes immediately after the previous one on the list (with pauses waiting for the external memory to return results).
One thing you’ll see in the above pseudocode is that I kind of haphazardly reused variables. Each variable is a “register” in the hardware, and the registers use up hardware real estate, so it makes a lot of sense to resuse the register for different purposes at each stage. In fact, a careful design will presumably try to minimize the total number of registers (and the total number of bits across all registers, since registers don’t all have to be the same size) to deal with this.
However, I didn’t bother for this because this aspect will go out the window soon.
If we ignore the wait for memory, we can look at this and see how long it will take. The processing of each pixel has to flow through each of these steps. It’s possible we might be able to do more work in each step (e.g. add more than two numbers) or have each do less (perhaps we can’t do as many shifts per clock as the given design assumes), and these things will influence the clock rate–if we do a lot of work in step #2, then step #2 may take longer than the other steps to run, but we have to set the clock (which is global across the entire design) to a single rate, i.e. slow down the overall clock. This is why the address calculations are split over multiple cycles.
So, as written, the entire pipeline will probably work out to around 15-20 cycles for a single pixel (of the most slow-to-decode kind of pixel). We can now do a back-of-the-envelope calculation for this. If we can get the FPGA to run at 50 Mhz, and we can process about 3 pixels in 50 cycles, then we can process 3 million cycles per second. We know we want to run at 60 frames-per-second, so that gives us about 50,000 pixels per frame. This is a problem–it takes too much time, too many cycles, to process each pixel. Now, if you use simpler pixels, it can be faster, but this will significantly limit how many complex pixels you can use.
A 320×200 screen is 64,000 pixels, and to make a reasonable videogame we have to drawn more pixels than there are screen pixels–we’ll fill the line buffer with a background, and then draw sprites on top of that. In the most extreme, we might like to have something like 3 layers of parallax scrolling, which amounts to needing to draw something like 3.25 pixels per screen pixel. At 320×200 that would be 208,000 pixels, which gives us only 4 cycles per pixel at the proposed clock rate (and who knows if the rate is achievable). The use of sprite maps is the single biggest thing using up cycles, and the use of sprite maps is pretty much mandatory for stuff like parallax scrolling. So saying “use mostly simpler sprite types” doesn’t help.
But remember what I said before: the magic comes from doing more than one thing at a time. The trick is to “pipeline” the above pixel processing so it’s like an assembly line; we can make it so each pixel still goes through 15 steps of processing, and thus takes 15 cycles from start to finish, but once it’s out of the first step of processing another pixel enters that step, and so on down the assembly line, so in the end the limit is how fast the conveyor belt is moving (the FPGA clock speed), not how many stations there are along the conveyor belt (the number of clock cycles needed to process each pixel).
One of the neat things about pipelining is that, at least for simple things, the hardware description of a pipelined process isn’t much different than the non-pipelined one. Next time I’ll take the pseudocode above and alter it into pipelined form.
August 19th, 2009 — joaf
Last time I described the way that a final image can be sensibly created by rendering a single scanline at a time.
To date I’ve been slowly working backwards from the final output signal, showing how each process produces a given output from the previous input. The last article described a process with the input a “list of depth-sorted active spans”, so that’s the next thing I need to show where it comes from to continue working backwards to what the CPU actually gives to the GPU.
However, I’m going to break off for a second and drill down into the span rendering. Last time I focused on how sorting works and how the span buffers get managed, but I didn’t really talk in detail where the individual pixels come from.
Each active span carries two kinds of data: horizontal data (data for rendering a span), and vertical data (data for updating a span for how it should appear in the next scanline). For this post, we’re just going to consider the horizontal data, since that’s what drives the span rendering, and hence the actual pixels.
Each active span has an x coordinate and a length; these determine which pixels in the line buffer are written to. Each sprite then has information describing where the sprite data itself is stored (discussed a ways below).
Sprites can be recolored. There can be a multiplicative RGBA value (each value from 0 to 1 multiplies the corresponding pixel color) and an additive RGB value (each value is added to the corresponding pixel value, and then the value is “saturated” to the max color value).
Sprites can be scaled; if they’re scaled, the span data includes a fixed-point dS/dx value, which stores the reciprocal of the scale factor. The renderer creates an “S” value which is a pixel index into the sprite. For an unscaled sprite, S is incremented after every pixel. For a scaled sprite, Whiz adds dS/dx to S after every frame. The integer part of S is then used to determine which pixel from the sprite to actually render. If dS/dx is less than 1, some integer S values will be generated more than once, causing some pixels to repeat, so the sprite is scaled larger. If dS/dx is more than 1, some integer S values will be skipped, effectively shrinking the sprite.
Sprites can be reflected horizontally; S is initialized to the last pixel in the sprite scanline, and S is decremented (or dS/dx is subtracted) after each pixel is rendered into the line buffer.
Sprite maps (see below) can be rotated. To do this, another value, T, is introduced. The source data is viewed as a 2-dimensional array of pixels, indexed by S and T. On each span, S is incremented by dS/dx and T is incremented by dT/dx after each pixel. In this case, the span also encodes an explicit starting S value and an explicit starting T value. This causes Whiz to walk “diagonally” through the sprite data. If you imagine a rotated object and a horizontal slice through that object, you’ll see that the slice goes diagonally through the object, so this is exactly what we want. (This actually allows for simultaneous rotation and scaling; rotation without scaling in this scheme requires placing a constraint on the values dS/dx and dT/dx can take on at the same time, but we have no reason to do this constraint ourselves since allowing rotation and scaling is great. The only thing we lose is you can represent rotation alone in less space, but there’s no way to represent it in less space and make Whiz efficient at handling them, so.)
Now, back to the sprite image data. In the simplest version, the span stores a pointer to 32-bit RGBA pixels which are the pixels for this span. This is a “direct image”.
Slightly more complicated case: the span stores a pointer to image data that consists of 8-bit palette entries. Whiz provides for 2048 palette entries; any one sprite uses a contiguous subset of 256 of those entries. (Thus the user can divide it into 8 palettes of 256 values, or 64 palettes of 32 values, or whatever.) To accomplish this, the span also stores a palette “base”. The paletted image pixel value is added to the base, and the entire thing is looked-up in the 2048 entry palette. (The span may not store a full 11-bit palette value; we may require base to be a multiple of 2 or 4 to reduce the number of bits necessary to encode the base.) The palette is stored in block RAM on the FPGA, so it’s cheap to access here.
An entirely different case is the “sprite map” case — this is equivalent to “tile mapping” or “character mapping” on other hardware. A sprite map consists of a 2D array of 1-byte values which are “sprite indices”. These sprite indices then index into a “sprite table” to choose a sprite.
To be more precise, each span stores a “sprite map index”. This is an index into an array of “sprite maps” which are stored in block RAM on the FPGA; each sprite map stores a number of values used for image decoding. (Note that this limits the number of unique sprite maps you can have in a single render, which may be limiting. If this data was stored per span it would be much more flexible, but it would take far too much memory.)
Each sprite map has a “tile size” (e.g. 4×4, 8×8, 16×16, 32×32) and a “map size” (e.g. 64×64, 128×128, 256×256). The effective number of pixels in the map is the product of the two on each axis. (But it wraps around outside of that region.) Given an S,T coordinates for a pixel to display, you split the S and T coordinates into the bottom bits (which index a single tile) and the rest of the bits (which index the map). You concatenate the S & T map bits to create a map index, and concatenate the S & T tile bits to create a tile offset.
There is a “map base” address; you add the map base and the map index and fetch a byte at that address from DRAM. (Actually, you keep the last DRAM fetch for this purpose and check if it’s the same one, so when multiple adjacent pixels come from the same tile, you don’t have to fetch the tile repeatedly.) You take that byte, multiply it by the size of each tile (which is the number of pixels per tile times the size of a single pixel, all of which are powers of two so the multiply here is really a shift), add it to the “tile start” address, add the “tile offset” defined above, and that gives you the address of the pixel. Then you fetch the actual pixel value. (If the pixel value is itself paletted, now you run it through the same palette processing described above.)
Regardless of the above steps, you now have an RGBA pixel. At this point you multiply by the RGBA multiplier for the span, add the RGB addend for the span, and composite the pixel into the appropriate spot in the line buffer, and move on to the next span.
The above flow can get fairly complex, especially in the tilemap case. To make this effecient the pixel processing is pipelined; I’ll talk about what this means at a later date.
August 8th, 2009 — joaf
In the last few posts I’ve described the mechanism(s) by which a scanline synthesized by Whiz ends up being displayed to the screen.
Today I’ll start talking about how the synthesis works.
Whiz processes a scanline at a time into a “line buffer”. To do this, Whiz uses “painter’s algorithm” — it draws the scene (or rather, the parts of the scene that intersect a given scanline) in back-to-front order. Objects “in front” of other objects are drawn later, which means they overwrite the further-away objects, which means they obscure them.
Because the FPGA has multipliers available, Whiz can perform “alpha blending” (specifically, the “over” operator). The alpha value specifies a pixel that is partially transparent. If the pixel is transparent, some amount of the background will “show through” the pixel. This is useful for things like water and smoke effects, but it actually can offer a huge boost in visual quality by using it to implement “anti-aliasing”. If the edge pixels of a sprite are made transparent in a certain way, so the background shows through, it will appear visually equivalent to how the pixel would appear if the sprite actually had a more-defined edge within that pixel (e.g. the true shape ought to only actually cover about 2/3rds of a given pixel along the edge of the shape, so the pixel has an alpha value of 2/3rds which produces the right output).
Alpha-blending requires source data with alpha, so it may not be all that common. Storing the alpha value also takes up extra space. However, for paletted images it’s not a big deal, and since there’s no support for 24-bit images, only 32-bit images, it’s not a big deal in that case either. The only case where it’s significant is for 16-bit images, which go from being ARGB 1555 or RGB 565 to being ARGB 4444, which may be noticeable.
On the other hand, Joaf supports “dynamic recoloring”, in which each sprite can come with an RGBA 8888 value which multiplies each pixel before it’s displayed. In the case of alpha blending, this means that even if no sprite has alpha data, you can use this to fade the sprite in/out smoothly (as the multiplied alpha value goes from 1 to 0, the object fades out).
Anway, the basic algorithm for Whiz is to traverse an list of “spans” of pixels. Each span is the intersection of a sprite with a scanline. Whiz runs through the list, drawing that span into the line buffer, and then when it’s finished with the list, it’s finished with the scanline, so it can swap/flush the line buffer and start working on the next one.
In online mode, if Whiz doesn’t finish in time, Iris needs the buffer now, so I see two reasonable options. One is that Whiz abandons the buffer and advances to the next one. The problem with this is that since Whiz draws back-to-front, the stuff that wasn’t drawn will be the frontmost sprites, which are probably the most important ones. So having those disappear would really suck.
The second option is that Iris might redisplay the current line buffer (skip advancing it), and then Whiz can finish the one it’s on. This will cause a weird shifting; one scanline will get repeated, and then the next one will be displayed one scanline down. You wouldn’t want to repeat this (there’s no good way to catch up), so Whiz would then skip processing a scanline. If Whiz took too long processing the next scanline too (which seems likely), what you’d end up with is a region where the screen was displayed at half-resolution, and shifted down by an additional pixel.
I’m not sure which of those tradeoffs is more acceptable.
One possible improvement would be to switch from back-to-front rendering to front-to-back rendering. Front to back can be done in existing GPUs with “destination alpha” and produce the same result as back-to-front alpha blending. However, it gives up the possibility of a multiplicative blend mode (though I think you can still do additive). However, if you store a separate per-channel RGB destination “alpha”, I believe you can make back-to-front work and still support a multiply blend mode as well. This would require doubling the size of the rendered line buffer, but that’s not too bad since the other line buffers are unaffected. It also means you have to clear the line buffer every frame. This is a minor drawback; in back-to-front mode you don’t need to clear it, you can just require the user to display something that touches every pixel (since they usually will be), and then it lets demo coders do crazy stuff where they intentionally don’t clear the previous line buffer and thus get weird vertical smearing effects.
Doing the front-to-back thing would be novel and clever but it would still have its own weird artifacts when time ran out — is the background turning black on some scanlines really appreciably better than the foreground dropping out? Obviously games should be tuned so they never have dropouts, so what we do in this case probably doesn’t matter that much. But nobody can test every case, so we’re talking about what happens when something unexpected happens.
In old consoles, when there was “too much stuff” going on, they ran out of sprites, so they just multiplexed the sprites, making each object flicker because they only displayed every other frame. You can obviously do this in software in Joaf as well, but the problem is knowing that it happens, since there aren’t static sprite limits.
Probably I’ll just do whatever’s easiest first — I think that will be making Whiz bail and swap line buffers and start the next scanline — and see what that’s like.
August 4th, 2009 — joaf
Last time I described how the graphics synthesizer Whiz sends graphics to the display generator Iris one scanline at a time via three line buffers. But there’s another reason why 3 is the magic number.
That post described “online” mode, in which the scene is synthesized just-in-time by Whiz to feed Iris. (Iris is always “just-in-time”, since Iris is generating the actual display and cannot fall behind without breaking the display.) The awesome thing about classic 2d game consoles that is also true in online mode is that your game just always runs at 60 hertz, because it pretty much has to. (Well, it could take longer than a frame to build the display list that drives Whiz, but it seems unlikely. At a minimum you could always have the player and some objects move at 60 hz; similarly, scrolling is cheap, etc.)
Because (as I’ll discuss in a future post) Whiz supports rotated sprites, it’s actually possible to use Whiz to generate Doom-style graphics (for the technically savant, the key concept is “affine spans”). However, for various DRAM/cacheing reasons, it seems unlikely that Whiz will be able to generate Doom-style scenes at 60fps, even for small screen sizes (e.g. 320×200, or whatever NTSC interlaced shakes out as).
So one of my innovations (I think) is that Whiz supports two modes: “online mode” being the mode described previously, and “offline mode” being one that supports rendering scenes at less than full frame rate.
As mentioned, Iris needs to always display at full frame rate. So the way offline mode works is there is a traditional fullscreen framebuffer. In offline mode, an offline-only part of Whiz is responsible for fetching the framebuffer and feeding it to Iris. To keep complexity down, Iris reads from one scanline buffer while this offline-only Whiz task writes to a second scanline buffer. When Iris finishes a scanline, it switches to the second buffer, and Whiz starts filling the first. (It would be possible to synchronize them more closely instead, but this shares more logic with online mode and allows Whiz to drive the DRAM at max speed to fetch a full scanline.)
Meanwhile, all of the rest of the Whiz circuitry is still present and active, synthesizing a scanline at a time into yet another scanline buffer (a third one). When this scanline is finished, it is flushed out to memory (to a second framebuffer–there need to be two framebuffers just as in normal graphics). Because this third line buffer isn’t ever sent directly to Iris, Whiz can take its time generating it, and end up generating at lower frame rates (30, 20, 15, variable, whatever).
Possibly there should be a fourth line buffer so Whiz can start rendering the next scanline without having to wait for the previous one to be flushed out to memory.
Regardless, one of the nice things about this design is that you can use exactly the same code to build a display list and feed it to Whiz whether you’re running offline or online. For example, you could have the game run at 60fps but in-engine cutscenes run at a lower frame rate, yet both use the same drawing technology (one is just more complex scenes). Or if you detect that you’re failing to hit 60fps you could switch over temporarily to offline mode.
Another consequence of this design is that as long as I specify a separate set of registers to determine the screen size Whiz is synthesizing in offline mode from those that Iris is displaying, Whiz becomes a “general purpose” blitter that can be used arbitrarily by the game to generate its display or even for other tasks.
August 2nd, 2009 — joaf
Whiz is the “graphics synthesizer”, although, as we’ll see, if you prefer you can think of it as a blitter on steroids.
Iris generates the video signal in the standard way video signals are generated–left to right, top to bottom. Therefore the “inner loop” of Iris is to generate a single line of the display (a “scanline”) from left to right. The outer loop is to generate each successive line of the display.
Iris displays a single scanline by processing a “line buffer” from left to right and displaying the values in them–the inner loop. When Iris finishes the line buffer, we step through Iris’ outer loop: advancing to the next scanline. To do this, Iris switches to another “line buffer”, which will already have the next scanline’s data — because Whiz will have already filled it out.
So Whiz and Iris are in a producer-consumer relationship. Whiz needs to get done with synthesizing each scanline “just in time” to hand it off to Iris. If Whiz isn’t done, tough luck, it gets handed off to Iris anyway.
We could get away with two line buffers: one that Iris is displaying, and one that Whiz is filling. Or we could have three or more. (When talking about framebuffers, two is called ‘double-buffering’ and three is called ‘triple-buffering’.)
At first, three or more sounds magically superior–if you only have two, and Whiz finishes updating a given scanline, the other one is still being displayed by Iris so Whiz is forced to wait. Having a third one available allows Whiz to keep going.
In practice, this advantage is a lot smaller than it sounds, because you have to look at the steady-state. In the steady state, Whiz finishes with the third scanline and is forced to idle–the third doesn’t help that much.
What more scanline buffers do help for is variability–if some scanlines are more expensive than others, than additional buffers can smooth that out. But, again, there’s a steady state problem. If you have “cheap enough” scanlines, then you hit an expensive scanline that process more pixels and makes you “fall behind” a little, it’s highly likely that the next few scanlines will also need to draw just as many pixels and you’ll keep falling behind until you fall too far behind.
However, there is one exception to this: although number-of-pixels or number-of-active sprites is highly correlated from one scanline to the next and hence suffers from this effect, number-of-sprites-starting-on-a-given-scanline is not. For example, if you have a screen split into two segments, or if you have a bunch of text, you may have a bunch of extra sprites starting on a single scanline. A third scanline buffer may give Whiz the extra buffering it needs to recover from having to process a large number of sprites on a single scanline. (Why is adding a new sprite starting on a given scanline more expensive than processing an existing sprite on a scanline? Existing sprites are already sorted properly, whereas new sprites have to be sorted into the existing sprites. I’ll talk about this more later.)
Therefore we settle on three scanline buffers as the ideal number of buffers for display. More wouldn’t really hurt, but since they’ll be implemented as block RAM we don’t want to get too excessive with them.