Joaf memory latency 2

Last post I was talking about DDR SDRAM, and the complexity of using it.

What about SRAM? I originally assumed it would be too expensive, but let’s take a look. Looking into part pricing on digikey, it appears that 256k x 16bit 10ns Asynchronous SRAM can be had for $5-10, so 1MB would be $10-20 (on top of $20-30 for the FPGA). That’s a reasonable tradeoff if the Asynchronous SRAM makes sense.

While the operation is conceptually simpler, it’s still pretty messy; my FPGA book’s “simple” SRAM interface requires 2 50Mhz clock cycles of latency (i.e. 40ns, despite the part being 10ns), and requires another cycle in between operations. Even bringing it down a cycle, and using 2 16bit parts, that would be a bandwidth limit of 32bits x 25Mhz = 100MB/s. It’s possible to achieve higher performance but you have to get messy and use FPGA-specific features. With 10ns RAM the limit appears to be 20ns (given off-chip delays and such).

Note that in the DDR design I was assuming I would always read 64 or 128 bits at a time because the latency was so high and it’s page mode, so it was better to add a tiny bit more latency and improve the bandwidth. Here the latency is low and there’s no fast repeated reads, so I’ve doubled the SRAM chips (from what my starter board provides) to double the bandwidth up to 32-bit.

Assuming no caches, every pixel Whiz writes would be accompanied by an SRAM memory read, so if the SRAM read takes two clocks back to back at 50Mhz Whiz would be limited to 50/2 megapixels/second. Given the display requires 4.5 megapixels/second, this limits us to a depth complexity of about 5, and we don’t get a speedup from using paletted data. In the naive design the performance limit is 50/3 megapixels/second, for a depth complexity of 3.

If we assume a streaming cache (just cache the last 32-bit read), then we would be limited to the above rate for rotated data, but non-rotated paletted data could process 4 pixels without needing a memory read. Then the pixel pipeline should be written to always write every clock, and will need to pause whenever it goes to memory; or we could build into the pipeline a two-cycle latency and only pause it if back-to-back reads are required. (We’ll also require reads for the character lookup.)

That’s probably a reasonable middle ground, so I’ll proceed with this design.

The Cyclone II starter kit I have only has 1 SRAM, so I’m more bandwidth limited–I can only read 16 bits at a time so pixel formats that use 16 bits will be limited by SRAM speed, and other accesses to memory will be slow. If I switched to Spartan 3 I believe those have 2 256Kx16 SRAMs so I could test the full performance.

Joaf memory latency

I posted before that where I stopped Joaf was because I couldn’t figure out DRAM latency to finish the paper design.

I finally tracked it down here.

Using Xilinx’ DRAM controller, the expected latency from initial memory issue to readback of the first bytes is 17 cycles.

If I understand it correctly, this is with the memory interface clocking at the memory bus speed, e.g. 166Mhz or 200Mhz (for DDR2-333 and DDR2-400 respectively) — there’s a separate app note than the above for 400Mhz DDR2. I presumably won’t be able to make the rest of the system run at 200Mhz, but I think it’s feasible to make it run at some integer fraction and have buffering to communicate from one to the other. (Why they wouldn’t have made that part of the interface themselves, I don’t know. Again, it’s kind of a gross part to have to implement up front.)

This leads to about 12M reads/second; fudging it a little to allow time for some extra cycles for a short burst read of some length, about 10M reads/second.

For 320×240 at 60hz, there are 4.6M pixels per second. (And I’d really like to target 640×480, but it’s clear this isn’t feasible while doing anything interesting.)

If we assume about 80% of memory time is used for reading pixels (the rest for display list reads, read/write from soft CPU, audio, etc), then we get about 8M reads per second. Assuming we can read & use 64-bits at a time per read (we can actually go faster than that if we stay within a row, so it might be better to define the interface as always reading even 128-bits), that gives about 64MB/s, or about 14 bytes per pixel.

With paletted images, that’s an average depth complexity of 14; with 16-bit images, an average depth complexity of 7. That seems like plenty! (But it would be pushing it for 640×480. Although we could consider 640×240, i.e. interleaved 640×480.)

The 200Mhz thing is actually fairly aggressive (can’t be achieved on the eval kit directly); 166Mhz is more plausible, which brings it down to a depth complexity of 10/5.

If rotating (or e.g. Doom spans), so each read produces only one pixel, we have a depth complexity of something like 1.2; which means we can rotate the whole screen and almost nothing else, or we could rotate individual sprites but not the background. Sounds reasonable!

The thing that rotation needs for performance is cacheing, but because of the line buffering approach, the cache has to last long enough to get to the next line to be processed. Non-rotated data won’t take up as much cache (instead of each displayed pixel requiring its own cache line, multiple displayed pixels come from the same cache line), but if we ONLY cache rotated data, we might make it more manageable.

Spartan-6 is the current Xilinx thing, but I’m not sure they’re actually available yet, so I haven’t found any prices. A reasonable Spartan-3a offers 360Kb of block RAM; if I assume I could use half of it for a cache. A Spartan-6 with a similar number of slices offers 936Kb-2Mb of block RAM; a slightly reduced version in slice count, but with more logic cell estimates, still offers 576Kb of block RAM. The current design puts all palettes in block RAM (say, 2048*32bits = 64Kb), plus all active spans (say, worst case of ~400 spans of 200bits is 80Kb) plus the current heads of all display lists (minimal). This requires something like ~150Kb of block RAM.

Two full rows of rotated pixels will need 640 cache lines. To make the cache addressable we’ll make that 1024 cache lines. Each cache line will need to store a few pixels to get useful cacheing; say, 64-bits. Each cache line will also need to store the line’s address; say 20 bits. So we’d need to store at least 80Kbits in a cache to make this effective. And hey, that sounds pretty feasible, in fact we can bump it up a lot more. Maybe on the Spartan-6 maybe we should just make all memory accesses go through a cache, although having the CPU cache interfere with the GPU cache might be bad, because we want the GPU to be predictable. But we can make all GPU accesses go through a GPU cache, and then a unified cache for CPU/audio. Given the size available on Spartan 6, maybe we bump up the GPU cache to 4096 entries = 320Kb of block RAM. This might make higher resolutions more feasible, too.

ZSAMIF – massive detuned oscillators

Just before the weekend before I got my monome, I saw a short clip from How To Destroy Angels showing the Swarmatron being played.

At this point in sGAR’s life, I had the basic software running, I had the grid sequencer and the mixer up, but the grid sequencer was controlling this crappy soft synth I had written for a videogame, and I wasn’t planning on shipping. I knew I needed a proper softsynth, and the Swarmatron was perfect inspiration. I would make a softsynth with tons of detuned oscillators; computers are fast and can do a good job of this.

Working on paper, I eventually came up with the performance-oriented design found in the demo below. A virtual keyboard is at the bottom, and then controls are in the middle and top. There were so many controls I realized I’d have to have subscreens with all the different controls. To allow you to play music with both hands and then change controls on the fly, I made a design where the modal controls (the ones that aren’t sliders) are duplicated on the left and the right. Leaving padding, this meant sliders in the middle could only have 12 values.

Thus the original ZSAMIF design was born: it would have 1 to 12 oscillators. (In the final version, the 12 buttons are remapped so the buttons further to the right add more than one oscillator, so the actual range is 1-32 oscillators, although I think more than 12 turns out to be pointless.)

Originally ZSAMIF was called MAZZIF, as a nice google-able misspelling of “massive”, but eventually I discovered there was already a softsynth named “MASSIVE” and I didn’t want people to think mine was a clone of theirs, since I’d never heard of it when I designed mine.

Since that video was made, I added many features. Most important is a “paramter change rate” control, which allows you to change a parameter in the UI but it only changes gradually in the synth. This lets you do things like PWM-sweeps automatically but user-triggered. There’s also a variant where parameters only change when you hold them down, and then change back when you release them. This lets you get a nice live-performance effect and was directly inspired by the How To Destroy Angels video.

ZSAMIF offers 4 waveforms: sine, triangle, square, and sawtooth. each one has two variants, which respond differently to the “pulse width” setting (which perhaps should be thought of as “morph” rather than “pulse width”).

All 32 oscillators use the same waveform. The oscillators are controlled in parallel; a single slider sets the “spread” for a given value for all the active oscillators. So the “detune” slider controls how wide the pitch detuning is. The “octave” slider determines the degree to which different oscillators are in different octaves (or overtones). The “pitch lfo depth” determines the degree to which the different oscillators pitch bend differently in response to the pitch lfo, and the “pitch lfo speed” slider determines the degree to which each’s oscillator’s lfo runs at a different speed from the “base” speed.

I also added the option for sequential ring modulation (where each oscillator ring modulates the previous). This is… weird. Ring modulation tends to drive a signal towards 0 because it’s multiplicative, so having more than 3 oscillators is usually a bad idea. The square wave is an exception, since it’s never 0, and indeed each multiplication by 1 or -1 still leaves the signal at 1 or -1. One of the two “pwm” square waves stays alternating between 1 and -1 and shrinks (the “naive” square wave PWM). The problem is this creates a DC bias (as it’s more often 1 than -1), and when you overlay dozens of oscillators, the DC bias creeps up to the point where it can be at the limit of the range of the audio. This is the source of distortion/cut-out about halfway through the video. However, while this mode is bad for regular oscillators (so I planned to fix the DC offset by making it vary between, say, 0.25 and -1 when it’s at 80% high), it’s awesome for ring modulation since the signal loses no energy at all. The ringmod PWM turns out to be one of the most interesting sounds I’ve discovered while doing this.

There are many, many more controls.

From an implementation standpoint, ZSAMIF doesn’t use any of that clever “minBLEP” stuff that is needed to get high-quality anti-aliased digital waveforms. I just generate with 4x oversampling and filter it down. This is because minBLEP-style stuff isn’t compatible with the arbitrary waveform “PWM”-ish stuff I’m doing (maybe–it’s definitely not for sine waves). Plus it would have been a bunch of hard work and I just wanted to bash it out. If you put in high enough overtones/octaves you can definitely hear the aliasing when using the detune to make the pitches move; presumably it’ll be more fairly obvious in those extreme cases once I put in a monophonic mode with portamento. Perhaps someday I’ll revisit the possibility of a minBLEP-ish implementation (I have one in another directory).

sGAR feature list and initial programming

This is the raw list of pages from the design document as it currently stands:

1. Mixer
2. Track Editor
3. Track Pitch
4. Pattern Editor
5. Instrument Selector
6. Note Performer
7. Pattern Performer
8. History Rewinder
11. Track EQ
12. Song Loop Layout
13. Instrument Paint
14. Track Ordering
15. Set Tempo
16. Analog Synth
17. Step Filter
18. Guitar Looper

Each of these has a sample layout and a lot of text explaining the buttons and their functionality. When I first started programming sGAR, the list was probably 75% as long, and it’ll be longer when I’m “done”, if there’s any such thing (since I’ll probably just always put random whatevers in this).

I wrote a little OpenGL grid display and made it so you could click on buttons with the mouse. To support “multitouch”, I made it so you could click on the grid with the left mouse button, hold the button down, then click somewhere else on the grid with the right mouse button, and as long as you kept holding the left button down, the corresponding button stayed down.

I built the app in a “framework” that was just a copy of the source to an old unfinished videogame that had a soft-synth and some drum samples built-in for doing a simple music track in the game.

So I started programming the grid-based note editor, and I wired it up to trigger the existing bass sound from that videogame soft synth. Wanting to make it sound more interesting, I went on musicdsp.org and built a graphic equalizer and hooked it up in sGAR. Then I went back to musicdsp and pulled down the formant code off that. It’s kind of dumb — just five vowel sounds — but hey, it’s code and it just works without me thinking about it. Created a little “random formant start/end sound” screen in the mixer and voila. A lot of craziness even if I was restricted to a dumb synth bass sound.

sGAR design documentation

sGAR began life as a text file I wrote documenting the design. I began writing it before it was 100% locked down that I could get a 256 soon, but I was impatient to start.

I find it is very useful to document the design on paper, in this case with a little 16×16 grid of characters, because it’s a lot faster to design on paper than it is in code. So I tend to write a fairly thorough design document, trying to cover every aspect of the thing, although it’s not meant for end-users so it’s terse, albeit exact.

For example, here’s one of the “pages” as originally designed, which is basically an mlr ripoff (without me ever having used mlr, so randomly different, and also designed to integrate in a complex way with two of the other pages, so missing things like groups). Note that some of the user interface ideas written down are contradictory, and some turn out to be bad ideas.

I also wasn’t sure how much PWM I was going to be able to get away with to create intermediate intensities (as it turns out, essentially none: I’m using one level, but it’s very flickery), but I quickly found I couldn’t really do effective designs without using some PWM, so I assumed I’d be able to use one level. Thus “0″ is on, “.” is off, and “+” is 50% brightness.

=== page Pattern Editor ===

0 0 + 0 0 + 0 + 0 + 0 0 0 + 0 0
0 + 0 0 + + 0 + 0 + + 0 0 + 0 +
0 . 0 . 0 . 0 . 0 . 0 0 . 0 0 .
0 . . . . . . . . . . . 0 0 . .
+ + + + + + + + + + + + + + + +
0 0 + + 0 0 + 0 + 0 0 + + + + +
. . . 0 . . . . . . . . 0 . . .
. . . . . . . . . . . . . . . .
+ + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + +
. . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .
+ + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + +
. . . . . . . . . . . . . . . .
0 0 . + + . + + + . + + + . 0 +

This page effectively shows all the note editors
for all tracks simultaneously, by collapsing out
pitch information.

In the sequencer, there are up to 14 tracks, each with up
to 16 “time events”. Normally each time event lasts for an
eighth note, but see discussion below.

Each row corresponds to one track, each column to one
time event. A given coordinate is lit up if there is “data”
at that event time. (A note in the note event, or above-threshhold
audio in the beat, not counting the first N milliseconds.)

The 15th row selects whether each event lasts for an eighth or a quarter
(or, equivalently, a sixteenth or an eigth). This allows time signatures
longer than 16 beats at the cost of only being able to have events
occur on 16 of the beats. So for example 5+5+5+2 could be programmed
as ‘O… …. …. ….’ with the first event lasting for a quarter and every
event after that an eighth. Swung time can be programmed by
alternating long and short beats.

Pressing a button causes that track to restart
at that point in time. Framing (both-handing) a
horizontal or box causes that track (or tracks)
to loop that region of time. If the dragged box
includes beat 1, it will synchronize with the
current looping rather than jump in time.
The time signature can be changed by dragging
out a box covering all the tracks. (This is actually
detected as a special case, and the pattern loop
controls are altered. Maybe should make an
explicit control for it.)

Pressing two buttons on the same track at the
same time causes the track to play from both
places at the same time.

LMODE is a shift key allowing specifying the length
multiplier for each track. Press LMODE then MMODE to
lock LMODE one-handed.

. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. . . . . . . . . . . . . . . .
0 0 . + + . + + + . + + + . 0 +

Each track can take up
1 or 2 beats, or 0.5 measures, or 1 measure, or 2 measures, 3 measures,
up to 8 measures.

RMODE allows button presses to be recorded,
so a given track will play out of sequence according
to the button hits.

At this writing this page is entirely unimplemented; but that’s kind of the point, even the pages that are implemented started life like this.

sGAR overview

sGAR is a collection of music creation tools which use a monome 256 as a controller.

I’ve always toyed with the idea of writing a guitar effects processor or a audio music editor using a PC. Fortunately Reaper came along and did a good enough job of the latter that I’m not tempted anymore.

Lately there have been a few Flash-based grid music-creation tools (typically in a pentatonic scale). The mouse isn’t really a good interface for it, but it also reminded me of the great experience of programming the Roland TR-707, which is where I cut my teeth learning what drum parts sounded like, and remains to this day how I visualize drum patterns.

Cut to discovering the monome: this was just perfect. Grid-based editing, but you can hit multiple buttons at once. Actual buttons, not sloppy multitouch (although this is a clever design for using multitouch that I think would have worked really well; shame about the touch surface company being acquired by Amazon and dropping the product). Programmable. Open source.

Sounded like just a thing I could use for all my music software needs, and ones I didn’t know I needed.

I bided my time, hoping to get my hands on a monome 256 (a 16×16 grid), but monomes are built in very small qualities (since they’re handmade by a single couple). I bid on the 512s when they auctioned off 4 of them, but it was just a tad out of my price range.

Fortunately soon after that I was able to swing a deal for a 256, and there was no looking back.

joaf hiatus – RAM

So, I haven’t updated this blog in nearly a year, so I guess Joaf is officially on hiatus if not dead.

The issue I ran into was the problem of accessing RAM from an FPGA. To complete my paper design, I needed to get a ballpark idea of what kind of latency I could get reading from RAM.

The problem is that it depends on what RAM I chose, which was a free variable. Plus I couldn’t be 100% sure just reading data sheets, because interacting with RAM is kind of confusing. And if I was wrong by a factor of 2 in the wrong direction, I’d end up with something which could only run at half of a reasonable frame rate.

The reason the design is latency-limited, not bandwidth-limited, is because it didn’t seem feasible to use an on-chip cache for the graphics. Because of the row-at-a-time design, you could get locality on a single scanline, but by the time you got to the same sprite on the next scanline, any reasonably-sized cache would be flushed. And if you rotated a sprite you’d get basically no locality at all. So a design where it fetched every pixel explicitly from RAM seemed a lot more predictable than trying to rely on cache; I could say rotation was free.

But I couldn’t figure it out on paper, and so it seemed like the only way to figure out what was feasible and plausible was to actually implement a DRAM interface, but that stuff is pretty far down towards the deep end of FPGA development… it seems to be the really hard stuff, and not where I wanted to start when it came to actually programming the FPGA. (I wouldn’t have to do the final version, just a prototype, but even so…)

All of which added up to an impassible barrier to further development: I just couldn’t collect the data I needed to move forward with the paper-design-so-far.

Unless I can find a way to figure that out, it appears the project is over.

Next I’ll post about the new, non-Joaf project I started a couple weeks ago.

sprite list offsetting

Last time I described how sprites are represented as “spans” and how they can be sorted together to be used later with painter’s algorithm.

The input to that step was a set of “sprite lists” which Whiz will sort together; each sprite list is itself sorted (top-to-bottom, back-to-front). Whiz keeps pointers to each sprite list in block RAM and uses those pointers to fetch one sprite at a time from DRAM. I’ll talk about how to get those sprite lists into DRAM in the next post. For this brief post I want to put some finishing touches on sprite lists.

Each sprite descriptor loaded from DRAM consists of 2 or 3 64-bit words. There are 8 different descriptor formats; this effectively compresses the descriptor which might take around 4 64-bit words if it could express every possible aspect in one format. (I don’t know if that complexity is worth the memory/bandwidth savings; because MC exists it doesn’t affect Poet-side performance.)

Somewhere in each sprite descriptor is an x coordinate, a y coordinate, and a z coordinate. When we load each descriptor from DRAM into Whiz, we expand it out to the full 4×64 description. While we’re doing that expansion, we can add a global sprite-list offset to each of x, y, and z. As long as none of those values wraparound, the sorting of the sprite-list against itself will be unchanged (since the same value is added to every sprite). That means the post-addition list is still sorted properly, so it will still interact correctly with the rest of the system.

Note however that we don’t allow scaling or rotating the entire sprite list. We don’t allow sprite rotation in general (a rotated sprite is no longer rectangular–we require the user to describe a rotated sprite as a series of separate 1-line sprites). Scaling down might cause objects starting on adjacent source scanlines to draw starting on the same scanline, so allowing that would require some resorting of the list. Scaling up might be possible, but it’s a direction that just exposes imprecision (e.g. the lack of subpixel coordinates on the source data) and the other restrictions make it feel non-orthogonal. And it seems wasteful to reserve FPGA multipliers for this possibly rarely used feature.

But just addition is pretty powerful. There’s a whole complicated system of sprite maps to allow you to do big blocks of things that move together for backgrounds and e.g. parallax scrolling. However, sprite maps are doubly-indirect per pixel (or, possibly, they might be written to by two nested loops, although that’s painful if scaled and implausible if rotated). Sprite maps still require pixel processing in open areas (even if we say that tile id 0 is always blank and can be early-outted, once we’re pipelining pixel operations there’s no meaningful way to early out–the only way to get this savings is if it were done with nested loops as mentioned earlier–but that way also is worse with pipelining if it requires pipeline flushes).

Sprite-list offseting offers another way to create big moveable mass of stuff, by building the background as a sprite list. There are a few differences: maybe faster because it’s not doubly-indirect. Sprite elements don’t have to be square. Sprite elements can overlap each other. (E.g. a complicated boss monster. E.g. a big map of square tiles, except each tile extends off the edges a little with alpha blended edges.) Things could look like Braid.

To make this work we need to handle offsetting a sprite list off the edge of the screen. We explicitly don’t handle a sprite list being the entire world; it will always have to be a screen-sized chunk of the world. This gives us fewer sprites to cull and requires less precision for the x & y coordinates.

Let’s pretend the eventual application of this architecture is a screen that’s 1280×720. (We won’t handle this day one, but we forsee FPGAs becoming fast enough that that resolution is possible, so let’s see what the math bears out.)

The x and y coordinates for a single screen need to range from 0..1280 and 0..720. If we restrict the largest representable x value as 2048, then we would have to limit the offseting to 2048-1280. In fact, we don’t want to have to enforce an assertion like ‘offset < 800′, because we’re in hardware, so instead we enforce an assertion like ‘-1024 <= offset < 1024′, which enforces some limited number of bits.

For 1280×720, we could limit the offset to -1024,1024 in x, and -512,512 in y. This would force the use of smaller-than-screen-sized sprite lists (if they’re scrollable). Or if we want to allow screen-sized sections, we need to use -2048,2048 in x and -1024,1024 in y.

We then have to accurately store these numbers (without wraparound), so we’d need to store (0-2048, 0-1024) and (1280+2048, 720+1024), which means we’d need to store x as -4096,4096 and y as -2048,2048. This requires 2 extra bits for y and 2 extra bits for x. (In fact, it’s easy to show that support-a-screen-sized-section will always require 2 extra bits.) A non-screen sized section needs to allow a max of (1280+1024, 720+512) which turns out to still need the same number of bits of post-offset x and y. (This might be provably true, I’m not sure.)

Oh but wait! I forgot to mention that the original x & y values need to be able to be negative (because it’s too annoying to force Poet to do the clipping for them, given that we have to have clipping hardware for the sprite list case). And they have the same bit representation issue (it would be weird for the hardware to blow up if you go outside some legal range). So that suggests initial x should be able to be [-2048,2048) in the first place. Fortunately this still doesn’t affect the output range (again, probably provably).

Now, if we suppose our actual resolution is 320×240, then this works out to sprite values being: [-512,512) and [-256,256) and offsets being [-512,512) and [-256,256). Note that we’d like to encourage using smaller-than-screen-sized sections, because they’ll require less processing of clipped sprites; so we could shrink the allowed offsets to half the size to force that, but then we’d be wasting output precision (which has to round up to the next power of two). So instead we expose all of the precision and just advise users to use smaller sections. They’ll be able to see the performance difference for themselves anyway.

That’s about all there is to that. Each sprite list will be programmed as a pointer to the sprite list, and the x/y/z offsets. The x/y/z offsets won’t be stored in the sprite list itself (e.g. in a header of it), because (a) there’s no header currently, and (b) because this allows you to instantiate multiple copies of it on a single screen.

Note also that by design a sprite list like this can be in ROM; that is, there’s no requirement for Poet to construct it on the fly (like there is if you have no x/y/z offsetting). Thus you can easily imagine cases for using it trivially. Use it to directly represent static scrolling backgrounds. Have 30 variant ROM sprite lists to represent an animating boss character, and pick the right one and position it each frame. (This is why I assume the 2/3 64-bit word “compressed” format is worth it; to save on ROM space and not require full-fledged decompression.)

One limitation with this approach: how do you do non-full-screen-sized scrolling? This is easy with tile maps; the size of the tile map sprite rectangle is the size of the scrolling screen. This method inherently relies on the screen clipping to determine where it renders. Unless we add clipping to some other rectangle, it can’t really be used for anything else. Well, you could draw as a background in a smaller rectangle by giving it the smallest z value, and then drawing some other background over it whereever you don’t want it; but there’s no way to e.g. do a split screen with two separate layers.

When I actually write the clipping hardware I’ll look at how hard it would be to do arbitrary rectangle clipping instead of screen clipping, but I don’t see any other easier approaches.

span sorting

Last time I wrapped up a “tangent” explaining in detail how one pixel is computed and how the pixels are computed in parallel using pipelining.

To return from the tangent, then, I’m going to pick off where I left off in describing how data flows from the CPU to the display, which I’ve been doing by walking backwards through the process.

As described so far, the screen is drawn as a series of rendered line buffers; to draw a single line buffer, Whiz process a list of “active spans”.

Each span actually represents a sprite rectangle. While Whiz draws into a given line buffer, it creates a temporary copy of the span and updates certain data that needs to change “horizontally”. Additionally, Whiz updates the span “vertically” and adds it to the new list of active spans for the next scanline.

So really each span descriptor isn’t a span descriptor, but a “bottom of a sprite” descriptor–each one starts describing a full sprite, but then as the sprite is displayed it is updated to only characterize the parts of the sprite that are left. For various reasons I’m going to continue to call this a “span”, anyway. (Basically, when it’s a sprite, it has an explicit Y value; but once it starts being displayed, it no longer stores an explicit Y value, and I stop considering it a sprite.)

The list of active spans contains one entry for every sprite which intersects a given scanline. When the last scanline of a sprite is reached, the span is deleted (or rather, isn’t added to the list for the next scanline). When the first scanline of a sprite is reached, the sprite must be added.

The list is drawn in order, so the list must be sorted in “painter’s order” (back to front).

Any given pair of sprites will always draw in the same painter’s order. So as long as we keep the active list in the same order as we updated it from one line to the next, we don’t need to re-sort the sprites.

What this means is that we therefore only have one actual problem to deal with: adding new sprites into the active list in sorted order.

Suppose we have a list of sprites we want to draw in a given frame. Because of the way we process a line at a time, we’ll need to add sprites that appear higher-up on the screen to the active span list before adding sprites that appear lower-down on the screen — the sprites that appear higher will need to start getting drawn into the line buffers sooner.

The upshot of this is that we need to add sprites in the following order:

  • Sprites are added in order of the first scanline they draw to
  • Sprites on the same scanline are added in z (depth) order
  • Sprites added to the active span list must be positioned in correct z (depth) order relative to the existing active spans

The job of adding sprites to the active span list is entirely Whiz’s job — it will be processing a list of sprites and updating the active list. But the task of adding them in the right order can’t be solved by Whiz — we can’t just hand Whiz a pile of sprites and expect it to do the right thing. So instead we can just require the sprite list created by Poet to be in the right order:

In a sprite list provided by the CPU, sprites must be sorted in order first of their starting scanline, then in their z-order.

The sprites must still have an explicit z-ordering value; Whiz uses this to merge the new sprites in with the active list by comparing their z values.

However, forcing Poet to fully sort all of the sprites all the time isn’t particularly ideal. That sort is expensive, and Whiz has some free cycles to be able to do something smarter. So Joaf supports multiple sprite lists (each of which must be sorted as above). In a given screen, you might use 2 “static” sprite lists for separate scrolling backgrounds, one sprite list for a boss monster made out of multiple pieces, and one sprite list for all the general moving objects. In general, you can avoid sorting entirely by intentionally designing the display in ways that allow the multiple sprite list usage, and being able to construct each sprite list explicitly without sorting.

Note that there is a limit to the number of static sprite lists due to their use of block RAM, and using more sprite lists increases the cost per sprite of adding them to the active span list, as described below.

I can now characterize Whiz’s algorithm for processing the active span list in each scanline, given multiple input sprite lists.

Process A:

  1. Wait until pixel-drawing unit is ready for a new span; then dequeue the head of the active span list; send a copy to the pixel-drawing unit
  2. Update the entry to the appropriate value for the next scanline; store to the “next active buffer”

Process B:

  1. Scan all the cached next-entry for each sprite list; choose the one that sorts first (smallest Y, then Z, then X)
  2. Copy the sort-first one into the “sorted next buffer”
  3. Load the next item for that sprite list into the cached-next-entry for that sprite list

Process C:

  1. If the Y value of the “sorted next buffer” is that of the next scanline, then if the comparison of (Z,X) for the “sorted next buffer” is less than that of “next active buffer”, do step 2 else do step 3
  2. Copy “sorted next buffer” into the active list queue; do Process B
  3. Copy “next active buffer” into the active list queue; do Process A

In practice process A is a little subtler; we always begin drawing the next span as soon as we can, and if it’s the last scanline of a sprite it won’t need to output an entry into the next active list anyway. So that should run ASAP without waiting until the next-active is ready to be output. So there’s actually another buffer there.

There’s a limit to how many active spans you can have, based on the block RAM usage for storing the active spans. (I haven’t decided for certain what the limit is, since it can trivially use multiple block RAMs to store different parts of it.) A Doom-style engine draws each column of a wall as a separate span, so I probably want to support ~500 spans at once for that case (320 vertical spans plus overdraw).

The cost of merging a new sprite into the active list is very small, because we use a simple merge between the two lists. The actual cost is limited by the cost of computing the next item to compare. For checking the existing active spans, this can get blocked by the cost of actually rendering the thing–we can’t compute a new one until we finish rendering the previous one. A one-item buffer here mitigates the worst-case cost (e.g. a span that covers the entire scanline, e.g. a scrolling background), but only somewhat. Then again, if span-rendering were to be so slow that it slowed down active-span computation to the point it didn’t complete in the time alotted for one scanline, the span-rendering itself would probably have that problem too.

The other side of this is more problematic. There are two limiting factors on the cost of computing the next new sprite to merge. One is the cost of loading the next sprite list entry from memory. We cache the first pending entry from each list in block RAM, so we only pay this load cost when we serve a sprite onto the active list. The worst case for this would be when we start a whole list of new sprites, all from the same sprite list. E.g. this will happen at the top of the screen. The second cost is the cost of scanning. With only one sprite list, we always compare with that sprite. With multiple sprite lists, we have to compare all the sprite list entries to each other and pull out the first one. This requires O(N) scanning time for N sprite lists, which will force N to stay small.

Overall, this shouldn’t be too much of a problem. If we run at 320×240 and support say 2x overdraw, then our pixel pipeline has to process 640 pixels per scanline, which, including overhead per span, probably requires around 800-1000 cycles per scanline. Alternately, 40Mhz (an arbitrary target) gives us about 2800 cycles per scanline. The triple-buffering of line buffers lets us suck up the cost of the occasional expensive scanline (like having a bunch of sprites start on the same scanline), and if in typical use we started, say, 20 sprites per scanline, even if we had to scan 10 lists this would still only require 200 cycles per scanline to sort. So overall it seems pretty good, but obviously the numbers could eventually get too large.

It might be possible to do a slightly improved strategy: keep track of which sprite lists have sprites starting on a given scanline, and only scan those sprite lists. Suppose we had 30 sprite lists (to avoid doing any CPU sorting at all), but we have 100 sprites starting on the first scanline. This would require 3000 cycles to process the original way. If we instead know that only 3 of the lists actually have any sprites on that scanline, then we can reduce the cost down to 300 cycles instead of 3000.

To make the sprite lists more useful, they can be “repositioned” (as if they were sprites themselves), which I’ll talk about next time.

pixel pipelining

Here’s the pseudocode for the beginning of the pixel processing from last time:

  1. ; find the bits of S&T that index into the sprite map
    mapx := span.S[19..12] & map.S_mask
    mapy := span.T[19..12] & map.T_mask

  2. ; first half of computing the address of the sprite map index
    address := map.base + mapx
    address_row := (mapy << map.yshift)

  3. ; compute the sprite map index address and initiate a read of memory
    MEMORY.address := address + address_row
    MEMORY.read_8bits = 1

  4. [wait for read to complete; latch result into 'readresult']
  5. ; compute the address of the sprite data and the current pixel
    mapx := span.S[11..6] >> map.S_shift
    mapy := span.T[11.6] >> map.T_shift
    address_row := readresult << map.tile_shift

  6. ; continue above
    address := map.image + address_row
    mapx := mapx << map.pixel_shift
    address_row := mapy << map.pixel_rowshift

  7. ; continue above
    address := address + (mapx | address_row)

Within each step, all of the operations occur in parallel; by chance none of them happen to read and write to the same register in a single cycle, so it’s non-obvious, but normally I could have a single item that said ‘x := y; y := x;’ as two statements, and the point is that x gets the y value from the previous clock, and y gets the x value from the previous clock.

Now, let’s take the above fragment of state machine logic and pipeline it. To pipeline it, each state in the above sequence becomes its own little section of the hardware with its own inputs and outputs. Each pipeline step needs to know about everything about the current pixel. The stuff in the current pixel is stored in the structure named ‘span’ in the pseudocode, so that stuff can no longer be “global”–each pipeline stage must access a different pixel’s data.

We’re going to assume all the pixels in the pipeline come from the same span; that means all the references to the ‘map’ structure above stay the same. Let’s look at the first couple steps, and rewrite things to ignore the memory access timing problem (assume we’re getting the data from a fast cache and ignore what happens if the cache misses):

  1. ; find the bits of S&T that index into the sprite map
    mapx := span.S[19..12] & map.S_mask
    mapy := span.T[19..12] & map.T_mask

  2. ; first half of computing the address of the sprite map index
    address := map.base + mapx
    address_row := (mapy << map.yshift)

  3. ; compute the sprite map index address and initiate a read of memory
    map_cache.address := address + address_row
    map_cache.read_8bits = 1

  4. ; compute the address of the sprite data and the current pixel
    newmapx := span.S???[11..6] >> map.S_shift
    newmapy := span.T???[11.6] >> map.T_shift
    tile_address := map_cache.result << map.tile_shift

Much of this is exactly as it was before; however, the numbers are no longer meaningful. Before the number indicated separate stages controlled by a state machine; now every one of those operations is occuring at the same time, and the numbers are just to show the correspondence, and to show how long it takes the data to have flowed down to a given stage.

The main change in the code is that we don’t reuse any registers; we’re just going to make new registers at every stage. Remember how I said each pipeline stage is its own little box with inputs and outputs? The registers actually live at the boundary between the pipeline stages–the output of stage N becomes the input to stage N+1.

(You can’t reuse registers because everything happens at the same time; if two different stages wrote to the same register (trying to reuse it), they’d both be writing to the same register simultaneously, which you can’t do.)

Technically, as described so far, you just need an array of simple flip-flops, i.e. a buffer, to serve as these input/outputs, not a register. A register adds the ability to keep its old value instead of updating. In normal execution, we never need this ability, but if the pipeline needs to stall waiting for memory, then we need to freeze the state of the pipeline before the stage that’s blocked — which means latching the state in the registers.

But what about those span.S??? things? The problem here is that if we’re pipelined, then by the time the data flows to stage 4, the S and T variables will have been incremented 3 times (since it takes 3 cycles to flow to there). At stage 4, we need what are now old S & T values. The audio way to think of this is that we need a delay line. The simple way to think of this is that each pipeline stage must have all the information for the pixel available, and if that info varies per pixel, it needs to be a pipeline input–which means it needs to output by the previous register.

So naively we do this:

  1. ; find the bits of S&T that index into the sprite map
    mapx := span.S[19..12] & map.S_mask
    mapy := span.T[19..12] & map.T_mask
    S1 := span.S
    T1 := span.T

  2. ; first half of computing the address of the sprite map index
    address := map.base + mapx
    address_row := (mapy << map.yshift)
    S2 := S1
    T2 := T1

  3. ; compute the sprite map index address and initiate a read of memory
    map_cache.address := address + address_row
    map_cache.read_8bits = 1
    S3 := S2
    T3 := T2

  4. ; compute the address of the sprite data and the current pixel
    newmapx := S3[11..6] >> map.S_shift
    newmapy := T3[11.6] >> map.T_shift
    tile_address := map_cache.result << map.tile_shift

Remember how last time I mentioned we wanted to minimize the number of registers and the total register bits in the system, and that’s why I reused variables? That’s still true here. We propogate S & T through the system with far too many bits given that all we’re going to do at the end is extract a few of those bits and pack them together.

So what we should do is extract the needed bits as soon as possible. And I’ll make the bit counts in the registers explicit, since we’re trying to minize them:

  1. ; find the bits of S&T that index into the sprite map
    mapx[8] := span.S[19..12] & map.S_mask
    mapy[8] := span.T[19..12] & map.T_mask
    S1[6] := span.S[11..6] >> map.S_shift
    T1[6] := span.T[11.6] >> map.T_shift

  2. ; first half of computing the address of the sprite map index
    address[24] := map.base + mapx
    address_row[24] := (mapy << map.yshift)
    pixoff2[12] := (T1 << map.S_count) | S1

  3. ; compute the sprite map index address and initiate a read of memory
    map_cache.address[24] := address + address_row
    pixoff3[12] := pixoff2

  4. ; compute the address of the sprite data and the current pixel
    tile_offset[24] := map_cache.result << map.tile_shift
    pixoff4[12] := pixoff3

Eventually, further down the pipeline, we’ll have to write things out, so another thing we’ll need to propogate down the pipeline is the current pixel’s x location. On the other hand, every x location will be one more than the previous one (or one less if the sprite is reflected horizontally), so it would be more efficient to store the span’s x location into an x register deep in the pipeline, latch it until the pipeline flow reaches that stage, and then just increment or decrement it there.

On the other hand, because we have block RAM available, it’s also possible we could try to store some of our “registers” to block RAM–specifically, when we compute something (or access it at the beginning) but won’t use it for some number of stages, rather than cascading series of hand-coded registers (which will use a bunch of FPGA resources), we can store all the data to a circular buffer in block RAM, and just propogate along the appropriate index into the circular buffer from which the relevant data can be loaded.

Block RAM has 1 read and 1 write port, which means that on any given cycle, you can write to it once and read from it once. So the above technique can only be used for data that is only accessed in one phase of the pipeline; the 1 write/cycle is used when initializing the “register” (either loading it from the initial pixel state, or setting it on the fly from some computation), and the 1 read is the read from one phase of the pipeline. It can’t be read from two phases of the pipeline, because the pipeline is pipelined, so all the stage happen at the same time (just accessing different pixels). In fact, each conceptual register must be stored in a separate block RAM, because we can only read/write once per cycle from a given block RAM, so if we tried to store multiple registers in one block RAM they’d end up being accessed from separate stages which will need to index different spots in the circular buffer.

Since the total number of block RAMs is limited, and this would only use a small amount of the storage available in each block RAM (I think they’re 4Kbits, or maybe 18kbits?) thus wasting most of it, I’m going to proceed with the design using explicit (non-block-RAM) cascaded registers. (This is, as far as I know, how real hardware has to implement pipelining.)

pixel pipeline

Last time I described the basic abilities of the pixel pipeline in terms of how each pixel from a span is generated one after the other, and the various properties each pixel can have.

This time I want to start drilling down a little and talk about what this looks like “in hardware” (or at least in the FPGA).

The most important thing to understand about hardware is that the magic comes from doing more than one thing at a time. Where traditional programming is inherently sequential, clock-based hardware is inherently parallel — every operation on the chip will happen every clock cycle.

Most hardware algorithms need to use sequential algorithms to accomplish their tasks, so they’re divided into a sequence of steps. A piece of “control logic” can then enable and disable each task on every clock; the upshot is that one small part of the hardware might be set up to have three steps, and to execute them in order, 1, 2, 3, and then repeat starting at 1. Typically such a process would work as follows: step 1 accepts input from the outside and process it to some temporary values; step 2 operates on those temporary values and produces new temporary values; step 3 operates on those temporary values and then passes them along to some other unit.

These controlled sequences may have variations; they might do different things after step 2 depending on some control inputs from elsewhere in the system, or based on the values of the data. In general a piece of localized, locally controlled sequential logic like this is called a “state machine”; it’s a machine that goes through a sequence of states which determine what it does.

If we look at the pixel pipeline defined last post, we have the following main operations that might need to be done (not counting updating variables for the next pixel, just to simplify things):

  • Use x coordinate to determine where in line buffer to write pixel
  • Use S and T coordinates to determine which sprite map tile to read
  • Read sprite map entry
  • Use sprite map entry to determine which sprite tile to read
  • Use S and T to and sprite tile to determine where to read pixel from
  • Read pixel
  • Use pixel and palette to find RGBA color in palette data
  • Multiply by RGBA scale color
  • Add RGB additive color
  • Blend to linebuffer address

In converting this to a state machine, we may find some of these can be done, and some take multiple steps. The end result might be something like this (this isn’t a real FPGA programming language, just pseudocode):

(“span” is the current-to-this-pixel data from the current span; “map” is the map selected by the current span)

  1. ; find the bits of S&T that index into the sprite map
    mapx := span.S[19..12] & map.S_mask
    mapy := span.T[19..12] & map.T_mask

  2. ; first half of computing the address of the sprite map index
    address := map.base + mapx
    address_row := (mapy << map.yshift)

  3. ; compute the sprite map index address and initiate a read of memory
    MEMORY.address := address + address_row
    MEMORY.read_8bits = 1

  4. [wait for read to complete; latch result into 'readresult']
  5. ; compute the address of the sprite data and the current pixel
    mapx := span.S[11..6] >> map.S_shift
    mapy := span.T[11.6] >> map.T_shift
    address_row := readresult << map.tile_shift

  6. ; continue above
    address := map.image + address_row
    mapx := mapx << map.pixel_shift
    address_row := mapy << map.pixel_rowshift

  7. ; continue above
    address := address + (mapx | address_row)

and etc.; the above is only the first half or third of the pixel pipeline. In this case, the state machine is simple; each state executes immediately after the previous one on the list (with pauses waiting for the external memory to return results).

One thing you’ll see in the above pseudocode is that I kind of haphazardly reused variables. Each variable is a “register” in the hardware, and the registers use up hardware real estate, so it makes a lot of sense to resuse the register for different purposes at each stage. In fact, a careful design will presumably try to minimize the total number of registers (and the total number of bits across all registers, since registers don’t all have to be the same size) to deal with this.

However, I didn’t bother for this because this aspect will go out the window soon.

If we ignore the wait for memory, we can look at this and see how long it will take. The processing of each pixel has to flow through each of these steps. It’s possible we might be able to do more work in each step (e.g. add more than two numbers) or have each do less (perhaps we can’t do as many shifts per clock as the given design assumes), and these things will influence the clock rate–if we do a lot of work in step #2, then step #2 may take longer than the other steps to run, but we have to set the clock (which is global across the entire design) to a single rate, i.e. slow down the overall clock. This is why the address calculations are split over multiple cycles.

So, as written, the entire pipeline will probably work out to around 15-20 cycles for a single pixel (of the most slow-to-decode kind of pixel). We can now do a back-of-the-envelope calculation for this. If we can get the FPGA to run at 50 Mhz, and we can process about 3 pixels in 50 cycles, then we can process 3 million cycles per second. We know we want to run at 60 frames-per-second, so that gives us about 50,000 pixels per frame. This is a problem–it takes too much time, too many cycles, to process each pixel. Now, if you use simpler pixels, it can be faster, but this will significantly limit how many complex pixels you can use.

A 320×200 screen is 64,000 pixels, and to make a reasonable videogame we have to drawn more pixels than there are screen pixels–we’ll fill the line buffer with a background, and then draw sprites on top of that. In the most extreme, we might like to have something like 3 layers of parallax scrolling, which amounts to needing to draw something like 3.25 pixels per screen pixel. At 320×200 that would be 208,000 pixels, which gives us only 4 cycles per pixel at the proposed clock rate (and who knows if the rate is achievable). The use of sprite maps is the single biggest thing using up cycles, and the use of sprite maps is pretty much mandatory for stuff like parallax scrolling. So saying “use mostly simpler sprite types” doesn’t help.

But remember what I said before: the magic comes from doing more than one thing at a time. The trick is to “pipeline” the above pixel processing so it’s like an assembly line; we can make it so each pixel still goes through 15 steps of processing, and thus takes 15 cycles from start to finish, but once it’s out of the first step of processing another pixel enters that step, and so on down the assembly line, so in the end the limit is how fast the conveyor belt is moving (the FPGA clock speed), not how many stations there are along the conveyor belt (the number of clock cycles needed to process each pixel).

One of the neat things about pipelining is that, at least for simple things, the hardware description of a pipelined process isn’t much different than the non-pipelined one. Next time I’ll take the pseudocode above and alter it into pipelined form.

pixel processing

Last time I described the way that a final image can be sensibly created by rendering a single scanline at a time.

To date I’ve been slowly working backwards from the final output signal, showing how each process produces a given output from the previous input. The last article described a process with the input a “list of depth-sorted active spans”, so that’s the next thing I need to show where it comes from to continue working backwards to what the CPU actually gives to the GPU.

However, I’m going to break off for a second and drill down into the span rendering. Last time I focused on how sorting works and how the span buffers get managed, but I didn’t really talk in detail where the individual pixels come from.

Each active span carries two kinds of data: horizontal data (data for rendering a span), and vertical data (data for updating a span for how it should appear in the next scanline). For this post, we’re just going to consider the horizontal data, since that’s what drives the span rendering, and hence the actual pixels.

Each active span has an x coordinate and a length; these determine which pixels in the line buffer are written to. Each sprite then has information describing where the sprite data itself is stored (discussed a ways below).

Sprites can be recolored. There can be a multiplicative RGBA value (each value from 0 to 1 multiplies the corresponding pixel color) and an additive RGB value (each value is added to the corresponding pixel value, and then the value is “saturated” to the max color value).

Sprites can be scaled; if they’re scaled, the span data includes a fixed-point dS/dx value, which stores the reciprocal of the scale factor. The renderer creates an “S” value which is a pixel index into the sprite. For an unscaled sprite, S is incremented after every pixel. For a scaled sprite, Whiz adds dS/dx to S after every frame. The integer part of S is then used to determine which pixel from the sprite to actually render. If dS/dx is less than 1, some integer S values will be generated more than once, causing some pixels to repeat, so the sprite is scaled larger. If dS/dx is more than 1, some integer S values will be skipped, effectively shrinking the sprite.

Sprites can be reflected horizontally; S is initialized to the last pixel in the sprite scanline, and S is decremented (or dS/dx is subtracted) after each pixel is rendered into the line buffer.

Sprite maps (see below) can be rotated. To do this, another value, T, is introduced. The source data is viewed as a 2-dimensional array of pixels, indexed by S and T. On each span, S is incremented by dS/dx and T is incremented by dT/dx after each pixel. In this case, the span also encodes an explicit starting S value and an explicit starting T value. This causes Whiz to walk “diagonally” through the sprite data. If you imagine a rotated object and a horizontal slice through that object, you’ll see that the slice goes diagonally through the object, so this is exactly what we want. (This actually allows for simultaneous rotation and scaling; rotation without scaling in this scheme requires placing a constraint on the values dS/dx and dT/dx can take on at the same time, but we have no reason to do this constraint ourselves since allowing rotation and scaling is great. The only thing we lose is you can represent rotation alone in less space, but there’s no way to represent it in less space and make Whiz efficient at handling them, so.)

Now, back to the sprite image data. In the simplest version, the span stores a pointer to 32-bit RGBA pixels which are the pixels for this span. This is a “direct image”.

Slightly more complicated case: the span stores a pointer to image data that consists of 8-bit palette entries. Whiz provides for 2048 palette entries; any one sprite uses a contiguous subset of 256 of those entries. (Thus the user can divide it into 8 palettes of 256 values, or 64 palettes of 32 values, or whatever.) To accomplish this, the span also stores a palette “base”. The paletted image pixel value is added to the base, and the entire thing is looked-up in the 2048 entry palette. (The span may not store a full 11-bit palette value; we may require base to be a multiple of 2 or 4 to reduce the number of bits necessary to encode the base.) The palette is stored in block RAM on the FPGA, so it’s cheap to access here.

An entirely different case is the “sprite map” case — this is equivalent to “tile mapping” or “character mapping” on other hardware. A sprite map consists of a 2D array of 1-byte values which are “sprite indices”. These sprite indices then index into a “sprite table” to choose a sprite.

To be more precise, each span stores a “sprite map index”. This is an index into an array of “sprite maps” which are stored in block RAM on the FPGA; each sprite map stores a number of values used for image decoding. (Note that this limits the number of unique sprite maps you can have in a single render, which may be limiting. If this data was stored per span it would be much more flexible, but it would take far too much memory.)

Each sprite map has a “tile size” (e.g. 4×4, 8×8, 16×16, 32×32) and a “map size” (e.g. 64×64, 128×128, 256×256). The effective number of pixels in the map is the product of the two on each axis. (But it wraps around outside of that region.) Given an S,T coordinates for a pixel to display, you split the S and T coordinates into the bottom bits (which index a single tile) and the rest of the bits (which index the map). You concatenate the S & T map bits to create a map index, and concatenate the S & T tile bits to create a tile offset.

There is a “map base” address; you add the map base and the map index and fetch a byte at that address from DRAM. (Actually, you keep the last DRAM fetch for this purpose and check if it’s the same one, so when multiple adjacent pixels come from the same tile, you don’t have to fetch the tile repeatedly.) You take that byte, multiply it by the size of each tile (which is the number of pixels per tile times the size of a single pixel, all of which are powers of two so the multiply here is really a shift), add it to the “tile start” address, add the “tile offset” defined above, and that gives you the address of the pixel. Then you fetch the actual pixel value. (If the pixel value is itself paletted, now you run it through the same palette processing described above.)

Regardless of the above steps, you now have an RGBA pixel. At this point you multiply by the RGBA multiplier for the span, add the RGB addend for the span, and composite the pixel into the appropriate spot in the line buffer, and move on to the next span.

The above flow can get fairly complex, especially in the tilemap case. To make this effecient the pixel processing is pipelined; I’ll talk about what this means at a later date.

Scanline synthesis

In the last few posts I’ve described the mechanism(s) by which a scanline synthesized by Whiz ends up being displayed to the screen.

Today I’ll start talking about how the synthesis works.

Whiz processes a scanline at a time into a “line buffer”. To do this, Whiz uses “painter’s algorithm” — it draws the scene (or rather, the parts of the scene that intersect a given scanline) in back-to-front order. Objects “in front” of other objects are drawn later, which means they overwrite the further-away objects, which means they obscure them.

Because the FPGA has multipliers available, Whiz can perform “alpha blending” (specifically, the “over” operator). The alpha value specifies a pixel that is partially transparent. If the pixel is transparent, some amount of the background will “show through” the pixel. This is useful for things like water and smoke effects, but it actually can offer a huge boost in visual quality by using it to implement “anti-aliasing”. If the edge pixels of a sprite are made transparent in a certain way, so the background shows through, it will appear visually equivalent to how the pixel would appear if the sprite actually had a more-defined edge within that pixel (e.g. the true shape ought to only actually cover about 2/3rds of a given pixel along the edge of the shape, so the pixel has an alpha value of 2/3rds which produces the right output).

Alpha-blending requires source data with alpha, so it may not be all that common. Storing the alpha value also takes up extra space. However, for paletted images it’s not a big deal, and since there’s no support for 24-bit images, only 32-bit images, it’s not a big deal in that case either. The only case where it’s significant is for 16-bit images, which go from being ARGB 1555 or RGB 565 to being ARGB 4444, which may be noticeable.

On the other hand, Joaf supports “dynamic recoloring”, in which each sprite can come with an RGBA 8888 value which multiplies each pixel before it’s displayed. In the case of alpha blending, this means that even if no sprite has alpha data, you can use this to fade the sprite in/out smoothly (as the multiplied alpha value goes from 1 to 0, the object fades out).

Anway, the basic algorithm for Whiz is to traverse an list of “spans” of pixels. Each span is the intersection of a sprite with a scanline. Whiz runs through the list, drawing that span into the line buffer, and then when it’s finished with the list, it’s finished with the scanline, so it can swap/flush the line buffer and start working on the next one.

In online mode, if Whiz doesn’t finish in time, Iris needs the buffer now, so I see two reasonable options. One is that Whiz abandons the buffer and advances to the next one. The problem with this is that since Whiz draws back-to-front, the stuff that wasn’t drawn will be the frontmost sprites, which are probably the most important ones. So having those disappear would really suck.

The second option is that Iris might redisplay the current line buffer (skip advancing it), and then Whiz can finish the one it’s on. This will cause a weird shifting; one scanline will get repeated, and then the next one will be displayed one scanline down. You wouldn’t want to repeat this (there’s no good way to catch up), so Whiz would then skip processing a scanline. If Whiz took too long processing the next scanline too (which seems likely), what you’d end up with is a region where the screen was displayed at half-resolution, and shifted down by an additional pixel.

I’m not sure which of those tradeoffs is more acceptable.

One possible improvement would be to switch from back-to-front rendering to front-to-back rendering. Front to back can be done in existing GPUs with “destination alpha” and produce the same result as back-to-front alpha blending. However, it gives up the possibility of a multiplicative blend mode (though I think you can still do additive). However, if you store a separate per-channel RGB destination “alpha”, I believe you can make back-to-front work and still support a multiply blend mode as well. This would require doubling the size of the rendered line buffer, but that’s not too bad since the other line buffers are unaffected. It also means you have to clear the line buffer every frame. This is a minor drawback; in back-to-front mode you don’t need to clear it, you can just require the user to display something that touches every pixel (since they usually will be), and then it lets demo coders do crazy stuff where they intentionally don’t clear the previous line buffer and thus get weird vertical smearing effects.

Doing the front-to-back thing would be novel and clever but it would still have its own weird artifacts when time ran out — is the background turning black on some scanlines really appreciably better than the foreground dropping out? Obviously games should be tuned so they never have dropouts, so what we do in this case probably doesn’t matter that much. But nobody can test every case, so we’re talking about what happens when something unexpected happens.

In old consoles, when there was “too much stuff” going on, they ran out of sprites, so they just multiplexed the sprites, making each object flicker because they only displayed every other frame. You can obviously do this in software in Joaf as well, but the problem is knowing that it happens, since there aren’t static sprite limits.

Probably I’ll just do whatever’s easiest first — I think that will be making Whiz bail and swap line buffers and start the next scanline — and see what that’s like.

online vs. offline

Last time I described how the graphics synthesizer Whiz sends graphics to the display generator Iris one scanline at a time via three line buffers. But there’s another reason why 3 is the magic number.

That post described “online” mode, in which the scene is synthesized just-in-time by Whiz to feed Iris. (Iris is always “just-in-time”, since Iris is generating the actual display and cannot fall behind without breaking the display.) The awesome thing about classic 2d game consoles that is also true in online mode is that your game just always runs at 60 hertz, because it pretty much has to. (Well, it could take longer than a frame to build the display list that drives Whiz, but it seems unlikely. At a minimum you could always have the player and some objects move at 60 hz; similarly, scrolling is cheap, etc.)

Because (as I’ll discuss in a future post) Whiz supports rotated sprites, it’s actually possible to use Whiz to generate Doom-style graphics (for the technically savant, the key concept is “affine spans”). However, for various DRAM/cacheing reasons, it seems unlikely that Whiz will be able to generate Doom-style scenes at 60fps, even for small screen sizes (e.g. 320×200, or whatever NTSC interlaced shakes out as).

So one of my innovations (I think) is that Whiz supports two modes: “online mode” being the mode described previously, and “offline mode” being one that supports rendering scenes at less than full frame rate.

As mentioned, Iris needs to always display at full frame rate. So the way offline mode works is there is a traditional fullscreen framebuffer. In offline mode, an offline-only part of Whiz is responsible for fetching the framebuffer and feeding it to Iris. To keep complexity down, Iris reads from one scanline buffer while this offline-only Whiz task writes to a second scanline buffer. When Iris finishes a scanline, it switches to the second buffer, and Whiz starts filling the first. (It would be possible to synchronize them more closely instead, but this shares more logic with online mode and allows Whiz to drive the DRAM at max speed to fetch a full scanline.)

Meanwhile, all of the rest of the Whiz circuitry is still present and active, synthesizing a scanline at a time into yet another scanline buffer (a third one). When this scanline is finished, it is flushed out to memory (to a second framebuffer–there need to be two framebuffers just as in normal graphics). Because this third line buffer isn’t ever sent directly to Iris, Whiz can take its time generating it, and end up generating at lower frame rates (30, 20, 15, variable, whatever).

Possibly there should be a fourth line buffer so Whiz can start rendering the next scanline without having to wait for the previous one to be flushed out to memory.

Regardless, one of the nice things about this design is that you can use exactly the same code to build a display list and feed it to Whiz whether you’re running offline or online. For example, you could have the game run at 60fps but in-engine cutscenes run at a lower frame rate, yet both use the same drawing technology (one is just more complex scenes). Or if you detect that you’re failing to hit 60fps you could switch over temporarily to offline mode.

Another consequence of this design is that as long as I specify a separate set of registers to determine the screen size Whiz is synthesizing in offline mode from those that Iris is displaying, Whiz becomes a “general purpose” blitter that can be used arbitrarily by the game to generate its display or even for other tasks.

whiz-iris line buffers

Whiz is the “graphics synthesizer”, although, as we’ll see, if you prefer you can think of it as a blitter on steroids.

Iris generates the video signal in the standard way video signals are generated–left to right, top to bottom. Therefore the “inner loop” of Iris is to generate a single line of the display (a “scanline”) from left to right. The outer loop is to generate each successive line of the display.

Iris displays a single scanline by processing a “line buffer” from left to right and displaying the values in them–the inner loop. When Iris finishes the line buffer, we step through Iris’ outer loop: advancing to the next scanline. To do this, Iris switches to another “line buffer”, which will already have the next scanline’s data — because Whiz will have already filled it out.

So Whiz and Iris are in a producer-consumer relationship. Whiz needs to get done with synthesizing each scanline “just in time” to hand it off to Iris. If Whiz isn’t done, tough luck, it gets handed off to Iris anyway.

We could get away with two line buffers: one that Iris is displaying, and one that Whiz is filling. Or we could have three or more. (When talking about framebuffers, two is called ‘double-buffering’ and three is called ‘triple-buffering’.)

At first, three or more sounds magically superior–if you only have two, and Whiz finishes updating a given scanline, the other one is still being displayed by Iris so Whiz is forced to wait. Having a third one available allows Whiz to keep going.

In practice, this advantage is a lot smaller than it sounds, because you have to look at the steady-state. In the steady state, Whiz finishes with the third scanline and is forced to idle–the third doesn’t help that much.

What more scanline buffers do help for is variability–if some scanlines are more expensive than others, than additional buffers can smooth that out. But, again, there’s a steady state problem. If you have “cheap enough” scanlines, then you hit an expensive scanline that process more pixels and makes you “fall behind” a little, it’s highly likely that the next few scanlines will also need to draw just as many pixels and you’ll keep falling behind until you fall too far behind.

However, there is one exception to this: although number-of-pixels or number-of-active sprites is highly correlated from one scanline to the next and hence suffers from this effect, number-of-sprites-starting-on-a-given-scanline is not. For example, if you have a screen split into two segments, or if you have a bunch of text, you may have a bunch of extra sprites starting on a single scanline. A third scanline buffer may give Whiz the extra buffering it needs to recover from having to process a large number of sprites on a single scanline. (Why is adding a new sprite starting on a given scanline more expensive than processing an existing sprite on a scanline? Existing sprites are already sorted properly, whereas new sprites have to be sorted into the existing sprites. I’ll talk about this more later.)

Therefore we settle on three scanline buffers as the ideal number of buffers for display. More wouldn’t really hurt, but since they’ll be implemented as block RAM we don’t want to get too excessive with them.

joaf components

“Joaf” is a placeholder name for this project.

The old Atari console and 8-bit computers were made of a number of separate chips, each of which had a “cute” name which may or may not have been an acronym (like Antic, Pokey, and CTIA.)

In that spirit, Joaf is divided into a number of logical components (they’ll all be on the FPGA in practice), each of which has a name.

  • Iris is the component which takes a scanline buffer and generates an actual video signal. I’m not sure how this is going to work in the long run, exactly (the prototype board only does something like 2/4 bits per color channel, so it looks pretty lame). Iris doesn’t need to access the external DRAM chip; it just blindly outputs what’s in the line buffer, and it’s up to another component to fill that fast enough.
  • Auda is the component that does audio synthesis and mixing.
  • Poet is the soft-CPU (“soft-CPU’ means the CPU is implemented out of FPGA elements, which is not a very efficient way to make a CPU in hardware).
  • Waldo is the component that interfaces with DRAM. (It might also be in charge of managing a cache for Poet or something.)
  • Whiz is the interesting, magical component: Whiz is the component that synthesizes graphics into Iris’ scanline buffers. Whiz is going to be the primary user of the DRAM (it’ll have priority over Poet) and also the block RAM. (Block RAM is RAM built into the FPGA itself — it’s not constructed out of regular FPGA components, rather it’s built directly out of the silicon so it’s much more efficient use of chip real estate–unless you don’t need the RAM.)
  • MC is an extra little component that is an optional mediator between Poet and Whiz: Whiz wants data packed up tightly in little bundles. It will be expensive for Poet to do this directly, since it will require multiple shifts and masks and ors and memory writes and such. Because packing bits together is trivial in hardware, MC is an extra hardware component which only exists to pack the bits together in exactly the way Whiz needs; MC is directly wired to Poet and Waldo — Poet requests a bitpacked bundle, and MC packs it and writes it to memory.

Whiz will be the primary subject of Joaf postings for the indefinite future. I have a bunch of ideas for how to make a more game/graphics-friendly CPU by simply including a bunch of instructions that are obvious and simple but that most CPUs don’t provide. But it turns out the Larrabee instruction set includes most of them as well, so it’s not that interesting. (This is in one sense unsurprising: the additions to the Larrabbee instruction set were driven by game/graphics technology programmers.)

Also I’ll talk about MC some more.

joaf

One of my lingering projects is a videogame console on an FPGA. I was inspired in part by this project.

Mainly, I love the idea of a piece of hardware generating the video display “just in time” (as in the Atari 2600, for which I also have a lingering project). I’m not very interested in a console-on-a-chip that’s just a fast CPU. Doing some kind of crazy graphics-generator is where all the fun is.

I could build something cool and fun with money as no object, but I don’t think that puts interesting constraints on the project. Instead, my goal is to produce something that other people could build cheaply for themselves if they wanted one, and that it be interesting enough that they might want one. I don’t actually think they will, but I want it to be an option, if only to constrain the design. (I like to picture demoscene coders getting bored with ever-increasing performance of modern 3d hardware and turning to interesting retro platforms in the future. Again, I don’t think they will, but I want something that would interest them if they did.)

So, the goal is to produce a videogame console using a small number of cheap parts. E.g. maybe a $40-50 part count. That means a reasonably low-end FPGA (no PowerPC-on-the-chip FPGAs), a little RAM, various overheads.

I’ll develop the project entirely on an FPGA prototyping board (although I haven’t found one with 24-bit color, so that might be a deal-breaker for this path), and other people who want to play with it without assembling from parts could also buy the same FPGA prototyping board (which is much more expensive, though).

I bought a Cyclone II prototyping board last year, although I haven’t done anything with it except run the test projects (I think I compiled it from scratch, too). That’s because it doesn’t come onto the actual development path for a while:

  1. Complete a paper design of the basic graphics engine
  2. Build a simulator for the paper design
  3. Verify the paper design works and is usable and interesting
  4. Implement the design on the FPGA
  5. Integrate an existing C-compilable soft-CPU onto the FPGA
  6. Make some tests using the combined system
  7. Find a good way to accept user input on the proto board
  8. Make a “game”.
  9. Implement the audio playback engine on the FPGA
  10. Make some tests using the combined system
  11. Make a game.
  12. Finish the design of a RISC soft-CPU
  13. Implement the CPU on the FPGA
  14. See if the combined system is small enough to fit on a reasonable FPGA.
  15. If not, go back to the pre-existing small soft-CPU and save the RISC CPU for another project.

I’m mostly done with step 1, the paper design of the graphics engine, and I’ve started step 2, the simulator.

The basic design for the graphics engine is for a primarily-2D engine inspired by the Atari Jaguar; the hardware generates the graphics “just in time”, with a one-scanline buffer.

Having a one-scanline buffer for the graphics chip to render into simplifies the complexity of having a very large number of sprites on a single scanline; implementing this the way the old low-sprite-count machines did (by computing every output pixel left-to-right) would be prohibitively expensive. Instead, you can have an “arbitrary” number of sprites on each scanline, as long as you have enough time to draw them all.

In a different design, using a one-scanline buffer could allow you to make a console that, say, generated HD output without having any external RAM at all (specifically, no framebuffer)–using only the small amounts of RAM on the FPGA as your main memory. But you’d be fairly limited in the sorts of things you could draw… gouraud-shaded polygons, maybe? Perhaps you could load sprite data from the game ROM, still? (What’s the latency to an external ROM like?)

Instead, I’m assuming you’re drawing all sprites, and they’re coming from RAM. This means DRAM latency may be the limiting factor for how fast some complex effects can be done. (The system supports scaled and rotated sprites, and, as a side-effect, supports limited Doom-style 3D graphics as well, although those have even worse DRAM latency issues.)

Actually, now that I’ve written all that, I’m thinking maybe it would be worth making a simpler graphics engine first, one that does the no-external-RAM thing. This might be good “practice” towards doing the full project, and might be an interesting gimmick in its own right. (HD would still be hard to achieve, though; 1280×720 at 60fps is 50M pixels, so if the processor is clocked at 50Mhz there’s only one cycle per output pixel.)

project blog

I’ll be using this blog to post notes about in-progress personal projects for projects that are long-enough term that I want to talk about them before they’re done.