sAGAs buffers and DMA

1. sAGAs – Milkymist
2. sAGAs – FPGA synthesizer
3. sAGAs – Synthesis
4. sAGAs – Oscillator design
5. sAGAs – Mixer
6. sAGAs – buffers and DMA

To keep the data flowing in the oscillator units, sAGAs deploys a large number of on-chip RAM buffers, each dedicated to a specific purpose, and a DMA system for reading and writing them to memory. All memory traffic goes through the DMA unit.

Given the goal of keeping the oscillator producing useful data every cycle, DMA makes more sense than a cache; a cache forces the system to stall on a cache miss. Additionally, we’re always processing frames of samples (like 64-256 samples), so there’s enough coherence there to DMA a big chunk of data in or out.

The DMA system reads from main memory into a temporary buffer, and then dispatches the temporary buffer to the on-chip RAM. (This temporary staging is done for bandwidth reasons, discussed later.) When writing to main memory, the data is gathered into a temporary buffer, and then written out.

It’s also possible to transfer data between certain internal RAM buffers without going to main memory, by staging them through the DMA temporary buffers. These will still have significant latency, but they won’t use main memory bandwidth. This isn’t necessary for the core synthesis algorithms, but it allows you to push synthesized data back to earlier stages for further processing, which may be useful in various ways.

To make everything work efficiently with these DMA’d on-chip RAMs, nearly everything must be double-buffered. While the oscillator is waiting for new PCM data to be DMA’d in, it needs to be able to do something else; so there’s more than one sample-memory buffer per oscillator. When the oscillator finishes with a voice and triggers the mixer to mix it, the oscillator has a second voice buffer that it can start mixing the next voice into.
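
To make the double-buffering concrete, here is a minimal software sketch of the ping-pong arrangement. The names and frame size are hypothetical; the real thing is block RAMs and control logic, not C, but the invariant is the same: the oscillator never reads the bank the DMA is currently filling.

    #include <stdint.h>

    #define FRAME_LEN 64

    typedef struct {
        int16_t buf[2][FRAME_LEN];   /* two sample-memory banks               */
        int     active;              /* bank the oscillator reads this frame  */
    } ping_pong_t;

    static void run_frame(ping_pong_t *pp,
                          void (*oscillate)(const int16_t *samples, int n),
                          void (*dma_fill)(int16_t *dst, int n))
    {
        /* In hardware these two proceed concurrently; the second bank exists
           so neither ever has to wait for the other to release the memory. */
        oscillate(pp->buf[pp->active], FRAME_LEN);      /* consume bank A  */
        dma_fill(pp->buf[pp->active ^ 1], FRAME_LEN);   /* refill bank B   */
        pp->active ^= 1;                                /* swap next frame */
    }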

These buffers are stored using FPGA block RAMs. The design is optimized right up against the capabilities of the Spartan-6; if future FPGAs were to reduce any of the available data widths or address widths it would cause problems, but assuming that doesn’t happen, the design should carry forward fine (although it may waste wider/longer memory).

For example, the block RAMs are nominally 16/32-bit, but are actually 18/36-bit. Although we’re used to audio sample data being 16-bit, I was considering making at least some of the sample buffers 32-bit for extra precision (for example, sometimes you want to move an oscillator buffer into sample memory so you can do interpolated sampling from it). However, since the FPGA multipliers are limited to 18 bits wide, I decided to settle on making the sample buffers 18 bits wide as well, since anything more is wasted.

The Spartan-6 block RAM is 18Kb (kilobits, not kilobytes), but can be split into two independent 9Kb units. Each unit has two read/write ports. Each 9Kb unit can operate as 256×36, 512×18, 1024×9, etc. However, in 256×36 mode, there is one read port and one write port, whereas at other bit-widths each port is individually configurable to read or write. (Presumably the block RAM circuit has 36 input data lines and 36 output data lines, so in 256×36 mode there are only enough lines for one read and one write simultaneously, whereas at 512×18 and narrower, there are enough lines to issue two reads or two writes simultaneously.)

What block RAM buffers do we need? How many of each do we require? Do they need to be double-buffered?

Here’s the list:

  • sample buffers: We need to DMA new sample PCM data into sample memory, so double-buffering these makes sense. We would also like to store some sample data (like sine wave tables) permanently. Since each oscillator can load from two sample buffers per cycle, this suggests having four sample buffers per oscillator.
  • mixer mix buffers: in the normal case, DMA from the mixer is rare, so there’s no particular need to double-buffer the mixer buffers. However, we want multiple mixer buffers for other reasons (to allow each voice to be output on multiple busses, so that you can e.g. have separate reverb levels per voice without having to mix the voice multiple times), but you could use the extras for double-buffering instead if you’re doing more complicated things with the mixer.

  • oscillator voice buffers: double-buffered so oscillator can run while mixer is mixing or DMA is loading/saving voice data. Most of the time this doesn’t really need to be double-buffered, because the fraction of time the data is being read/written by DMA or mixer is low.

    However, if you’re doing PCM mixing or something else where the oscillator only makes a single pass over the voice buffer, then you’d spend 50% of your time waiting.

    Since this case is mixer-bandwidth bound, you could do PCM sampling from both oscillators, have both wait 50% of the time, and saturate the mixer, so even then double-buffering isn’t necessary. However, if you’re doing some PCM voices and some (say) additive voices, the optimal path is to PCM mix in one oscillator and do additive in the other. With double-buffered voice buffers, the PCM oscillator can run 100% of the time and use 100% of the mixer, and it will just stall on the rarer occasions when the other oscillator needs to use the mixer.

  • DMA burst buffers: The DMA reads/writes memory 64 bits per cycle. Many of the units the DMA is reading/writing will be in 256×36 mode, meaning the DMA can only read or write a single 32-bit value per cycle to those buffers. So the DMA transfer actually happens to/from 64-bit-wide burst buffers, and then the transfer to the actual memory happens at a lower rate. There are two burst buffers to allow double-buffering.

    Note that this means the effective DMA bandwidth is cut in half for those units, because even with the double-buffering, you’re limited by the rate at which you can actually deliver the data to the final memory (the DMA can’t write to two independent sample banks simultaneously). But note that we only expect to achieve 50% of the theoretical bandwidth anyway. Also note that with two sAGAs cores, each could only use at most half of the available bandwidth anyway. (With more cores, we assume the extra cores will try to minimize transfers–i.e. be performing additive or MinBLEP synthesis, where all the tables can be preloaded.)

  • Command queues: The command sequence for the oscillators is DMA’d in from memory. Each oscillator gets a separate block RAM to store its pending command queue. (The mixer has a short queue of pending commands, and doesn’t need block RAM for it.) Commands are 128 or 256 bits. If a single oscillator operation required both a 128-bit and a 256-bit command, and the command queue were 32 bits wide, it would take 12 cycles just to pull the commands needed for that operation out of the queue. This might cause a stall in the oscillator pipeline, so we require the command queue to be 64 bits wide.

How do those actually break down into physical resources? Here are the 9Kb buffers used by a single sAGAs core (two oscillators, one mixer):

  • 4 – oscillator voice buffers: two 256×32 buffers per oscillator
  • 4 – mixer mix buffers: two sets of stereo pairs of 256×32 buffers
  • 32 – sample buffers: four sample buffers per oscillator, each logically 2048×18 but implemented as two 1024×18 block RAMs; thus 8 18Kb buffers per oscillator, 16 18Kb total, or 32 9Kb buffers
  • 4 – DMA burst buffers: two sets of two 256×32 block RAMs, each pair of buffers allows the DMA to burst at 64-bit/cycle
  • 4 – command queue buffers: two 256×32 block RAMs per oscillator, to give an effective 256×64

Note that the sample buffers totally dominate. I had originally made them smaller, but when it became clear sAGAs was going to be multiplier-limited on Spartan-6 and not block RAM limited, I increased the size of the sample buffers.

The above comes to 48 9Kb buffers, or 24 block RAMs. The Spartan-6 in the Milkymist has 116 block RAMs, so each sAGAs core will use about 20% of them.

That’s a lot, but as stated before, each core also uses 12-14 “DSP” multipliers, which is 20-25% of those available. With four sAGAs cores (which is pushing it anyway), we’d use 96 block RAMs, leaving only 20 for the rest of the system. This is as it should be; I’m trying to design a system where sAGAs uses up most of the resources.

But is there enough left for the other components we need? Milkymist uses a 32KB cache for the soft CPU; that requires 16 block RAMs. And we still need display generation (so you can hook up a monitor directly) and anything else I’ve forgotten. However, I don’t think we’d ever actually have four sAGAs cores. If we did, we could have half of the cores contain only half as many sample memories; those cores would never do PCM synthesis, only MinBLEP or additive, so they wouldn’t need to double-buffer their sample memory. (We could also shrink the sample memory down to 1024×18 and keep just as many sample memories, but the larger size allows higher-quality oversampled sine-wave/MinBLEP tables, so I think it makes sense to push the size and keep the quality.)

I should back up and explain one oddity. For interpolated sampling, the oscillator needs to read two values from the same sample buffer at a time. You might think we would do this with the two read ports on each block RAM, but instead, since we need two block RAMs anyway, we read the even-indexed samples from one and the odd-indexed samples from the other. This leaves the other port free for use by the DMA.
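
In software terms, the even/odd addressing looks something like the sketch below (hypothetical names and widths; the interpolation itself has the same (D-B)*A+C shape the DSP slices compute). Because consecutive samples always land in different banks, one read port on each block RAM covers the lookup.

    #include <stdint.h>

    #define BANK_LEN 1024

    static int16_t bank_even[BANK_LEN];   /* samples 0,2,4,... */
    static int16_t bank_odd [BANK_LEN];   /* samples 1,3,5,... */

    /* phase is fixed point: top bits are the sample index, bottom 8 the fraction */
    static int32_t interp_lookup(uint32_t phase)
    {
        uint32_t idx  = phase >> 8;
        int32_t  frac = phase & 0xff;
        int32_t  s0, s1;
        if (idx & 1) {                       /* first sample has an odd index  */
            s0 = bank_odd [idx >> 1];
            s1 = bank_even[(idx >> 1) + 1];
        } else {                             /* first sample has an even index */
            s0 = bank_even[idx >> 1];
            s1 = bank_odd [idx >> 1];
        }
        return s0 + (((s1 - s0) * frac) >> 8);
    }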

We do this because it allows us to avoid putting a multiplexer (MUX) on the port’s address lines (if a port can be accessed by two different units, then we must MUX its inputs). MUXes can be expensive on FPGAs; the current generation handles them better than previous generations, but wide MUXes are still bad. We try to have no MUX that’s wider than 4 inputs, and keep most at 2.

This is also one of the reasons we keep separate sample buffers for each of the two sample-buffer lookups in the oscillator, and the reason the oscillators don’t share sample buffers. I actually worked through the design for each oscillator being able to choose any 2 of its 4 sample buffers, rather than picking one from each pair; my estimate is that doing this only for the sample buffers still doubled the total amount of MUXing needed to hook the block RAMs up to the functional units and DMA across the entire sAGAs core–from about 250 6-LUTs to about 500 6-LUTs. (A 6-LUT is the core logic element of current FPGAs; it can act as two 5-input lookup tables or one 6-input one.) The Milkymist Spartan-6 has 24000 6-LUTs, so an extra 250 is only about 1% of the total; once I have the paired sample buffer design working it’s something I may try revising into the more flexible mode. (I don’t think the paired design is overconstraining in practice, but it’s the sort of constraint that is often annoying from software, so if there are resources available, it’s probably worth changing.) And that means the control register syntax will be designed from day 1 to allow you to specify any of the four buffers for each lookup; the initial implementation will ignore one of the bits you specify, but this way I won’t have to change the rest of the pipeline if I add this.

sAGAs Mixer

1. sAGAs – Milkymist
2. sAGAs – FPGA synthesizer
3. sAGAs – Synthesis
4. sAGAs – Oscillator design
5. sAGAs – Mixer

The sAGAs Oscillator component described previously is a mono component. We want to produce stereo output, so we introduce another component, which we call the Mixer.

The mixer reads from a finished oscillator buffer and sums it into a “final mixer buffer”. It does this in stereo, so it needs two multiplies, one for the left channel and one for the right. In fact, to make things more flexible, I’m considering having sAGAs burn four multipliers on this so it can implement a 32×18 multiplication, instead of only 18×18. This way the oscillator buffer, which is fed by an 18×18=>36-bit multiply, can contain 32-bit data instead of only 18, which gives more flexibility in writing the oscillator and maintaining precision there. (An alternative approach to avoid using four multipliers might be to offer two optional shifts; the mixer could load the oscillator buffer and then use the bottom 18 bits of either X, X>>4, or X>>8. Another possibility would be to have the mixer operate at one sample per two clocks, and run the data through the multiplier twice.)
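
A rough C model of how a 32×18 product could be split across two 18×18 multiplies (so four for stereo) is below; this is a sketch of the idea, not the actual sAGAs datapath.

    #include <stdint.h>

    /* gain18 is a signed value that fits an 18-bit multiplier input (|gain| < 2^17) */
    static int64_t mul32x18(int32_t sample, int32_t gain18)
    {
        int32_t hi = sample >> 16;            /* top 16 bits, sign-extended    */
        int32_t lo = sample & 0xffff;         /* bottom 16 bits, non-negative  */
        int64_t p_hi = (int64_t)hi * gain18;  /* 18x18 multiply #1             */
        int64_t p_lo = (int64_t)lo * gain18;  /* 18x18 multiply #2             */
        return (p_hi << 16) + p_lo;           /* recombine in a wide adder     */
    }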

Envelope ramping can also happen in the mixer. (The original design didn’t allow every rescale factor to be ramped like the current design does, so this might seem redundant, but it will also simplify the CPU’s job if it can keep additive-synthesis partial amplitudes separate from envelope amplitudes.) Originally, my design toyed with using a lookup table you’d fill with smoothstep() or a more DSP-ish window to make the envelope ramping C1 continuous, but I suspect that’s overkill — and it burns two more DSP multipliers doing the interpolation. When I do the initial software prototype (soon!), I’ll try to evaluate this and see what I think.

Now, the Mixer can only run once the Oscillator is finished computing an oscillator buffer. If the Oscillator is doing additive synthesis, summing many sine waves, or doing band-limited synthesis summing many MinBLEPs, it will be a while before the Oscillator finishes a buffer. For this reason, a single sAGAs core has one Mixer unit, but two Oscillator units. That way the Mixer unit will be idle less often, and the cost of having four multipliers in the Mixer is amortized across more Oscillators.

However, it means that if we can compute an Oscillator voice in a single pass–say we’re doing PCM sampling synthesis, so we just sample the waveform and we’re done–then the Mixer can only keep up with one Oscillator, not two, and half of our Oscillator cycles are wasted. This is the cost of doing business; we can’t be optimal for every synthesis method.

Anyway, the upshot of this is that each Oscillator uses 5 FPGA multipliers, and the Mixer uses 4, so we need 14 multipliers total. The Spartan-6 in the Milkymist has 58 multipliers; the CPU maybe needs 4 to compute a 32×32=>64-bit multiply, plus maybe another 1 or 2 (to perform fast barrel shifting). The screen display unit will need 3 to compute alpha-blended displays. That means we have about 50 left for sAGAs, so with 14 per sAGAs core, we can fit three sAGAs cores into the FPGA. At 95% utilization, that gives us about 25K simultaneous sine waves at 44.1Khz output, so 100 voices with 256 partials, or 200 voices with 128 partials.

Note that if the Mixer only uses 2 multipliers — either we give up on high precision, or we use FPGA logic for the extra bits, or we make the mixer mix a sample every two cycles — then we could reduce the total to 12 multipliers, at which point 4 sAGAs cores becomes feasible — 48 multipliers total.

(An FPGA designed specifically for additive synthesis, but using this same basic design, could basically deliver one sine wave for every two multipliers, so six sine waves for every 12 multipliers, where we deliver four sine waves for every 12 multipliers. So we’re both pretty far off optimal, and really not all that far off optimal.)

sAGAs oscillator design

Since this is a reverse-chronological blog, here’s a list of the sAGAs posts so-far, in correct order. I highly recommend reading these theory posts in order, not reverse order.

1. sAGAs – Milkymist
2. sAGAs – FPGA synthesizer
3. sAGAs – Synthesis
4. sAGAs – Oscillator design

The design for sAGAs is primarily limited by the number of “DSP” multipliers available on the FPGA. The design proceeds by trying to keep those multipliers in use every cycle, and adding whatever logic is needed to try to achieve that.

The core sAGAs unit is called the “oscillator”. The oscillator is capable of computing two interpolated-table-lookups each cycle and computing two more multiplies (in various configurations), and summing the results with a buffer stored in on-chip RAM. (It can also compute an IIR biquad filter, which requires five multiplies when using low-precision fixed-point multiplies, so we actually use five multiply units and waste one when we’re not doing biquads.)

Now, the system was originally designed primarily around the faux-analog band-limited synthesis problem, and it turns out to be pretty wasteful of multipliers for that. So in practice what we end up with is an oscillator that leverages the multipliers as best it can (even though some configurations will use none of them), with the rest of the system designed to keep the core oscillator pipeline itself from ever stalling, even if the multipliers sometimes idle.

So, let’s look at how the oscillator design evolved as I added synthesis methods.

  • MinBLEP – To do our “analog band-limited synthesis”, we need to be able to “draw straight lines” and accumulate them in a buffer, and then add “MinBLEP” and “MinBLAMP” fixups into the buffer as well. The fixups are implemented as interpolated table lookups times a constant. The interpolated table lookups require a DDA-style iterated fixed-point variable used as the address. For “draw straight lines”, we just need to use the iterated fixed-point variable itself as the value to put in the audio buffer, instead of the lookup.
  • Granular – Granular synthesis requires grabbing sections of PCM data, applying a window to it, and adding it to a buffer. We can use the interpolated-table-lookup from above for the PCM data, and then we need another table-lookup to access the window. For flexibility, we should probably make that one an interpolated lookup as well, so a single window table can be resized. So now we require two interpolated lookups multiplied by each other–so one multiply per lookup, plus one to multiply them together. (The same hardware can also be used for ring modulation of PCM waveforms.)
  • What about FM synthesis? To do FM synthesis, we need to feed the output of one interpolated table lookup into the input of the second one. To do this, it turns out we need a fourth multiplier (we need to scale the first table lookup before using it for addressing, and we need to scale the second table lookup for output to the buffer), but otherwise it doesn’t take much hardware. Unlike the other synthesis methods, I don’t know how to avoid aliasing here, so this one is speculative.
  • For additive synthesis, we can use the table lookup for sine waves. Can we use both table lookups? We don’t want to multiply them or feed one into the other, we just want to add them. So each needs its own scalar–so we need four multiplies total, same as FM.

    Also, we need to be able to control the amplitude of each sine wave independently, and they need to be continuous across audio frames, so we need to be able to ramp the amplitudes. That means the scalars we multiply by need to themselves be DDA walks.

    (My original thought was that avoiding discontinuities as the amplitudes of partials changed could be done by mixing, say, 16 samples at the old amplitude, running that through the mono-to-stereo mixer with a 1-to-0 ramp, then mixing the full frame of samples at the new amplitudes, then using the mono-to-stereo mixer with a 0-to-1 ramp for the first 16 samples and just a multiply-by-1 for the rest of the samples. But I decided that seemed clumsier and more wasteful than just making everything DDA stepped. Note also that I was originally thinking that since the on-chip buffers hold 256 samples, you would want to not waste them, and use something like 200 samples per frame, which meant that computing the 200-sample DDA step size would be expensive (a divide). But for latency you want shorter frames anyway, so now I’m assuming you’ll use 64-sample frames (or 128-sample frames if you’re oversampling), so the DDA step size is just a shift, or possibly not even a shift (I may make the step size input be shifted by 6 bits internally, so the CPU feeding it can just subtract end-start).) A sketch of this DDA-stepped inner loop appears after this list.

  • Waveshaping can be done with the FM path (compute a waveform, then run that waveform through a table lookup).
  • Because we can turn on and off whether each of the two table lookups happens, we can actually draw two lines simultaneously, one with each lookup. Adding these together is pointless (the result is still a line), but multiplying them produces a quadratic–though in an annoying form to specify (you have to factor the quadratic). However, since we forced the other multiplication factors to ALSO be DDAs, we can express a more convenient quadratic by computing two lines multiplied together as only the ax^2 term, and then another line as the bx+c term. This isn’t necessarily useful at all, but it’s basically free at this point (we just need a few extra control signals in the hardware, no extra multipliers or adders).
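
Here, as promised in the additive bullet, is a small C model of the per-sample work: a DDA-stepped phase and a DDA-stepped amplitude per lookup, an interpolated table read for each, and accumulation into the voice buffer. All names, widths, and fixed-point formats are illustrative, not the actual hardware.

    #include <stdint.h>

    #define FRAME_LEN 64

    static int16_t wavetable[2048];               /* hypothetical sine table        */

    static int32_t interp_lookup(uint32_t phase)  /* linear interp, 11-bit fraction */
    {
        uint32_t i = (phase >> 11) & 2047;
        int32_t  f = phase & 0x7ff;
        int32_t  s0 = wavetable[i], s1 = wavetable[(i + 1) & 2047];
        return s0 + (((s1 - s0) * f) >> 11);
    }

    typedef struct {
        uint32_t phase, phase_step;   /* DDA for the table address                      */
        int32_t  amp,   amp_step;     /* DDA for the amplitude; for a 64-sample frame,  */
    } partial_t;                      /* amp_step is roughly (amp_end - amp_start) >> 6 */

    static void osc_additive_pass(int32_t voice_buf[FRAME_LEN], partial_t *a, partial_t *b)
    {
        for (int i = 0; i < FRAME_LEN; i++) {
            int64_t sa = (int64_t)interp_lookup(a->phase) * a->amp;   /* lookup 1 * scalar 1 */
            int64_t sb = (int64_t)interp_lookup(b->phase) * b->amp;   /* lookup 2 * scalar 2 */
            voice_buf[i] += (int32_t)((sa + sb) >> 17);               /* sum into the buffer */
            a->phase += a->phase_step;  a->amp += a->amp_step;        /* step the DDAs       */
            b->phase += b->phase_step;  b->amp += b->amp_step;
        }
    }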

To achieve all of this, the core oscillator unit is pipelined (around 10 stages deep), and it uses many separate RAM blocks to make sure it has all the memory bandwidth it needs to keep the spice data flowing.

This pipeline is what that circuit diagram I posted before is showing (and it includes the biquad pipeline, which shares the multipliers as well):

Control signals haven’t been pipelined yet.

If you look at it full-size, here’s the deal: the bottom-most register (on the left) controls whether the biquad pipe or the synthesis pipe is active. If the biquad pipe is active, all the multipliers compute A*B+C, and C is always fed in as the sum from the previous multiplier. This computes a Direct Form I biquad, which is best for low-precision multipliers (especially with higher-precision adds), as here where Spartan-6 gives 18×18 multiplies and 48-bit adds. (It’s best because in direct form, you only ever form products of your filter coefficients with input values and output values, never with intermediate temp values. As long as the output values don’t clip (and you normally don’t want them to clip!), this means you can use all the available precision for the multiplies.)
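
As a reference for what the pipe computes, here is a plain C Direct Form I biquad with the same property: every product is a coefficient times an input or output sample, summed in a wide accumulator. The Q15 scaling is illustrative; the hardware chains the sum through the C inputs of the multipliers rather than summing in software.

    #include <stdint.h>

    typedef struct {
        int32_t b0, b1, b2, a1, a2;   /* Q15 coefficients (a0 normalized to 1) */
        int32_t x1, x2, y1, y2;       /* previous inputs / outputs             */
    } biquad_t;

    static int32_t biquad_df1(biquad_t *f, int32_t x)
    {
        int64_t acc = (int64_t)f->b0 * x
                    + (int64_t)f->b1 * f->x1
                    + (int64_t)f->b2 * f->x2
                    - (int64_t)f->a1 * f->y1
                    - (int64_t)f->a2 * f->y2;   /* five multiplies, one wide sum */
        int32_t y = (int32_t)(acc >> 15);       /* back to sample scale          */
        f->x2 = f->x1;  f->x1 = x;
        f->y2 = f->y1;  f->y1 = y;
        return y;
    }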

The multiplier units have optional internal registers, and for consistency I’ve enabled them for the B and D inputs, which otherwise have the highest latency from input to output of the multipliers. Most people don’t try to pump FPGA biquads at one output per cycle, because you have to feed the output result back into the last multiplier (last two, actually), and that can become a critical path. So the recommendation is to emit them every other cycle (or such), and crank the clock rate up as high as possible. Since Milkymist is only clocked at 100Mhz, however, I don’t have the option to crank the rate. The latency from the A input to the multiplier output (register) is 6.40ns, so I’m assuming the FPGA can manage to route that output back to the input in 3ns and meet the timing requirements.

I may expand the above discussion and move it to a separate post, so don’t be surprised if it shows up again.

As I mentioned previously, one flaw of this design is that the MinBLEP path, which was the original motivation for this whole system, actually only uses two multipliers — one for the interpolated table lookup and one to rescale it to the right magnitude, exactly the same as is used for, say, additive sine-wave synthesis. But unlike additive, I can’t compute two of these at the same time and add them, because they’re offset in time. (It could be done if there was padding at each end of the table, and I used the extents of both, but that’s inefficient and kind of gross. I have some other ideas for things to try, but they’re probably too expensive for the CPU.) An alternative strategy would be to add in even more hardware so that the two scaled interpolated lookups can be added into two different positions in the buffer, or, since it might be more likely to be efficiently implementable, into two different buffers. Unfortunately, neither of these seems very feasible; the former requires RAM with two write ports and two read ports, which can’t be efficiently implemented with FPGA block RAM; the latter, by doubling the number of oscillator “voice buffers”, requires a massive increase in MUXes for the other components that read/write those buffers, the details of which I’ll discuss more in the next installment.

sAGAs synthesis

1. sAGAs – Milkymist
2. sAGAs – FPGA synthesizer
3. sAGAs – Synthesis

The magic of audio is that it’s essentially linear, in the mathematical-jargon sense–almost everything is sums of simple products. A waveform is a stream of numbers. Multiplying a waveform by a constant makes it louder or softer. Adding two waveforms mixes them together like an audio mixer.

Sample-Based Synthesis

Sample-based synthesis, also called PCM sample synthesis, works by recording a waveform of a sound, and then simply replaying that waveform to recreate the sound. We can make it softer or louder by multiplying it. To make it higher- or lower-pitched, we need to resample it; we can (poorly) resample it by computing linearly interpolated samples from the waveform.
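
A minimal sketch of linearly interpolated resampling (hypothetical fixed-point format, no bounds or looping logic):

    #include <stdint.h>

    /* phase_step is Q16.16: 0x10000 plays at the original pitch, 0x20000 an octave up */
    static void resample_linear(const int16_t *src, int16_t *dst, int n_out,
                                uint32_t phase_step)
    {
        uint32_t phase = 0;
        for (int i = 0; i < n_out; i++) {
            uint32_t idx  = phase >> 16;
            int32_t  frac = phase & 0xffff;
            int32_t  s0 = src[idx], s1 = src[idx + 1];
            dst[i] = (int16_t)(s0 + (((s1 - s0) * frac) >> 16));
            phase += phase_step;
        }
    }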

Resampling by linear interpolation introduces aliasing, and the aliasing gets worse the larger the resampling ratio is, but correctly anti-aliased resampling by large amounts still doesn’t tend to sound very good, since natural sound sources don’t speed up or slow down proportionately the way resampling does. As a result, simple linear interpolation should be good enough; quality can be improved by prefiltering, oversampling, and using a final high-quality downsampling filter (which will be applied to a final mixed signal, not independently to every sound).

Now, artificial sounds like a square wave or sawtooth wave do keep their spectrum as they shift in pitch, and the core waveform doesn’t inherently evolve over time, so storing those waveforms in wave memory and using high-quality resampling might be an appropriate mechanism for them. Fortunately, for these we only need to store one cycle of the wave (as opposed to a traditional sample, which is constantly evolving), so we can probably store prefiltered “mipmaps” and use other techniques to get a high-quality result. This will be the mechanism used by sAGAs for “arbitrary waveforms” that are drawn by the user.

Band-limited Synthesis

However, we have a different technique for square/saw/triangle waves, or, in general, for any waveform defined (before band-limiting) entirely as straight lines. This synthesis technique involves some things called MinBLIT, MinBLEP, and MinBLAM, which I will briefly summarize, because the explanations on the internet at large are scattered.

An FIR filter is a filter that is a simple convolution (there is no feedback or extra state). Imagine taking an oversampled “digital” square wave (one that simply jumps between 1 and -1) and using a high-quality FIR filter to downsample it. Make the square wave really low frequency, like 1Hz. What you’ll end up with is something that wobbles around each transition from low-to-high or high-to-low, but is absolutely flat at -1 and 1 over the long times between transitions (because the limited window of the FIR filter can see nothing but all-1s). Now, increase the frequency of the square wave. The flat regions get narrower, but the wobbling transition from low-to-high is exactly the same regardless of the frequency of the square wave, because the FIR filter sees the exact same data within the transition region. Not until the frequency gets high enough that the period reaches twice the FIR-filter width do the shapes change.

(Ok, it’s slightly more complicated than that: the square wave actually goes from low-to-high in between samples, and the location between samples where it happens changes what the wobbly transition looks like. However, if the transition always falls at the same spot in between samples, the wobbles will be identical, even as the frequency increases.)

We can capture that transition wobble and store it in a buffer. (To deal with the sub-sample problem, we actually oversample the transition wobble, capturing it for, say, 32 unique subsample positions, and interpolate them.) Now, to synthesize a band-limited square wave, we just draw the normal square wave, and we overlay that transition wobble at the point where the square wave goes from low-to-high or high-to-low.

What about when the frequency gets higher and they would overlap? We add them. Let me explain that sensibly. Let’s step back from a square wave to a simpler case; consider just a step function: a function that is 0 everywhere to the left of the origin, and 1 everywhere to the right of it. We can band-limit this using the same wobble function described above. Now, because audio is linear, we can consider a square wave to be a sum of these step functions shifted in time. Each transition in the square wave requires a single instance of the step function (multiplied by 2 or -2 depending on the step direction); if we sum them all up we get the square wave. We can create each band-limited step function by using 0 or 1 outside the transition region and the precomputed step wobble within it. Then we just need to sum all of those waves. Summing the infinite 0s and 1s is easy; that just produces the naive digital square wave. And if it turns out the wobble regions overlap, that’s fine, the above logic still applies; we just sum the wobble regions where they overlap, since we’re just summing the band-limited step functions.

How does this shake out as an implementation? Let’s say you’re mixing 100 samples at a time, and your wobble fixup is 32 samples long. Then you need a 131-sample buffer. You compute your digital waveform (1 or -1) for 100 samples, and add that into the first 100 samples of the buffer. Then you go back through, and at each transition point you add in the appropriate transition wobble over the next 32 samples. (To do this, I store the transition wobble so it goes from -1 to 0. That means at the exact point where I start adding the transition wobble, I change the discrete digital waveform from 0 to 1. The added waveform (going from -1 to 0) initially cancels this back to approximately 0, but then the added amount gradually approaches 0, exposing the final value of 1.) If successive transitions overlap, no problem, they’re just added. The last 31 samples in the buffer store any “pending transition wobbles”, and are used to initialize the first 31 samples of the next frame of audio data.
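
Here is a sketch of that bookkeeping for a square wave, under some assumptions: a 100-sample frame, a 32-sample residual table (stored running from -1 up to 0) that is presumed precomputed, and no subsample interpolation (the real thing stores the residual at many subsample offsets and interpolates between them). Names are hypothetical.

    #include <string.h>

    #define FRAME     100
    #define BLEP_LEN  32
    #define WORK_LEN  (FRAME + BLEP_LEN - 1)     /* 131 */

    extern const float minblep[BLEP_LEN];        /* precomputed residual, -1 -> 0 */

    typedef struct {
        float tail[BLEP_LEN - 1];   /* pending fixups carried into the next frame      */
        float phase, step;          /* step = 2*freq/samplerate: each wrap is one edge  */
        float level;                /* current naive square level, +1 or -1             */
    } sqr_t;

    static void square_frame(sqr_t *s, float out[FRAME])
    {
        float work[WORK_LEN] = {0};
        memcpy(work, s->tail, sizeof(s->tail));          /* last frame's pending fixups */

        for (int i = 0; i < FRAME; i++) {
            s->phase += s->step;
            if (s->phase >= 1.0f) {                      /* a transition lands here     */
                s->phase -= 1.0f;
                float dir = -2.0f * s->level;            /* +2 or -2 step               */
                s->level += dir;                         /* flip the naive wave now     */
                for (int j = 0; j < BLEP_LEN; j++)       /* overlay the residual fixup  */
                    work[i + j] += dir * minblep[j];
            }
            work[i] += s->level;                         /* naive digital waveform      */
        }
        memcpy(out, work, FRAME * sizeof(float));
        memcpy(s->tail, work + FRAME, sizeof(s->tail));  /* save the 31-sample tail     */
    }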

How do we get a practical transition wobble? The general technique arose from a paper which introduced BLITs, band-limited impulse trains. They computed band-limited impulses in sequence, then used integrating filters to integrate the impulses into steps. But rather than computing impulses independently, they pre-linked them into trains, which required knowing the distance between successive impulses (which means the frequency). BLEP throws this out and pre-integrates the BLI; an integrated impulse is a step, so an integrated band-limited impulse is a band-limited step. Now, one drawback to the approach described above is that the natural FIR filter has an inherent delay, since it’s symmetric around the transition. We’d like to minimize that latency. We really want more of an IIR filter response, windowed. The windowing means it’s really just an FIR in the end, we just want a different kind of FIR than normal (we want one that’s not symmetric around the midpoint).

Fortunately, there’s a well-known technique for converting any FIR filter to a minimum-latency (“minimum phase”) FIR filter; it involves computing the “cepstrum” of the filter. So in the end what we do is take a band-limited pulse (a windowed sinc), compute the corresponding minimum-latency FIR filter, integrate that, and out pops a minimum-latency band-limited step, aka MinBLEP. Here’s code. (public domain yay!)

Not only does this work for square waves, but it also works for sawtooth waves; sawtooth waves just have a single jump discontinuity once per period, and are otherwise straight lines. (This also lets us create band-limited “hard sync”, in which one oscillator resets to 0-time every time another oscillator finishes a period. Both the sawtooth and square wave can be reset with a single step discontinuity. Hard sync was the focus of the paper that introduced MinBLEPs.)

It doesn’t work for triangle waves, which are C0 continuous but have a C1 discontinuity at each peak. To deal with triangle waves, what we need is a band-limited C1 discontinuity; imagine the wave y(t) = |t|. If we precompute the wobble around a band-limited version of that function, we can substitute it at any other C1 discontinuity. In practice, we use the slightly simpler function which is 0 when t<0 and t when t>=0 (aka y(t) = (|t|+t)/2). (Again we can understand a triangle wave as a sum of these, one at every reversal of the triangle-wave slope.)

Now, we can compute this wobbly fixup by just integrating the MinBLEP. The problem we’ll discover is that this fixup doesn’t converge to the original function. Remember, the MinBLEP from 0 to 1 absolutely converges on 1 at the end of the waveform. The MinBLAM (as it is called in one or two places; this is seemingly cutting-edge stuff) has a slope that absolutely converges on 1 at the end, but its position is not correct; it has been time delayed by a significant number of samples. To deal with this, we either need to actually add the MinBLAM time-shifted relative to where the digital line discontinuity occurs, or we have to add in an extra constant fixup each time a MinBLAM is added.

With MinBLEP and MinBLAM, we can now synthesize any arbitrary waveform made up of line segments, and properly band-limit it. (In practice, we may still want to oversample somewhat, since the MinBLEP and MinBLAM aren’t perfect, since they’ve still been windowed and have limited oversampling.)

Another problem we have is pitch changes. When the pitch of a square wave changes, the discontinuous steps just get spaced further or closer together. But when the pitch of a sawtooth changes, the slope of the wave actually changes. This may introduce audible aliasing, but we can fix it up by adding a MinBLAM at that point.

Of course, changing the amplitude of the sawtooth wave also effectively changes the slope of the sawtooth, but we ramp the amplitude smoothly, whereas a pitch change abruptly changes the step-size at a frame boundary. But if you wanted to change the volume discontinuously, you could certainly pursue treating the amplitude-and-pitch-changing sawtooth waveform by combining all those effects explicitly into the digital line sequence, and generating the appropriate fixups for all of them.

One problem with MinBLEP and MinBLAM is how many fixups are needed. As the frequency of the waveform gets higher, more fixups are needed. If the fixup length is 32, and the period is longer than 64 samples, we perform fewer fixup operations than primary wave operations. But at, say, 1.1Khz for 44.1Khz output, the period is about 40 samples, so a square wave will compute 64 samples of fixups for every 40 main samples. At 11Khz, it will need 64 samples of fixups for every 4 samples. This is significantly inefficient. Thus if you tried to “supersaw” at 11Khz using MinBLEPs you’d be using a lot of computation.

(A supersaw is just a bunch of sawtooth waves slightly detuned from each other. In implementation, that means the sum is just a single wave climbing much, much steeper, with a MinBLEP’d discontinuity happening more often, one each time any component wave wraps.)

Additive Synthesis

An alternative synthesis method for analog waveforms is additive synthesis, where you just sum up a bunch of sine waves representing the overtones of the desired wave. (A stable repeating waveform must be representable as the sum of sine waves whose frequencies are integer multiples of the fundamental; this is exactly what the Fourier series computes.) This is viable because sAGAs can do so many table lookups. Additive has the opposite problem from MinBLEP: at 11Khz, only the first and second harmonics fall within the Nyquist limit and need to be synthesized; at 1.1Khz, only the first 20 harmonics do. So it’s possible to select between the synthesis methods depending on the pitch of the waveform.

However, if the pitch changes over time, it’s not really viable to switch from one method to the other, as the MinBLEP approach introduces phase changes that aren’t sensible to try to reconstruct with additive synthesis, which means that cross-fading between the two would cause audible artifacts. So most likely you would pick a particular synthesis method based on the starting pitch of the waveform, and then live with the performance consequences if that pitch changes over time.

Granular Synthesis

Granular synthesis breaks sound up into “granules” and strings them back together. Basically, you take little snippets of sound and multiply them by windows, and then resum them. I don’t really know much about it, and I’ve never tried it.

Closely related, however, is pitch-shifting. Take a waveform at a known frequency, and then break it into frames each one period long. Slightly expand the frames and then window them, so that if you sum all the frames you get back the original signal. Now, first simple case: move the frames all a little closer–if they’re 200 samples long, start each one 180 samples after the previous one. What happens? The sample plays back faster, and the pitch shifts proportionately, but weird things happen to the overtones.

More usefully, you can simply omit some frames, or repeat some frames, but always start each frame 200 samples after the previous one. If you omit some, the sound will speed up. If you repeat some, the sound will slow down. But either way, the pitch will not change. Now, resample the new sound to slow it down or speed it up by the opposite amount from the first operation; now you’ve pitch-shifted the sound without changing its speed. How good is the quality of this pitch-shift? I have no idea.
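
Here is a sketch of the omit/repeat-frames trick, assuming the signal is roughly periodic with period P: grains two periods long, windowed so that grains overlapped at a hop of P sum back to the original, with the source position advancing at `rate` times the output position. Resampling the result by the inverse factor then gives a pitch shift at the original speed. The names and the window choice are mine, not a specific algorithm from this post.

    #include <math.h>

    static void stretch(const float *in, int in_len, float *out, int out_len,
                        int P /* period in samples */, float rate /* 1.0 = unchanged */)
    {
        for (int i = 0; i < out_len; i++) out[i] = 0.0f;

        for (int g = 0; g * P < out_len; g++) {
            int out_pos = g * P;                 /* grains every P output samples    */
            int in_pos  = (int)(g * rate) * P;   /* repeat (rate<1) or omit (rate>1) */
            for (int j = 0; j < 2 * P; j++) {
                if (in_pos + j >= in_len || out_pos + j >= out_len) break;
                float w = 1.0f - fabsf((float)(j - P)) / (float)P;  /* triangular window */
                out[out_pos + j] += w * in[in_pos + j];
            }
        }
    }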

FM Synthesis

FM synthesis works by “modulating” the “frequency”. Picture a sine wave. Now, imagine we sped up and slowed down the x axis synchronously with the waveform repeats, so the waveform stretched and squished. We could do this by, instead of computing sin(x), computing some more complicated function of sin(), specifically one that would repeat every 2pi as well. For example, sin(x + k*sin(x)). Or sin(x + k*sin(3*x)). And they don’t have to be sine waves; either the inner or the outer function could be square waves, sawtooth waves, triangle waves, or even an arbitrary PCM sample (but the interesting FM sounds arise from using simple tones in simple ratios to each other).
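
A tiny floating-point sketch of the sin(x + k*sin(r*x)) idea, standing in for the fixed-point phase accumulators and table lookups the hardware would use:

    #include <math.h>

    static void fm_voice(float *out, int n, float freq, float samplerate,
                         float ratio /* modulator:carrier */, float index /* k */)
    {
        const float twopi = 6.28318530718f;
        float carrier = 0.0f, modulator = 0.0f;
        float cstep = twopi * freq / samplerate;
        float mstep = cstep * ratio;
        for (int i = 0; i < n; i++) {
            out[i] = sinf(carrier + index * sinf(modulator));
            carrier   += cstep;   if (carrier   >= twopi) carrier   -= twopi;
            modulator += mstep;   if (modulator >= twopi) modulator -= twopi;
        }
    }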

I do not know how you can cleanly antialias FM sounds. (Most likely you’d just oversample them by 16x or 64x and put a lot of work into decently downsampling them.)

Subtractive synthesis

Subtractive synthesis starts with simple waveforms (like sawtooth waves) and then applies filters to them to remove frequencies. (And sometimes boost them, but a filter can’t add in any overtones that are literally zero in the source.)

To do this, we need filters, so let’s look at the filters. (Actually, if you use additive synthesis to create the basic waveforms, then you can just explicitly apply the filter in frequency space to the additive partials, boosting and reducing the amplitudes at the various frequencies. Probably not worth doing, though, as it puts a lot more load on the CPU.)

IIR Filtering

IIR (Infinite-impulse-response) filters are filters which include a feedback loop between the output and the input, as in the filter 'next_out = (prev_out + next_in)/2'. These filters have an “infinite” response to an input impulse, extending forever.

sAGAs directly implements a “biquad” IIR filter, which is a filter that depends on the current input, the previous two inputs, and the previous two outputs. This is a common building block for IIR filters; rather than implementing higher-order filters explicitly, you simply cascade several biquads.

sAGAs can compute a biquad filter at the same rate that it can compute an FM waveform or a pair of sine waves–or rather, instead of computing them, since the biquad reuses the same multipliers.

(Biquad filters are 12dB/octave; Moog filters are 24dB/octave. You’re not going to get a Moog filter by just cascading two biquad filters, but it gets you into the right ballpark, and is pretty cheap in sAGAs.)

FIR Filtering

An FIR (Finite-impulse-response) filter is computed from some number of recent inputs, without regard to the outputs. sAGAs has no dedicated hardware for this operation. The PCM waveform resampling can be used to compute two FIR taps on a waveform loaded into sample memory, or the mono-to-stereo hardware can be used to compute a single FIR tap with more precision.

FIR filtering will thus require many passes and use a fair amount of compute time. This is ok, because most “musical” filters will be biquads. FIR filters will be used primarily for system purposes, such as downsampling oversampled waveforms, which only needs to be done to the final stereo mixes of those waves, rather than per-voice.

A similar technique can be used to implement delays (echoes), and delay-related effects like flangers. (If delay taps need to fall between samples, then the data must be loaded into sample memory so it can be interpolated; the mono-to-stereo processor doesn’t provide interpolation.)

sAGAs

So, I’m working on a new FPGA project, using Milkymist, instead of working on Joaf.

Why drop Joaf? Well, the only reason I was doing Joaf was because I thought it would be fun. There was no plausible “market” for it (the possible, but implausible, one I could think of was that it could become a platform for demosceners to play with). Having done as much of the design as I did, I got a lot of the fun out of it, and actually producing it isn’t necessarily that much more fun. It would be nice to actually have working hardware, but if another project comes along that has some actual utility, it probably makes more sense to move onto that.

That project is called “sAGAs”.

sAGAs is an FPGA (or ASIC or whatever) component for performing audio synthesis.

It is capable of many modes of synthesis: PCM waveforms, additive, band-limited analog, granular, and FM.

It’s not going to change the world or anything, but there might actually be a market for it: a boutique “analog” synthesizer that’s fully, deeply programmable.

To make this work, you would probably in many cases want to synthesize at a higher frequency and downsample to reduce aliasing. This means making it run very fast is useful. When running at 44.1Khz, therefore, it should be able to produce an absurdly high number of voices (leaving headroom for oversampling at 4x or whatever).

Back-of-the-envelope calculations suggest a single sAGAs core can compute about 9000 mono sine waves or mono PCM-sampled voices per second with 44.1Khz output. However, the PCM-sampled voices would be RAM bandwidth limited (if every wave is independent, it requires 6Gb/s), and it may not be possible for the soft CPU to feed all this data fast enough. In practice, if the waves are independent, you’re actually limited to about 2000 voices/waves per second, since the mono-to-stereo mixer will become the gating factor. The higher numbers are achieved by mixing together multiple waves into a single mono sound. For example, additive synthesis with 128 partials (harmonics) would allow 70 mono voices, thus barely touching the mono-to-stereo mixer’s limits. (70 isn’t absurdly large, but additive synthesis doesn’t require oversampling, as you simply don’t mix in the partials that exceed 22Khz.)

You can also throw in some filter effects using an IIR biquad–basically, each cycle the waveform-generating unit can either produce a pair of waveform samples or perform an IIR biquad step. So if you split half and half, you can do about 2000 PCM waves, each with one biquad applied, and be running every functional unit at 100%.

This is all using one sAGAs core. The FPGA in the Milkymist easily has room for two, and might even have room for three (each sAGAs core uses 14 “DSP” multipliers, of 58 total). The sweet spot is probably two plus an additional coprocessor dedicated to generating the command stream for generating the waves, since the soft CPU is going to have trouble keeping up.

One goal would be to create a standalone sAGAs device (for now, using the Milkymist hardware), which could be used as a low-latency programmable synthesizer and possibly also be designed for monome input, picking up the work I was doing on sGAR.

Why do this on an FPGA? Partly to scratch the FPGA itch that Joaf was for. As an alternative, I could use the Beagleboard; the OMAP DSP is pretty powerful (eight 16×16 multiplies every cycle, clocked at 1Ghz; dunno what fraction of those you could achieve, but that’s a potential 8B multiplies/sec, where one sAGAs core delivers ~1B/sec). But a cool thing about sAGAs on Milkymist is that the whole system would be open source; although the FPGA itself isn’t, you can build it on any FPGA with 18×18 multipliers, or even make an ASIC. With Beagleboard you’re locked into that DSP.

Or it could just be on an x86, like a laptop. sAGAs computes 4 interpolated wavetable lookups every cycle at 100Mhz. On a PC, you’d probably need around 15 instructions to do one interpolated lookup (and the extra stuff around it); with a 2Ghz processor and 2 instructions/cycle that gives you about 2.6 interpolated lookups per 100Mhz cycle — using only one core. For sine waves, you can compute them by rotating a 2D point; this costs 4 multiply-adds plus some setup, so call it 10 cycles/sample–but you can use SIMD to compute 4 simultaneous ones. That brings you to 16 sine waves per 100Mhz cycle on one core. Of course the x86 isn’t open source, but it’s ubiquitous. (Then again, we’re not comparing equal-cost chips, and I’m not sure what that comparison would be. The highest-end Spartan-6 offers 180 multipliers, so on one of those we could have 12 sAGAs cores for 48 sine waves per 100Mhz cycle. Might hit memory bandwidth limits there, though. The highest-end Virtex-6 has 864. The forthcoming Xilinx next generation will have many more.)
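
For reference, the rotate-a-2D-point sine oscillator looks like this (a sketch; the SIMD version just runs four of these in parallel):

    #include <math.h>

    static void sine_by_rotation(float *out, int n, float freq, float samplerate)
    {
        float w = 6.28318530718f * freq / samplerate;
        float c = cosf(w), s = sinf(w);       /* setup, once per voice/frame     */
        float x = 0.0f, y = 1.0f;             /* phase 0, amplitude 1            */
        for (int i = 0; i < n; i++) {
            out[i] = x;
            float nx = x * c + y * s;         /* rotate (x, y) by w each sample  */
            float ny = y * c - x * s;
            x = nx; y = ny;                   /* four multiplies, two adds       */
        }
    }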

Regardless, I think I like the idea of top-to-bottom open-source hardware, like the monome. A monome-philosophy programmable synth would be awesome, I think. So that’s why I’m going with the FPGA.

Over the next few posts I’ll describe how sAGAs works, and the many different synthesis modes it supports; to whet your appetite, here is an incomplete circuit diagram for the core mono-synthesizer/biquad-filter.

Milkymist

So the reason I didn’t really want to make a Projects blog at all, originally, is because I hate revealing projects that I never finish, and clearly Joaf and sGAR now fall in that category. Not really sure what the point of documenting failure is, and there’s some worry that publishing the stuff encourages me to never finish it (since publishing it gives me some of the satisfaction of sharing the stuff).

Anyway, both a lot and very little has happened since the last update.

The very little: everything that has happened has only happened on paper.

The a lot: I discovered Milkymist. I’d been waiting for a long time for FPGA Arcade to start selling, but it never happened. Milkymist did, though, and I own one, even though I haven’t started programming it.

If you look through the rest of the blog, you’ll see I was really hung up on the RAM problem. Milkymist solves this, and provides some other nice stuff.

What is Milkymist? It’s an FPGA-based system-on-a-chip, with “retail” hardware that has tons of features – VGA, stereo audio, MIDI in/out, USB, ethernet, etc. It’s got a bunch of video hardware for doing Milkdrop-like effects for VJing, none of which I care about. But since it’s all in the FPGA, I can strip all that stuff out. It seems pretty great as a development platform: all the hardware connections I need, a decent FPGA, a good RAM solution.

So here’s how RAM works in Milkymist: the system uses 32-bit DDR SDRAM, and the system is clocked at 100Mhz. The RAM is set to send bursts of 8 32-bit values; because it’s DDR, it sends two per clock. Thus memory is read at 64 bits per cycle, producing a theoretical bandwidth of 64 bits × 100Mhz = 6.4Gb/s, or 800MB/s. However, DRAM has overheads due to page switching and refresh logic; actual measured performance is estimated at between 3 and 4 Gb/s, or about 400MB/s. This is much higher than I was originally imagining, but because it reads in blocks of 32 bytes, it’s necessary to have significant caching logic to make it workable, which as I mentioned is fairly tricky for one important case of Joaf (drawing vertical Doom spans). Fortunately, I’m no longer working on Joaf, and big bursts are fine for my new project (see next post).

Milkymist is built on a Xilinx Spartan-6 FPGA. The X45 includes 116 block RAMs (each 18Kb, separable into two 9Kb ones), and 58 “DSP” slices, which are basically 18-bit multipliers, but can actually compute (D-B)*A+C, which is exactly what you need to compute an interpolation.

The existing Milkymist logic uses an unknown number of those elements, but I’m just using Milkymist as a development environment; I’ll just strip it of things to make room for my stuff. So Milkymist has totally brought me back into the fold.

Unfortunately, it’s all still on paper because setting up the development environment is kind of a pain in the ass.

Joaf memory latency 2

Last post I was talking about DDR SDRAM, and the complexity of using it.

What about SRAM? I originally assumed it would be too expensive, but let’s take a look. Looking into part pricing on digikey, it appears that 256k x 16bit 10ns Asynchronous SRAM can be had for $5-10, so 1MB would be $10-20 (on top of $20-30 for the FPGA). That’s a reasonable tradeoff if the Asynchronous SRAM makes sense.

While the operation is conceptually simpler, it’s still pretty messy; my FPGA book’s “simple” SRAM interface requires 2 50Mhz clock cycles of latency (i.e. 40ns, despite the part being 10ns), and requires another cycle in between operations. Even bringing it down a cycle, and using 2 16bit parts, that would be a bandwidth limit of 32bits x 25Mhz = 100MB/s. It’s possible to achieve higher performance but you have to get messy and use FPGA-specific features. With 10ns RAM the limit appears to be 20ns (given off-chip delays and such).

Note that in the DDR design I was assuming I would always read 64 or 128 bits at a time because the latency was so high and it’s page mode, so it was better to add a tiny bit more latency and improve the bandwidth. Here the latency is low and there are no fast repeated reads, so I’ve doubled the SRAM chips (from what my starter board provides) to double the bandwidth up to 32 bits.

Assuming no caches, every pixel Whiz writes would be accompanied by an SRAM memory read, so if the SRAM read takes two clocks back to back at 50Mhz Whiz would be limited to 50/2 megapixels/second. Given the display requires 4.5 megapixels/second, this limits us to a depth complexity of about 5, and we don’t get a speedup from using paletted data. In the naive design the performance limit is 50/3 megapixels/second, for a depth complexity of 3.

If we assume a streaming cache (just cache the last 32-bit read), then we would be limited to the above rate for rotated data, but non-rotated paletted data could process 4 pixels without needing a memory read. Then the pixel pipeline should be written to always write every clock, and will need to pause whenever it goes to memory; or we could build into the pipeline a two-cycle latency and only pause it if back-to-back reads are required. (We’ll also require reads for the character lookup.)

That’s probably a reasonable middle ground, so I’ll proceed with this design.

The Cyclone II starter kit I have only has 1 SRAM, so I’m more bandwidth limited–I can only read 16 bits at a time so pixel formats that use 16 bits will be limited by SRAM speed, and other accesses to memory will be slow. If I switched to Spartan 3 I believe those have 2 256Kx16 SRAMs so I could test the full performance.

Joaf memory latency

I posted before that the reason I stopped Joaf was that I couldn’t figure out the DRAM latency needed to finish the paper design.

I finally tracked it down here.

Using Xilinx’ DRAM controller, the expected latency from initial memory issue to readback of the first bytes is 17 cycles.

If I understand it correctly, this is with the memory interface clocking at the memory bus speed, e.g. 166Mhz or 200Mhz (for DDR2-333 and DDR2-400 respectively) — there’s a separate app note than the above for 400Mhz DDR2. I presumably won’t be able to make the rest of the system run at 200Mhz, but I think it’s feasible to make it run at some integer fraction and have buffering to communicate from one to the other. (Why they wouldn’t have made that part of the interface themselves, I don’t know. Again, it’s kind of a gross part to have to implement up front.)

This leads to about 12M reads/second; fudging it a little to allow time for some extra cycles for a short burst read of some length, about 10M reads/second.

For 320×240 at 60hz, there are 4.6M pixels per second. (And I’d really like to target 640×480, but it’s clear this isn’t feasible while doing anything interesting.)

If we assume about 80% of memory time is used for reading pixels (the rest for display list reads, read/write from soft CPU, audio, etc), then we get about 8M reads per second. Assuming we can read & use 64-bits at a time per read (we can actually go faster than that if we stay within a row, so it might be better to define the interface as always reading even 128-bits), that gives about 64MB/s, or about 14 bytes per pixel.

With paletted images, that’s an average depth complexity of 14; with 16-bit images, an average depth complexity of 7. That seems like plenty! (But it would be pushing it for 640×480. Although we could consider 640×240, i.e. interleaved 640×480.)

The 200Mhz thing is actually fairly aggressive (can’t be achieved on the eval kit directly); 166Mhz is more plausible, which brings it down to a depth complexity of 10/5.

If rotating (or e.g. Doom spans), so each read produces only one pixel, we have a depth complexity of something like 1.2; which means we can rotate the whole screen and almost nothing else, or we could rotate individual sprites but not the background. Sounds reasonable!

The thing that rotation needs for performance is caching, but because of the line buffering approach, the cache has to last long enough to get to the next line to be processed. Non-rotated data won’t take up as much cache (instead of each displayed pixel requiring its own cache line, multiple displayed pixels come from the same cache line), but if we ONLY cache rotated data, we might make it more manageable.

Spartan-6 is the current Xilinx thing, but I’m not sure they’re actually available yet, so I haven’t found any prices. A reasonable Spartan-3a offers 360Kb of block RAM; assume I could use half of it for a cache. A Spartan-6 with a similar number of slices offers 936Kb-2Mb of block RAM; a version slightly reduced in slice count, but with more estimated logic cells, still offers 576Kb of block RAM. The current design puts all palettes in block RAM (say, 2048*32bits = 64Kb), plus all active spans (say, a worst case of ~400 spans of 200 bits is 80Kb), plus the current heads of all display lists (minimal). This requires something like ~150Kb of block RAM.

Two full rows of rotated pixels will need 640 cache lines. To make the cache addressable we’ll make that 1024 cache lines. Each cache line will need to store a few pixels to get useful caching; say, 64 bits. Each cache line will also need to store the line’s address; say 20 bits. So we’d need to store at least 80Kbits in a cache to make this effective. And hey, that sounds pretty feasible; in fact we can bump it up a lot more. On the Spartan-6, maybe we should just make all memory accesses go through a cache, although having the CPU cache interfere with the GPU cache might be bad, because we want the GPU to be predictable. But we can make all GPU accesses go through a GPU cache, and then a unified cache for CPU/audio. Given the size available on Spartan-6, maybe we bump the GPU cache up to 4096 entries = 320Kb of block RAM. This might make higher resolutions more feasible, too.

ZSAMIF – massive detuned oscillators

Just before the weekend before I got my monome, I saw a short clip from How To Destroy Angels showing the Swarmatron being played.

At this point in sGAR’s life, I had the basic software running, I had the grid sequencer and the mixer up, but the grid sequencer was controlling this crappy soft synth I had written for a videogame and wasn’t planning on shipping. I knew I needed a proper softsynth, and the Swarmatron was perfect inspiration. I would make a softsynth with tons of detuned oscillators; computers are fast and can do a good job of this.

Working on paper, I eventually came up with the performance-oriented design found in the demo below. A virtual keyboard is at the bottom, and then controls are in the middle and top. There were so many controls I realized I’d have to have subscreens with all the different controls. To allow you to play music with both hands and then change controls on the fly, I made a design where the modal controls (the ones that aren’t sliders) are duplicated on the left and the right. Leaving padding, this meant sliders in the middle could only have 12 values.

Thus the original ZSAMIF design was born: it would have 1 to 12 oscillators. (In the final version, the 12 buttons are remapped so the buttons further to the right add more than one oscillator, so the actual range is 1-32 oscillators, although I think more than 12 turns out to be pointless.)

Originally ZSAMIF was called MAZZIF, as a nice google-able misspelling of “massive”, but eventually I discovered there was already a softsynth named “MASSIVE” and I didn’t want people to think mine was a clone of theirs, since I’d never heard of it when I designed mine.

Since that video was made, I added many features. Most important is a “parameter change rate” control, which allows you to change a parameter in the UI while it only changes gradually in the synth. This lets you do things like PWM sweeps that run automatically but are user-triggered. There’s also a variant where parameters only change when you hold them down, and then change back when you release them. This lets you get a nice live-performance effect and was directly inspired by the How To Destroy Angels video.

ZSAMIF offers 4 waveforms: sine, triangle, square, and sawtooth. Each one has two variants, which respond differently to the “pulse width” setting (which perhaps should be thought of as “morph” rather than “pulse width”).

All 32 oscillators use the same waveform. The oscillators are controlled in parallel; a single slider sets the “spread” of a given value across all the active oscillators. So the “detune” slider controls how wide the pitch detuning is. The “octave” slider determines the degree to which different oscillators are in different octaves (or overtones). The “pitch lfo depth” slider determines the degree to which the different oscillators pitch bend differently in response to the pitch lfo, and the “pitch lfo speed” slider determines the degree to which each oscillator’s lfo runs at a different speed from the “base” speed.
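As a minimal sketch of the “one slider spreads a value across all oscillators” idea (the exact mapping ZSAMIF uses isn’t shown here; fanning each oscillator symmetrically around the base value is just one plausible interpretation):

    /* Minimal sketch: fan one "spread" slider across N oscillators.
       The symmetric fan and semitone scaling are assumptions, not ZSAMIF's code. */
    #include <math.h>

    #define MAX_OSC 32

    typedef struct {
        float pitch;      /* Hz */
        float lfo_rate;   /* pitch-LFO speed, Hz */
    } Osc;

    void apply_spread(Osc osc[], int n, float base_pitch, float base_lfo_rate,
                      float detune_semitones, float lfo_rate_spread)
    {
        for (int i = 0; i < n; ++i) {
            /* t runs from -1 to +1 across the active oscillators */
            float t = (n > 1) ? 2.0f * i / (n - 1) - 1.0f : 0.0f;
            osc[i].pitch    = base_pitch * powf(2.0f, t * detune_semitones / 12.0f);
            osc[i].lfo_rate = base_lfo_rate * (1.0f + t * lfo_rate_spread);
        }
    }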

I also added the option for sequential ring modulation (where each oscillator ring modulates the previous). This is… weird. Ring modulation tends to drive a signal towards 0 because it’s multiplicative, so having more than 3 oscillators is usually a bad idea. The square wave is an exception, since it’s never 0, and indeed each multiplication by 1 or -1 still leaves the signal at 1 or -1. One of the two “pwm” square waves keeps alternating between 1 and -1 and just shrinks the pulse (the “naive” square wave PWM). The problem is this creates a DC bias (since it’s more often 1 than -1), and when you overlay dozens of oscillators, the DC bias creeps up to the point where it can hit the limit of the audio range. This is the source of the distortion/cut-out about halfway through the video. However, while this mode is bad for regular oscillators (so I planned to fix the DC offset by making it vary between, say, 0.25 and -1 when it’s at 80% high), it’s awesome for ring modulation since the signal loses no energy at all. The ringmod PWM turns out to be one of the most interesting sounds I’ve discovered while doing this.
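Here’s a small C sketch of the pieces just described: the naive PWM square, a DC-compensated variant (scaling the high level so the waveform averages to zero, which matches the 0.25-at-80% figure above), and the sequential ring-mod chain. It’s illustrative, not ZSAMIF’s actual code.

    /* phase in [0,1); width in (0,1) is the fraction of the cycle spent high */
    #include <stddef.h>

    static float square_naive(float phase, float width)
    {
        return (phase < width) ? 1.0f : -1.0f;   /* biased: more often +1 than -1 */
    }

    /* scale the high level so the waveform averages to zero:
       width*h - (1-width) = 0  =>  h = (1-width)/width
       e.g. width = 0.8 gives h = 0.25 */
    static float square_dc_free(float phase, float width)
    {
        float high = (1.0f - width) / width;
        return (phase < width) ? high : -1.0f;
    }

    /* sequential ring modulation: each oscillator multiplies the previous one */
    static float ringmod_chain(const float sample[], size_t n)
    {
        float out = sample[0];
        for (size_t i = 1; i < n; ++i)
            out *= sample[i];
        return out;
    }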

There are many, many more controls.

From an implementation standpoint, ZSAMIF doesn’t use any of that clever “minBLEP” stuff that is needed to get high-quality anti-aliased digital waveforms. I just generate with 4x oversampling and filter it down. This is because minBLEP-style stuff isn’t compatible with the arbitrary-waveform “PWM”-ish stuff I’m doing (maybe–it’s definitely not for sine waves). Plus it would have been a bunch of hard work and I just wanted to bash it out. If you put in high enough overtones/octaves you can definitely hear the aliasing when using the detune to make the pitches move; presumably it’ll be even more obvious in those extreme cases once I put in a monophonic mode with portamento. Perhaps someday I’ll revisit the possibility of a minBLEP-ish implementation (I have one in another directory).
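For what it’s worth, the shape of the generate-at-4x-and-filter-down approach looks something like this; the actual downsampling filter isn’t specified above, so the 4-sample average below is just a placeholder for a proper lowpass.

    #define OVERSAMPLE 4

    typedef float (*wave_fn)(float phase, float width);

    /* render n output samples; phase_inc is cycles per *output* sample */
    void render_oversampled(wave_fn wave, float width,
                            float *phase, float phase_inc,
                            float *out, int n)
    {
        for (int i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < OVERSAMPLE; ++k) {
                acc += wave(*phase, width);
                *phase += phase_inc / OVERSAMPLE;
                if (*phase >= 1.0f) *phase -= 1.0f;
            }
            out[i] = acc / OVERSAMPLE;   /* stand-in for a real downsampling filter */
        }
    }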

sGAR feature list and initial programming

This is the raw list of pages from the design document as it currently stands:

1. Mixer
2. Track Editor
3. Track Pitch
4. Pattern Editor
5. Instrument Selector
6. Note Performer
7. Pattern Performer
8. History Rewinder
11. Track EQ
12. Song Loop Layout
13. Instrument Paint
14. Track Ordering
15. Set Tempo
16. Analog Synth
17. Step Filter
18. Guitar Looper

Each of these has a sample layout and a lot of text explaining the buttons and their functionality. When I first started programming sGAR, the list was probably 75% as long, and it’ll be longer when I’m “done”, if there’s any such thing (since I’ll probably just always put random whatevers in this).

I wrote a little OpenGL grid display and made it so you could click on buttons with the mouse. To support “multitouch”, I made it so you could click on the grid with the left mouse button, hold the button down, then click somewhere else on the grid with the right mouse button, and as long as you kept holding the left button down, the corresponding button stayed down.

I built the app in a “framework” that was just a copy of the source to an old unfinished videogame that had a soft-synth and some drum samples built-in for doing a simple music track in the game.

So I started programming the grid-based note editor, and I wired it up to trigger the existing bass sound from that videogame soft synth. Wanting to make it sound more interesting, I went on musicdsp.org and built a graphic equalizer and hooked it up in sGAR. Then I went back to musicdsp and pulled the formant filter code down from there. It’s kind of dumb — just five vowel sounds — but hey, it’s code and it just works without me thinking about it. I created a little “random formant start/end sound” screen in the mixer and voila. A lot of craziness even if I was restricted to a dumb synth bass sound.

sGAR design documentation

sGAR began life as a text file I wrote documenting the design. I began writing it before it was 100% locked down that I could get a 256 soon, but I was impatient to start.

I find it is very useful to document the design on paper, in this case with a little 16×16 grid of characters, because it’s a lot faster to design on paper than it is in code. So I tend to write a fairly thorough design document, trying to cover every aspect of the thing, although it’s not meant for end-users so it’s terse, albeit exact.

For example, here’s one of the “pages” as originally designed, which is basically an mlr ripoff (without me ever having used mlr, so randomly different, and also designed to integrate in a complex way with two of the other pages, so missing things like groups). Note that some of the user interface ideas written down are contradictory, and some turn out to be bad ideas.

I also wasn’t sure how much PWM I was going to be able to get away with to create intermediate intensities (as it turns out, essentially none: I’m using one level, but it’s very flickery), but I quickly found I couldn’t really do effective designs without using some PWM, so I assumed I’d be able to use one level. Thus “0” is on, “.” is off, and “+” is 50% brightness.

=== page Pattern Editor ===

0 0 + 0 0 + 0 + 0 + 0 0 0 + 0 0
0 + 0 0 + + 0 + 0 + + 0 0 + 0 +
0 . 0 . 0 . 0 . 0 . 0 0 . 0 0 .
0 . . . . . . . . . . . 0 0 . .
+ + + + + + + + + + + + + + + +
0 0 + + 0 0 + 0 + 0 0 + + + + +
. . . 0 . . . . . . . . 0 . . .
. . . . . . . . . . . . . . . .
+ + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + +
. . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .
+ + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + +
. . . . . . . . . . . . . . . .
0 0 . + + . + + + . + + + . 0 +

This page effectively shows all the note editors
for all tracks simultaneously, by collapsing out
pitch information.

In the sequencer, there are up to 14 tracks, each with up
to 16 “time events”. Normally each time event lasts for an
eighth note, but see discussion below.

Each row corresponds to one track, each column to one
time event. A given coordinate is lit up if there is “data”
at that event time. (A note in the note event, or above-threshold
audio in the beat, not counting the first N milliseconds.)

The 15th row selects whether each event lasts for an eighth or a quarter
(or, equivalently, a sixteenth or an eighth). This allows time signatures
longer than 16 beats at the cost of only being able to have events
occur on 16 of the beats. So for example 5+5+5+2 could be programmed
as ‘O… …. …. ….’ with the first event lasting for a quarter and every
event after that an eighth. Swung time can be programmed by
alternating long and short beats.

Pressing a button causes that track to restart
at that point in time. Framing (both-handing) a
horizontal or box causes that track (or tracks)
to loop that region of time. If the dragged box
includes beat 1, it will synchronize with the
current looping rather than jump in time.
The time signature can be changed by dragging
out a box covering all the tracks. (This is actually
detected as a special case, and the pattern loop
controls are altered. Maybe should make an
explicit control for it.)

Pressing two buttons on the same track at the
same time causes the track to play from both
places at the same time.

LMODE is a shift key allowing specifying the length
multiplier for each track. Press LMODE then MMODE to
lock LMODE one-handed.

. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. + + . + . 0 + + + + + + + . .
. . . . . . . . . . . . . . . .
0 0 . + + . + + + . + + + . 0 +

Each track can take up
1 or 2 beats, or 0.5 measures, or 1 measure, or 2 measures, 3 measures,
up to 8 measures.

RMODE allows button presses to be recorded,
so a given track will play out of sequence according
to the button hits.

At this writing this page is entirely unimplemented; but that’s kind of the point, even the pages that are implemented started life like this.

sGAR overview

sGAR is a collection of music creation tools which use a monome 256 as a controller.

I’ve always toyed with the idea of writing a guitar effects processor or an audio music editor using a PC. Fortunately Reaper came along and did a good enough job of the latter that I’m not tempted anymore.

Lately there have been a few Flash-based grid music-creation tools (typically in a pentatonic scale). The mouse isn’t really a good interface for it, but it also reminded me of the great experience of programming the Roland TR-707, which is where I cut my teeth learning what drum parts sounded like, and remains to this day how I visualize drum patterns.

Cut to discovering the monome: this was just perfect. Grid-based editing, but you can hit multiple buttons at once. Actual buttons, not sloppy multitouch (although this is a clever design for using multitouch that I think would have worked really well; shame about the touch surface company being acquired by Amazon and dropping the product). Programmable. Open source.

It sounded like just the thing I could use for all my music software needs, even ones I didn’t know I had.

I bided my time, hoping to get my hands on a monome 256 (a 16×16 grid), but monomes are built in very small quantities (since they’re handmade by a single couple). I bid on the 512s when they auctioned off 4 of them, but it was just a tad out of my price range.

Fortunately soon after that I was able to swing a deal for a 256, and there was no looking back.

joaf hiatus – RAM

So, I haven’t updated this blog in nearly a year, so I guess Joaf is officially on hiatus if not dead.

The issue I ran into was the problem of accessing RAM from an FPGA. To complete my paper design, I needed to get a ballpark idea of what kind of latency I could get reading from RAM.

The problem is that it depends on what RAM I chose, which was a free variable. Plus I couldn’t be 100% sure just reading data sheets, because interacting with RAM is kind of confusing. And if I was wrong by a factor of 2 in the wrong direction, I’d end up with something which could only run at half of a reasonable frame rate.

The reason the design is latency-limited, not bandwidth-limited, is because it didn’t seem feasible to use an on-chip cache for the graphics. Because of the row-at-a-time design, you could get locality on a single scanline, but by the time you got to the same sprite on the next scanline, any reasonably-sized cache would be flushed. And if you rotated a sprite you’d get basically no locality at all. So a design where it fetched every pixel explicitly from RAM seemed a lot more predictable than trying to rely on cache; I could say rotation was free.

But I couldn’t figure it out on paper, and so it seemed like the only way to figure out what was feasible and plausible was to actually implement a DRAM interface, but that stuff is pretty far down towards the deep end of FPGA development… it seems to be the really hard stuff, and not where I wanted to start when it came to actually programming the FPGA. (I wouldn’t have to do the final version, just a prototype, but even so…)

All of which added up to an impassable barrier to further development: I just couldn’t collect the data I needed to move forward with the paper-design-so-far.

Unless I can find a way to figure that out, it appears the project is over.

Next I’ll post about the new, non-Joaf project I started a couple weeks ago.

sprite list offsetting

Last time I described how sprites are represented as “spans” and how they can be sorted together to be used later with painter’s algorithm.

The input to that step was a set of “sprite lists” which Whiz will sort together; each sprite list is itself sorted (top-to-bottom, back-to-front). Whiz keeps pointers to each sprite list in block RAM and uses those pointers to fetch one sprite at a time from DRAM. I’ll talk about how to get those sprite lists into DRAM in the next post. For this brief post I want to put some finishing touches on sprite lists.

Each sprite descriptor loaded from DRAM consists of 2 or 3 64-bit words. There are 8 different descriptor formats; this effectively compresses the descriptor which might take around 4 64-bit words if it could express every possible aspect in one format. (I don’t know if that complexity is worth the memory/bandwidth savings; because MC exists it doesn’t affect Poet-side performance.)

Somewhere in each sprite descriptor is an x coordinate, a y coordinate, and a z coordinate. When we load each descriptor from DRAM into Whiz, we expand it out to the full 4×64 description. While we’re doing that expansion, we can add a global sprite-list offset to each of x, y, and z. As long as none of those values wrap around, the sorting of the sprite list against itself will be unchanged (since the same value is added to every sprite). That means the post-addition list is still sorted properly, so it will still interact correctly with the rest of the system.
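In software terms, the expand-and-offset step looks roughly like this; the field layout is made up (the real descriptors are 2-3 64-bit words in 8 formats), and the point is just that the per-list offset is added once, during expansion, so the relative order within the list is preserved as long as nothing wraps.

    #include <stdint.h>

    typedef struct {            /* hypothetical expanded form of a descriptor */
        int16_t x, y, z;
        /* ...everything else the renderer needs... */
    } ExpandedSprite;

    typedef struct { int16_t dx, dy, dz; } ListOffset;   /* programmed per sprite list */

    void expand_with_offset(ExpandedSprite *dst, const ExpandedSprite *src,
                            const ListOffset *off)
    {
        *dst   = *src;
        dst->x = (int16_t)(src->x + off->dx);   /* must not wrap around */
        dst->y = (int16_t)(src->y + off->dy);
        dst->z = (int16_t)(src->z + off->dz);
    }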

Note however that we don’t allow scaling or rotating the entire sprite list. We don’t allow sprite rotation in general (a rotated sprite is no longer rectangular–we require the user to describe a rotated sprite as a series of separate 1-line sprites). Scaling down might cause objects starting on adjacent source scanlines to draw starting on the same scanline, so allowing that would require some resorting of the list. Scaling up might be possible, but it’s a direction that just exposes imprecision (e.g. the lack of subpixel coordinates on the source data) and the other restrictions make it feel non-orthogonal. And it seems wasteful to reserve FPGA multipliers for this possibly rarely used feature.

But just addition is pretty powerful. There’s a whole complicated system of sprite maps to allow you to do big blocks of things that move together for backgrounds and e.g. parallax scrolling. However, sprite maps are doubly-indirect per pixel (or, possibly, they might be written to by two nested loops, although that’s painful if scaled and implausible if rotated). Sprite maps still require pixel processing in open areas (even if we say that tile id 0 is always blank and can be early-outted, once we’re pipelining pixel operations there’s no meaningful way to early out–the only way to get this savings is if it were done with nested loops as mentioned earlier–but that way also is worse with pipelining if it requires pipeline flushes).

Sprite-list offsetting offers another way to create a big moveable mass of stuff, by building the background as a sprite list. There are a few differences: it’s maybe faster because it’s not doubly-indirect. Sprite elements don’t have to be square. Sprite elements can overlap each other. (E.g. a complicated boss monster. E.g. a big map of square tiles, except each tile extends off the edges a little with alpha blended edges.) Things could look like Braid.

To make this work we need to handle offsetting a sprite list off the edge of the screen. We explicitly don’t handle a sprite list being the entire world; it will always have to be a screen-sized chunk of the world. This gives us fewer sprites to cull and requires less precision for the x & y coordinates.

Let’s pretend the eventual application of this architecture is a screen that’s 1280×720. (We won’t handle this day one, but we foresee FPGAs becoming fast enough that that resolution is possible, so let’s see what the math bears out.)

The x and y coordinates for a single screen need to range from 0..1280 and 0..720. If we restrict the largest representable x value to 2048, then we would have to limit the offsetting to 2048-1280. In fact, we don’t want to have to enforce an assertion like ‘offset < 800’, because we’re in hardware, so instead we enforce an assertion like ‘-1024 <= offset < 1024’, which enforces some limited number of bits.

For 1280×720, we could limit the offset to -1024,1024 in x, and -512,512 in y. This would force the use of smaller-than-screen-sized sprite lists (if they’re scrollable). Or if we want to allow screen-sized sections, we need to use -2048,2048 in x and -1024,1024 in y.

We then have to accurately store these numbers (without wraparound), so we’d need to store (0-2048, 0-1024) and (1280+2048, 720+1024), which means we’d need to store x as -4096,4096 and y as -2048,2048. This requires 2 extra bits for y and 2 extra bits for x. (In fact, it’s easy to show that support-a-screen-sized-section will always require 2 extra bits.) A non-screen sized section needs to allow a max of (1280+1024, 720+512) which turns out to still need the same number of bits of post-offset x and y. (This might be provably true, I’m not sure.)
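Mechanizing those ranges (using exactly the bounds above) confirms the bit counts:

    #include <stdio.h>

    /* smallest signed width that covers [lo, hi] */
    static int bits_for_signed_range(int lo, int hi)
    {
        int bits = 1;
        while (-(1 << (bits - 1)) > lo || (1 << (bits - 1)) - 1 < hi)
            ++bits;
        return bits;
    }

    int main(void)
    {
        /* screen-sized sections at 1280x720 with +/-2048, +/-1024 offsets */
        printf("x: %d bits\n", bits_for_signed_range(0 - 2048, 1280 + 2048));  /* 13 */
        printf("y: %d bits\n", bits_for_signed_range(0 - 1024,  720 + 1024));  /* 12 */
        /* non-screen-sized sections with +/-1024, +/-512 offsets */
        printf("x: %d bits\n", bits_for_signed_range(0 - 1024, 1280 + 1024));  /* 13 */
        printf("y: %d bits\n", bits_for_signed_range(0 -  512,  720 +  512));  /* 12 */
        return 0;
    }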

Oh but wait! I forgot to mention that the original x & y values need to be able to be negative (because it’s too annoying to force Poet to do the clipping for them, given that we have to have clipping hardware for the sprite list case). And they have the same bit representation issue (it would be weird for the hardware to blow up if you go outside some legal range). So that suggests initial x should be able to be [-2048,2048) in the first place. Fortunately this still doesn’t affect the output range (again, probably provably).

Now, if we suppose our actual resolution is 320×240, then this works out to sprite values being: [-512,512) and [-256,256) and offsets being [-512,512) and [-256,256). Note that we’d like to encourage using smaller-than-screen-sized sections, because they’ll require less processing of clipped sprites; so we could shrink the allowed offsets to half the size to force that, but then we’d be wasting output precision (which has to round up to the next power of two). So instead we expose all of the precision and just advise users to use smaller sections. They’ll be able to see the performance difference for themselves anyway.

That’s about all there is to that. Each sprite list will be programmed as a pointer to the sprite list, and the x/y/z offsets. The x/y/z offsets won’t be stored in the sprite list itself (e.g. in a header of it), because (a) there’s no header currently, and (b) because this allows you to instantiate multiple copies of it on a single screen.

Note also that by design a sprite list like this can be in ROM; that is, there’s no requirement for Poet to construct it on the fly (like there is if you have no x/y/z offsetting). Thus you can easily imagine cases for using it trivially. Use it to directly represent static scrolling backgrounds. Have 30 variant ROM sprite lists to represent an animating boss character, and pick the right one and position it each frame. (This is why I assume the 2/3 64-bit word “compressed” format is worth it; to save on ROM space and not require full-fledged decompression.)

One limitation with this approach: how do you do non-full-screen-sized scrolling? This is easy with tile maps; the size of the tile map sprite rectangle is the size of the scrolling screen. This method inherently relies on the screen clipping to determine where it renders. Unless we add clipping to some other rectangle, it can’t really be used for anything else. Well, you could draw a background in a smaller rectangle by giving it the smallest z value, and then drawing some other background over it wherever you don’t want it; but there’s no way to e.g. do a split screen with two separate layers.

When I actually write the clipping hardware I’ll look at how hard it would be to do arbitrary rectangle clipping instead of screen clipping, but I don’t see any other easier approaches.

span sorting

Last time I wrapped up a “tangent” explaining in detail how one pixel is computed and how the pixels are computed in parallel using pipelining.

To return from the tangent, then, I’m going to pick off where I left off in describing how data flows from the CPU to the display, which I’ve been doing by walking backwards through the process.

As described so far, the screen is drawn as a series of rendered line buffers; to draw a single line buffer, Whiz processes a list of “active spans”.

Each span actually represents a sprite rectangle. While Whiz draws into a given line buffer, it creates a temporary copy of the span and updates certain data that needs to change “horizontally”. Additionally, Whiz updates the span “vertically” and adds it to the new list of active spans for the next scanline.

So really each span descriptor isn’t a span descriptor, but a “bottom of a sprite” descriptor–each one starts describing a full sprite, but then as the sprite is displayed it is updated to only characterize the parts of the sprite that are left. For various reasons I’m going to continue to call this a “span”, anyway. (Basically, when it’s a sprite, it has an explicit Y value; but once it starts being displayed, it no longer stores an explicit Y value, and I stop considering it a sprite.)

The list of active spans contains one entry for every sprite which intersects a given scanline. When the last scanline of a sprite is reached, the span is deleted (or rather, isn’t added to the list for the next scanline). When the first scanline of a sprite is reached, the sprite must be added.

The list is drawn in order, so the list must be sorted in “painter’s order” (back to front).

Any given pair of sprites will always draw in the same painter’s order. So as long as we keep the active list in the same order as we updated it from one line to the next, we don’t need to re-sort the sprites.

What this means is that we therefore only have one actual problem to deal with: adding new sprites into the active list in sorted order.

Suppose we have a list of sprites we want to draw in a given frame. Because of the way we process a line at a time, we’ll need to add sprites that appear higher-up on the screen to the active span list before adding sprites that appear lower-down on the screen — the sprites that appear higher will need to start getting drawn into the line buffers sooner.

The upshot of this is that we need to add sprites in the following order:

  • Sprites are added in order of the first scanline they draw to
  • Sprites on the same scanline are added in z (depth) order
  • Sprites added to the active span list must be positioned in correct z (depth) order relative to the existing active spans

The job of adding sprites to the active span list is entirely Whiz’s job — it will be processing a list of sprites and updating the active list. But the task of adding them in the right order can’t be solved by Whiz — we can’t just hand Whiz a pile of sprites and expect it to do the right thing. So instead we can just require the sprite list created by Poet to be in the right order:

In a sprite list provided by the CPU, sprites must be sorted in order first of their starting scanline, then in their z-order.

The sprites must still have an explicit z-ordering value; Whiz uses this to merge the new sprites in with the active list by comparing their z values.

However, forcing Poet to fully sort all of the sprites all the time isn’t particularly ideal. That sort is expensive, and Whiz has some free cycles to be able to do something smarter. So Joaf supports multiple sprite lists (each of which must be sorted as above). In a given screen, you might use 2 “static” sprite lists for separate scrolling backgrounds, one sprite list for a boss monster made out of multiple pieces, and one sprite list for all the general moving objects. In general, you can avoid sorting entirely by intentionally designing the display in ways that allow the multiple sprite list usage, and being able to construct each sprite list explicitly without sorting.

Note that there is a limit to the number of static sprite lists due to their use of block RAM, and using more sprite lists increases the cost per sprite of adding them to the active span list, as described below.

I can now characterize Whiz’s algorithm for processing the active span list in each scanline, given multiple input sprite lists.

Process A:

  1. Wait until pixel-drawing unit is ready for a new span; then dequeue the head of the active span list; send a copy to the pixel-drawing unit
  2. Update the entry to the appropriate value for the next scanline; store to the “next active buffer”

Process B:

  1. Scan all the cached next-entry for each sprite list; choose the one that sorts first (smallest Y, then Z, then X)
  2. Copy the sort-first one into the “sorted next buffer”
  3. Load the next item for that sprite list into the cached-next-entry for that sprite list

Process C:

  1. If the Y value of the “sorted next buffer” is that of the next scanline, then if the comparison of (Z,X) for the “sorted next buffer” is less than that of “next active buffer”, do step 2 else do step 3
  2. Copy “sorted next buffer” into the active list queue; do Process B
  3. Copy “next active buffer” into the active list queue; do Process A

In practice process A is a little subtler; we always begin drawing the next span as soon as we can, and if it’s the last scanline of a sprite it won’t need to output an entry into the next active list anyway. So that should run ASAP without waiting until the next-active is ready to be output. So there’s actually another buffer there.
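In software terms, processes A/B/C amount to a plain two-way merge between the carried-over active list (already in (Z,X) order) and the new sprites starting on this scanline (already sorted by (Y,Z,X)). The sketch below ignores the block-RAM mechanics, the buffering subtleties just mentioned, and the removal of spans whose last scanline has passed; the field names are illustrative.

    #include <stdint.h>

    typedef struct { int16_t y, z, x; /* ...render state... */ } Span;

    static int before_zx(const Span *a, const Span *b)   /* nonzero if a sorts first */
    {
        return (a->z != b->z) ? (a->z < b->z) : (a->x < b->x);
    }

    /* prev: spans carried over from the last scanline, already in (Z,X) order
       incoming: new sprites sorted by (Y,Z,X); only those with y == scanline are taken
       out: the active list for this scanline; returns its length */
    int merge_active(const Span *prev, int nprev,
                     const Span *incoming, int nincoming, int scanline,
                     Span *out)
    {
        int i = 0, j = 0, n = 0;
        while (i < nprev || (j < nincoming && incoming[j].y == scanline)) {
            int new_avail = (j < nincoming && incoming[j].y == scanline);
            int take_new  = new_avail && (i >= nprev || before_zx(&incoming[j], &prev[i]));
            out[n++] = take_new ? incoming[j++] : prev[i++];
        }
        return n;
    }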

There’s a limit to how many active spans you can have, based on the block RAM usage for storing the active spans. (I haven’t decided for certain what the limit is, since it can trivially use multiple block RAMs to store different parts of it.) A Doom-style engine draws each column of a wall as a separate span, so I probably want to support ~500 spans at once for that case (320 vertical spans plus overdraw).

The cost of merging a new sprite into the active list is very small, because we use a simple merge between the two lists. The actual cost is limited by the cost of computing the next item to compare. For checking the existing active spans, this can get blocked by the cost of actually rendering the thing–we can’t compute a new one until we finish rendering the previous one. A one-item buffer here mitigates the worst-case cost (e.g. a span that covers the entire scanline, e.g. a scrolling background), but only somewhat. Then again, if span-rendering were so slow that it slowed down active-span computation to the point it didn’t complete in the time allotted for one scanline, the span-rendering itself would probably have that problem too.

The other side of this is more problematic. There are two limiting factors on the cost of computing the next new sprite to merge. One is the cost of loading the next sprite list entry from memory. We cache the first pending entry from each list in block RAM, so we only pay this load cost when we serve a sprite onto the active list. The worst case for this would be when we start a whole list of new sprites, all from the same sprite list. E.g. this will happen at the top of the screen. The second cost is the cost of scanning. With only one sprite list, we always compare with that sprite. With multiple sprite lists, we have to compare all the sprite list entries to each other and pull out the first one. This requires O(N) scanning time for N sprite lists, which will force N to stay small.

Overall, this shouldn’t be too much of a problem. If we run at 320×240 and support say 2x overdraw, then our pixel pipeline has to process 640 pixels per scanline, which, including overhead per span, probably requires around 800-1000 cycles per scanline. Alternately, 40Mhz (an arbitrary target) gives us about 2800 cycles per scanline. The triple-buffering of line buffers lets us suck up the cost of the occasional expensive scanline (like having a bunch of sprites start on the same scanline), and if in typical use we started, say, 20 sprites per scanline, even if we had to scan 10 lists this would still only require 200 cycles per scanline to sort. So overall it seems pretty good, but obviously the numbers could eventually get too large.

It might be possible to do a slightly improved strategy: keep track of which sprite lists have sprites starting on a given scanline, and only scan those sprite lists. Suppose we had 30 sprite lists (to avoid doing any CPU sorting at all), but we have 100 sprites starting on the first scanline. This would require 3000 cycles to process the original way. If we instead know that only 3 of the lists actually have any sprites on that scanline, then we can reduce the cost down to 300 cycles instead of 3000.

To make the sprite lists more useful, they can be “repositioned” (as if they were sprites themselves), which I’ll talk about next time.

pixel pipelining

Here’s the pseudocode for the beginning of the pixel processing from last time:

  1. ; find the bits of S&T that index into the sprite map
    mapx := span.S[19..12] & map.S_mask
    mapy := span.T[19..12] & map.T_mask

  2. ; first half of computing the address of the sprite map index
    address := map.base + mapx
    address_row := (mapy << map.yshift)

  3. ; compute the sprite map index address and initiate a read of memory
    MEMORY.address := address + address_row
    MEMORY.read_8bits = 1

  4. [wait for read to complete; latch result into 'readresult']
  5. ; compute the address of the sprite data and the current pixel
    mapx := span.S[11..6] >> map.S_shift
    mapy := span.T[11..6] >> map.T_shift
    address_row := readresult << map.tile_shift

  6. ; continue above
    address := map.image + address_row
    mapx := mapx << map.pixel_shift
    address_row := mapy << map.pixel_rowshift

  7. ; continue above
    address := address + (mapx | address_row)

Within each step, all of the operations occur in parallel. By chance none of them happen to read and write the same register in a single step, so it’s non-obvious here, but normally a single step could have two statements like ‘x := y; y := x;’, and the point is that x gets the y value from the previous clock, and y gets the x value from the previous clock.

Now, let’s take the above fragment of state machine logic and pipeline it. To pipeline it, each state in the above sequence becomes its own little section of the hardware with its own inputs and outputs. Each pipeline step needs to know about everything about the current pixel. The stuff in the current pixel is stored in the structure named ‘span’ in the pseudocode, so that stuff can no longer be “global”–each pipeline stage must access a different pixel’s data.

We’re going to assume all the pixels in the pipeline come from the same span; that means all the references to the ‘map’ structure above stay the same. Let’s look at the first couple steps, and rewrite things to ignore the memory access timing problem (assume we’re getting the data from a fast cache and ignore what happens if the cache misses):

  1. ; find the bits of S&T that index into the sprite map
    mapx := span.S[19..12] & map.S_mask
    mapy := span.T[19..12] & map.T_mask

  2. ; first half of computing the address of the sprite map index
    address := map.base + mapx
    address_row := (mapy << map.yshift)

  3. ; compute the sprite map index address and initiate a read of memory
    map_cache.address := address + address_row
    map_cache.read_8bits = 1

  4. ; compute the address of the sprite data and the current pixel
    newmapx := span.S???[11..6] >> map.S_shift
    newmapy := span.T???[11..6] >> map.T_shift
    tile_address := map_cache.result << map.tile_shift

Much of this is exactly as it was before; however, the numbers are no longer meaningful. Before, the numbers indicated separate stages controlled by a state machine; now every one of those operations is occurring at the same time, and the numbers are just there to show the correspondence, and to show how long it takes the data to flow down to a given stage.

The main change in the code is that we don’t reuse any registers; we’re just going to make new registers at every stage. Remember how I said each pipeline stage is its own little box with inputs and outputs? The registers actually live at the boundary between the pipeline stages–the output of stage N becomes the input to stage N+1.

(You can’t reuse registers because everything happens at the same time; if two different stages wrote to the same register (trying to reuse it), they’d both be writing to the same register simultaneously, which you can’t do.)

Technically, as described so far, you just need an array of simple flip-flops, i.e. a buffer, to serve as these input/outputs, not a register. A register adds the ability to keep its old value instead of updating. In normal execution, we never need this ability, but if the pipeline needs to stall waiting for memory, then we need to freeze the state of the pipeline before the stage that’s blocked — which means latching the state in the registers.

But what about those span.S??? things? The problem here is that if we’re pipelined, then by the time the data flows to stage 4, the S and T variables will have been incremented 3 times (since it takes 3 cycles to flow there). At stage 4, we need what are now old S & T values. The audio way to think of this is that we need a delay line. The simple way to think of this is that each pipeline stage must have all the information for the pixel available, and if that info varies per pixel, it needs to be a pipeline input–which means it needs to be output by the previous stage’s registers.

So naively we do this:

  1. ; find the bits of S&T that index into the sprite map
    mapx := span.S[19..12] & map.S_mask
    mapy := span.T[19..12] & map.T_mask
    S1 := span.S
    T1 := span.T

  2. ; first half of computing the address of the sprite map index
    address := map.base + mapx
    address_row := (mapy << map.yshift)
    S2 := S1
    T2 := T1

  3. ; compute the sprite map index address and initiate a read of memory
    map_cache.address := address + address_row
    map_cache.read_8bits = 1
    S3 := S2
    T3 := T2

  4. ; compute the address of the sprite data and the current pixel
    newmapx := S3[11..6] >> map.S_shift
    newmapy := T3[11..6] >> map.T_shift
    tile_address := map_cache.result << map.tile_shift

Remember how last time I mentioned we wanted to minimize the number of registers and the total register bits in the system, and that’s why I reused variables? That’s still true here. We propagate S & T through the system with far too many bits given that all we’re going to do at the end is extract a few of those bits and pack them together.

So what we should do is extract the needed bits as soon as possible. And I’ll make the bit counts in the registers explicit, since we’re trying to minimize them:

  1. ; find the bits of S&T that index into the sprite map
    mapx[8] := span.S[19..12] & map.S_mask
    mapy[8] := span.T[19..12] & map.T_mask
    S1[6] := span.S[11..6] >> map.S_shift
    T1[6] := span.T[11..6] >> map.T_shift

  2. ; first half of computing the address of the sprite map index
    address[24] := map.base + mapx
    address_row[24] := (mapy << map.yshift)
    pixoff2[12] := (T1 << map.S_count) | S1

  3. ; compute the sprite map index address and initiate a read of memory
    map_cache.address[24] := address + address_row
    pixoff3[12] := pixoff2

  4. ; compute the address of the sprite data and the current pixel
    tile_offset[24] := map_cache.result << map.tile_shift
    pixoff4[12] := pixoff3

Eventually, further down the pipeline, we’ll have to write things out, so another thing we’ll need to propagate down the pipeline is the current pixel’s x location. On the other hand, every x location will be one more than the previous one (or one less if the sprite is reflected horizontally), so it would be more efficient to store the span’s x location into an x register deep in the pipeline, latch it until the pipeline flow reaches that stage, and then just increment or decrement it there.

On the other hand, because we have block RAM available, it’s also possible we could try to store some of our “registers” in block RAM. Specifically, when we compute something (or access it at the beginning) but won’t use it for some number of stages, rather than a cascading series of hand-coded registers (which will use a bunch of FPGA resources), we can store all the data to a circular buffer in block RAM, and just propagate along the appropriate index into the circular buffer from which the relevant data can be loaded.

Block RAM has 1 read and 1 write port, which means that on any given cycle, you can write to it once and read from it once. So the above technique can only be used for data that is only accessed in one phase of the pipeline; the 1 write/cycle is used when initializing the “register” (either loading it from the initial pixel state, or setting it on the fly from some computation), and the 1 read is the read from one phase of the pipeline. It can’t be read from two phases of the pipeline, because the pipeline is pipelined, so all the stages happen at the same time (just accessing different pixels). In fact, each conceptual register must be stored in a separate block RAM, because we can only read/write once per cycle from a given block RAM, so if we tried to store multiple registers in one block RAM they’d end up being accessed from separate stages which would need to index different spots in the circular buffer.

Since the total number of block RAMs is limited, and this would only use a small amount of the storage available in each block RAM (I think they’re 4Kbits, or maybe 18kbits?) thus wasting most of it, I’m going to proceed with the design using explicit (non-block-RAM) cascaded registers. (This is, as far as I know, how real hardware has to implement pipelining.)
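To make the “cascaded registers” idea concrete, here’s a tiny cycle model in C of the first three stages of the sketch above: every register is computed from the previous cycle’s register values, which is what lets all the stages run at the same time. The masks and shifts are hard-coded stand-ins for the map.* parameters, and the caller is assumed to swap cur and next every clock.

    #include <stdint.h>

    typedef struct { uint32_t S, T; } PixelIn;   /* per-pixel input to stage 1 */

    typedef struct {                             /* stage outputs == next stage's inputs */
        uint16_t mapx1, mapy1, S1, T1;           /* after stage 1 */
        uint32_t address2, address_row2;         /* after stage 2 */
        uint16_t pixoff2;
        uint32_t map_cache_address3;             /* after stage 3 */
        uint16_t pixoff3;
    } PipeRegs;

    void clock_pipeline(PipeRegs *next, const PipeRegs *cur, const PixelIn *in,
                        uint32_t map_base)
    {
        /* stage 1: operates on the newest pixel */
        next->mapx1 = (in->S >> 12) & 0xff;      /* stand-in for & map.S_mask */
        next->mapy1 = (in->T >> 12) & 0xff;
        next->S1    = (in->S >> 6) & 0x3f;       /* stand-in for >> map.S_shift */
        next->T1    = (in->T >> 6) & 0x3f;

        /* stage 2: operates on what stage 1 produced last cycle */
        next->address2     = map_base + cur->mapx1;
        next->address_row2 = (uint32_t)cur->mapy1 << 4;   /* stand-in for map.yshift */
        next->pixoff2      = (uint16_t)((cur->T1 << 6) | cur->S1);

        /* stage 3: operates on stage 2's previous output */
        next->map_cache_address3 = cur->address2 + cur->address_row2;
        next->pixoff3            = cur->pixoff2;
    }

In the real design each of those assignments is just a register whose input is wired to the previous stage’s outputs, with a bit of combinational logic in between.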

pixel pipeline

Last time I described the basic abilities of the pixel pipeline in terms of how each pixel from a span is generated one after the other, and the various properties each pixel can have.

This time I want to start drilling down a little and talk about what this looks like “in hardware” (or at least in the FPGA).

The most important thing to understand about hardware is that the magic comes from doing more than one thing at a time. Where traditional programming is inherently sequential, clock-based hardware is inherently parallel — every operation on the chip will happen every clock cycle.

Most hardware algorithms need to use sequential algorithms to accomplish their tasks, so they’re divided into a sequence of steps. A piece of “control logic” can then enable and disable each task on every clock; the upshot is that one small part of the hardware might be set up to have three steps, and to execute them in order, 1, 2, 3, and then repeat starting at 1. Typically such a process would work as follows: step 1 accepts input from the outside and processes it into some temporary values; step 2 operates on those temporary values and produces new temporary values; step 3 operates on those temporary values and then passes them along to some other unit.

These controlled sequences may have variations; they might do different things after step 2 depending on some control inputs from elsewhere in the system, or based on the values of the data. In general a piece of localized, locally controlled sequential logic like this is called a “state machine”; it’s a machine that goes through a sequence of states which determine what it does.

If we look at the pixel pipeline defined last post, we have the following main operations that might need to be done (not counting updating variables for the next pixel, just to simplify things):

  • Use x coordinate to determine where in line buffer to write pixel
  • Use S and T coordinates to determine which sprite map tile to read
  • Read sprite map entry
  • Use sprite map entry to determine which sprite tile to read
  • Use S and T and the sprite tile to determine where to read the pixel from
  • Read pixel
  • Use pixel and palette to find RGBA color in palette data
  • Multiply by RGBA scale color
  • Add RGB additive color
  • Blend to linebuffer address

In converting this to a state machine, we may find some of these can be combined into a single step, and some take multiple steps. The end result might be something like this (this isn’t a real FPGA programming language, just pseudocode):

(“span” is the current-to-this-pixel data from the current span; “map” is the map selected by the current span)

  1. ; find the bits of S&T that index into the sprite map
    mapx := span.S[19..12] & map.S_mask
    mapy := span.T[19..12] & map.T_mask

  2. ; first half of computing the address of the sprite map index
    address := map.base + mapx
    address_row := (mapy << map.yshift)

  3. ; compute the sprite map index address and initiate a read of memory
    MEMORY.address := address + address_row
    MEMORY.read_8bits = 1

  4. [wait for read to complete; latch result into 'readresult']
  5. ; compute the address of the sprite data and the current pixel
    mapx := span.S[11..6] >> map.S_shift
    mapy := span.T[11..6] >> map.T_shift
    address_row := readresult << map.tile_shift

  6. ; continue above
    address := map.image + address_row
    mapx := mapx << map.pixel_shift
    address_row := mapy << map.pixel_rowshift

  7. ; continue above
    address := address + (mapx | address_row)

and so on; the above is only the first third or half of the pixel pipeline. In this case, the state machine is simple; each state executes immediately after the previous one on the list (with pauses waiting for the external memory to return results).

One thing you’ll see in the above pseudocode is that I kind of haphazardly reused variables. Each variable is a “register” in the hardware, and the registers use up hardware real estate, so it makes a lot of sense to reuse a register for different purposes at each stage. In fact, a careful design will presumably try to minimize the total number of registers (and the total number of bits across all registers, since registers don’t all have to be the same size) to deal with this.

However, I didn’t bother for this because this aspect will go out the window soon.

If we ignore the wait for memory, we can look at this and see how long it will take. The processing of each pixel has to flow through each of these steps. It’s possible we might be able to do more work in each step (e.g. add more than two numbers) or have each do less (perhaps we can’t do as many shifts per clock as the given design assumes), and these things will influence the clock rate–if we do a lot of work in step #2, then step #2 may take longer than the other steps to run, but we have to set the clock (which is global across the entire design) to a single rate, i.e. slow down the overall clock. This is why the address calculations are split over multiple cycles.

So, as written, the entire pipeline will probably work out to around 15-20 cycles for a single pixel (of the most slow-to-decode kind of pixel). We can now do a back-of-the-envelope calculation for this. If we can get the FPGA to run at 50 Mhz, and we can process about 3 pixels in 50 cycles, then we can process 3 million pixels per second. We know we want to run at 60 frames-per-second, so that gives us about 50,000 pixels per frame. This is a problem–it takes too much time, too many cycles, to process each pixel. Now, if you use simpler pixels, it can be faster, but this will significantly limit how many complex pixels you can use.

A 320×200 screen is 64,000 pixels, and to make a reasonable videogame we have to draw more pixels than there are screen pixels–we’ll fill the line buffer with a background, and then draw sprites on top of that. In the most extreme case, we might like to have something like 3 layers of parallax scrolling, which amounts to needing to draw something like 3.25 pixels per screen pixel. At 320×200 that would be 208,000 pixels, which gives us only 4 cycles per pixel at the proposed clock rate (and who knows if the rate is achievable). The use of sprite maps is the single biggest thing using up cycles, and the use of sprite maps is pretty much mandatory for stuff like parallax scrolling. So saying “use mostly simpler sprite types” doesn’t help.
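Here’s that budget arithmetic in one place, for anyone who wants to twiddle the assumptions:

    #include <stdio.h>

    int main(void)
    {
        double clock_hz         = 50e6;          /* hoped-for FPGA clock */
        double cycles_per_pixel = 50.0 / 3.0;    /* ~15-20 cycles, unpipelined */
        double fps              = 60.0;
        double screen_pixels    = 320.0 * 200.0;
        double overdraw         = 3.25;          /* 3 parallax layers plus sprites */

        printf("unpipelined capacity: %.0f pixels/frame\n",
               clock_hz / cycles_per_pixel / fps);              /* ~50,000 */
        printf("needed:               %.0f pixels/frame\n",
               screen_pixels * overdraw);                       /* 208,000 */
        printf("cycle budget:         %.1f cycles/pixel\n",
               clock_hz / fps / (screen_pixels * overdraw));    /* ~4 */
        return 0;
    }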

But remember what I said before: the magic comes from doing more than one thing at a time. The trick is to “pipeline” the above pixel processing so it’s like an assembly line; we can make it so each pixel still goes through 15 steps of processing, and thus takes 15 cycles from start to finish, but once it’s out of the first step of processing another pixel enters that step, and so on down the assembly line, so in the end the limit is how fast the conveyor belt is moving (the FPGA clock speed), not how many stations there are along the conveyor belt (the number of clock cycles needed to process each pixel).

One of the neat things about pipelining is that, at least for simple things, the hardware description of a pipelined process isn’t much different than the non-pipelined one. Next time I’ll take the pseudocode above and alter it into pipelined form.

pixel processing

Last time I described the way that a final image can be sensibly created by rendering a single scanline at a time.

To date I’ve been slowly working backwards from the final output signal, showing how each process produces a given output from the previous input. The last article described a process with the input a “list of depth-sorted active spans”, so that’s the next thing I need to show where it comes from to continue working backwards to what the CPU actually gives to the GPU.

However, I’m going to break off for a second and drill down into the span rendering. Last time I focused on how sorting works and how the span buffers get managed, but I didn’t really talk in detail where the individual pixels come from.

Each active span carries two kinds of data: horizontal data (data for rendering a span), and vertical data (data for updating a span for how it should appear in the next scanline). For this post, we’re just going to consider the horizontal data, since that’s what drives the span rendering, and hence the actual pixels.

Each active span has an x coordinate and a length; these determine which pixels in the line buffer are written to. Each sprite then has information describing where the sprite data itself is stored (discussed a ways below).

Sprites can be recolored. There can be a multiplicative RGBA value (each value from 0 to 1 multiplies the corresponding pixel color) and an additive RGB value (each value is added to the corresponding pixel value, and then the value is “saturated” to the max color value).

Sprites can be scaled; if they’re scaled, the span data includes a fixed-point dS/dx value, which stores the reciprocal of the scale factor. The renderer maintains an “S” value which is a pixel index into the sprite. For an unscaled sprite, S is incremented after every pixel. For a scaled sprite, Whiz adds dS/dx to S after every pixel. The integer part of S is then used to determine which pixel from the sprite to actually render. If dS/dx is less than 1, some integer S values will be generated more than once, causing some pixels to repeat, so the sprite is scaled larger. If dS/dx is more than 1, some integer S values will be skipped, effectively shrinking the sprite.

Sprites can be reflected horizontally; S is initialized to the last pixel in the sprite scanline, and S is decremented (or dS/dx is subtracted) after each pixel is rendered into the line buffer.

Sprite maps (see below) can be rotated. To do this, another value, T, is introduced. The source data is viewed as a 2-dimensional array of pixels, indexed by S and T. On each span, S is incremented by dS/dx and T is incremented by dT/dx after each pixel. In this case, the span also encodes an explicit starting S value and an explicit starting T value. This causes Whiz to walk “diagonally” through the sprite data. If you imagine a rotated object and a horizontal slice through that object, you’ll see that the slice goes diagonally through the object, so this is exactly what we want. (This actually allows for simultaneous rotation and scaling; rotation without scaling in this scheme requires placing a constraint on the values dS/dx and dT/dx can take on at the same time, but we have no reason to enforce that constraint ourselves since allowing rotation and scaling is great. The only thing we lose is that rotation alone could be represented in less space, but there’s no way to represent it in less space and still have Whiz handle it efficiently, so.)
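Putting the last few paragraphs together, the per-pixel S/T walk for one span looks roughly like the sketch below. The 20.12 fixed-point format and the field names are assumptions, and fetch_and_shade stands in for the sprite-map/palette/recolor steps described next; a horizontally reflected sprite just starts S at the right edge and uses a negative step.

    #include <stdint.h>

    #define FIX_SHIFT 12   /* assumed fixed-point precision */

    typedef struct {
        int32_t s, t;          /* current source coordinate, fixed point */
        int32_t ds_dx, dt_dx;  /* per-pixel steps; dt_dx is 0 unless rotating a sprite map */
        int     x, length;     /* destination pixels in the line buffer */
    } ActiveSpanH;             /* the "horizontal" half of a span */

    void draw_span(ActiveSpanH sp, uint32_t *linebuf,
                   uint32_t (*fetch_and_shade)(int s, int t))
    {
        for (int i = 0; i < sp.length; ++i) {
            int si = sp.s >> FIX_SHIFT;    /* integer part indexes the sprite data */
            int ti = sp.t >> FIX_SHIFT;
            linebuf[sp.x + i] = fetch_and_shade(si, ti);
            sp.s += sp.ds_dx;   /* dS/dx < 1 stretches, > 1 shrinks, negative reflects */
            sp.t += sp.dt_dx;
        }
    }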

Now, back to the sprite image data. In the simplest version, the span stores a pointer to 32-bit RGBA pixels which are the pixels for this span. This is a “direct image”.

Slightly more complicated case: the span stores a pointer to image data that consists of 8-bit palette entries. Whiz provides for 2048 palette entries; any one sprite uses a contiguous subset of 256 of those entries. (Thus the user can divide it into 8 palettes of 256 values, or 64 palettes of 32 values, or whatever.) To accomplish this, the span also stores a palette “base”. The paletted image pixel value is added to the base, and the entire thing is looked-up in the 2048 entry palette. (The span may not store a full 11-bit palette value; we may require base to be a multiple of 2 or 4 to reduce the number of bits necessary to encode the base.) The palette is stored in block RAM on the FPGA, so it’s cheap to access here.

An entirely different case is the “sprite map” case — this is equivalent to “tile mapping” or “character mapping” on other hardware. A sprite map consists of a 2D array of 1-byte values which are “sprite indices”. These sprite indices then index into a “sprite table” to choose a sprite.

To be more precise, each span stores a “sprite map index”. This is an index into an array of “sprite maps” which are stored in block RAM on the FPGA; each sprite map stores a number of values used for image decoding. (Note that this limits the number of unique sprite maps you can have in a single render, which may be limiting. If this data was stored per span it would be much more flexible, but it would take far too much memory.)

Each sprite map has a “tile size” (e.g. 4×4, 8×8, 16×16, 32×32) and a “map size” (e.g. 64×64, 128×128, 256×256). The effective number of pixels in the map is the product of the two on each axis. (But it wraps around outside of that region.) Given the S,T coordinates for a pixel to display, you split the S and T coordinates into the bottom bits (which index within a single tile) and the rest of the bits (which index the map). You concatenate the S & T map bits to create a map index, and concatenate the S & T tile bits to create a tile offset.

There is a “map base” address; you add the map base and the map index and fetch a byte at that address from DRAM. (Actually, you keep the last DRAM fetch for this purpose and check if it’s the same one, so when multiple adjacent pixels come from the same tile, you don’t have to fetch the tile repeatedly.) You take that byte, multiply it by the size of each tile (which is the number of pixels per tile times the size of a single pixel, all of which are powers of two so the multiply here is really a shift), add it to the “tile start” address, add the “tile offset” defined above, and that gives you the address of the pixel. Then you fetch the actual pixel value. (If the pixel value is itself paletted, now you run it through the same palette processing described above.)

Regardless of the above steps, you now have an RGBA pixel. At this point you multiply by the RGBA multiplier for the span, add the RGB addend for the span, and composite the pixel into the appropriate spot in the line buffer, and move on to the next span.
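That final shading step, in sketch form (8-bit channels assumed; this isn’t Whiz’s actual datapath):

    #include <stdint.h>

    typedef struct { uint8_t r, g, b, a; } RGBA8;

    static uint8_t mul8(uint8_t x, uint8_t y)        /* multiply, treating 255 as 1.0 */
    {
        return (uint8_t)(((unsigned)x * y + 127) / 255);
    }

    static uint8_t sat_add(uint8_t x, uint8_t y)     /* add and clamp to 255 */
    {
        unsigned s = (unsigned)x + y;
        return (uint8_t)(s > 255 ? 255 : s);
    }

    RGBA8 shade_pixel(RGBA8 texel, RGBA8 mul, uint8_t add_r, uint8_t add_g, uint8_t add_b)
    {
        RGBA8 out;
        out.r = sat_add(mul8(texel.r, mul.r), add_r);
        out.g = sat_add(mul8(texel.g, mul.g), add_g);
        out.b = sat_add(mul8(texel.b, mul.b), add_b);
        out.a = mul8(texel.a, mul.a);                /* alpha has no additive term */
        return out;
    }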

The above flow can get fairly complex, especially in the tilemap case. To make this efficient the pixel processing is pipelined; I’ll talk about what this means at a later date.

Scanline synthesis

In the last few posts I’ve described the mechanism(s) by which a scanline synthesized by Whiz ends up being displayed to the screen.

Today I’ll start talking about how the synthesis works.

Whiz processes a scanline at a time into a “line buffer”. To do this, Whiz uses “painter’s algorithm” — it draws the scene (or rather, the parts of the scene that intersect a given scanline) in back-to-front order. Objects “in front” of other objects are drawn later, which means they overwrite the further-away objects, which means they obscure them.

Because the FPGA has multipliers available, Whiz can perform “alpha blending” (specifically, the “over” operator). The alpha value specifies a pixel that is partially transparent. If the pixel is transparent, some amount of the background will “show through” the pixel. This is useful for things like water and smoke effects, but it actually can offer a huge boost in visual quality by using it to implement “anti-aliasing”. If the edge pixels of a sprite are made transparent in a certain way, so the background shows through, it will appear visually equivalent to how the pixel would appear if the sprite actually had a more-defined edge within that pixel (e.g. the true shape ought to only actually cover about 2/3rds of a given pixel along the edge of the shape, so the pixel has an alpha value of 2/3rds which produces the right output).
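For reference, the “over” operator in its plain non-premultiplied form looks like this; whether Whiz ends up using premultiplied alpha internally is a separate decision.

    typedef struct { float r, g, b, a; } RGBAf;

    RGBAf blend_over(RGBAf src, RGBAf dst)
    {
        RGBAf out;
        out.r = src.a * src.r + (1.0f - src.a) * dst.r;
        out.g = src.a * src.g + (1.0f - src.a) * dst.g;
        out.b = src.a * src.b + (1.0f - src.a) * dst.b;
        out.a = src.a + (1.0f - src.a) * dst.a;   /* only needed if the buffer keeps alpha */
        return out;
    }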

Alpha-blending requires source data with alpha, so it may not be all that common. Storing the alpha value also takes up extra space. However, for paletted images it’s not a big deal, and since there’s no support for 24-bit images, only 32-bit images, it’s not a big deal in that case either. The only case where it’s significant is for 16-bit images, which go from being ARGB 1555 or RGB 565 to being ARGB 4444, which may be noticeable.

On the other hand, Joaf supports “dynamic recoloring”, in which each sprite can come with an RGBA 8888 value which multiplies each pixel before it’s displayed. In the case of alpha blending, this means that even if no sprite has alpha data, you can use this to fade the sprite in/out smoothly (as the multiplied alpha value goes from 1 to 0, the object fades out).

Anyway, the basic algorithm for Whiz is to traverse a list of “spans” of pixels. Each span is the intersection of a sprite with a scanline. Whiz runs through the list, drawing each span into the line buffer, and then when it’s finished with the list, it’s finished with the scanline, so it can swap/flush the line buffer and start working on the next one.

In online mode, if Whiz doesn’t finish in time, Iris needs the buffer now, so I see two reasonable options. One is that Whiz abandons the buffer and advances to the next one. The problem with this is that since Whiz draws back-to-front, the stuff that wasn’t drawn will be the frontmost sprites, which are probably the most important ones. So having those disappear would really suck.

The second option is that Iris might redisplay the current line buffer (skip advancing it), and then Whiz can finish the one it’s on. This will cause a weird shifting; one scanline will get repeated, and then the next one will be displayed one scanline down. You wouldn’t want to repeat this (there’s no good way to catch up), so Whiz would then skip processing a scanline. If Whiz took too long processing the next scanline too (which seems likely), what you’d end up with is a region where the screen was displayed at half-resolution, and shifted down by an additional pixel.

I’m not sure which of those tradeoffs is more acceptable.
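For what it’s worth, the choice amounts to something like this in the per-scanline hand-off (entirely hypothetical control flow; neither option is implemented):

    /* Sketch of the two recovery options when Iris needs the buffer
       but Whiz hasn't finished the scanline. */
    enum overrun_policy { DROP_FRONT_SPANS, REPEAT_AND_SKIP };

    extern void advance_line_buffers(void);
    extern void abandon_current_line(void);        /* ship as-is; frontmost sprites missing */
    extern void iris_repeat_previous_line(void);   /* one scanline shown twice */
    extern void whiz_skip_next_scanline(void);     /* skip a line to catch back up */

    void on_iris_needs_buffer(int whiz_done, enum overrun_policy policy)
    {
        if (whiz_done) {
            advance_line_buffers();                /* normal case */
        } else if (policy == DROP_FRONT_SPANS) {
            abandon_current_line();
            advance_line_buffers();
        } else {
            iris_repeat_previous_line();
            whiz_skip_next_scanline();
        }
    }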

One possible improvement would be to switch from back-to-front rendering to front-to-back rendering. Front-to-back can be done on existing GPUs with “destination alpha” and produces the same result as back-to-front alpha blending. However, it gives up the possibility of a multiplicative blend mode (though I think you can still do additive). If you store a separate per-channel RGB destination “alpha”, I believe you can make front-to-back work and still support a multiply blend mode as well. This would require doubling the size of the rendered line buffer, but that’s not too bad since the other line buffers are unaffected. It also means you have to clear the line buffer every frame. That’s a minor drawback; in back-to-front mode you don’t need to clear it, since you can just require the user to display something that touches every pixel (which they usually will anyway), and skipping the clear also lets demo coders do crazy stuff where they intentionally leave the previous line buffer’s contents around and get weird vertical smearing effects.
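Here’s my reconstruction of what front-to-back compositing with a per-channel destination “alpha” (really a transmittance) might look like; it shows why the rendered line buffer doubles in size and why it needs clearing each scanline:

    typedef struct { float r, g, b; } rgb;

    /* Each line-buffer entry stores accumulated color plus per-channel
       transmittance -- roughly double the storage of the back-to-front case. */
    typedef struct { rgb color; rgb trans; } fb_pixel;

    /* Front-to-back "over": add color weighted by how much light still gets
       through, then attenuate the transmittance. Storing the transmittance
       per channel is what would let a multiply blend mode scale the three
       channels independently. */
    void composite_front_to_back(fb_pixel *dst, rgb src, float src_a)
    {
        dst->color.r += src.r * src_a * dst->trans.r;
        dst->color.g += src.g * src_a * dst->trans.g;
        dst->color.b += src.b * src_a * dst->trans.b;
        dst->trans.r *= (1.0f - src_a);
        dst->trans.g *= (1.0f - src_a);
        dst->trans.b *= (1.0f - src_a);
        /* each scanline must start with color = 0, trans = 1 (the per-frame clear) */
    }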

Doing the front-to-back thing would be novel and clever but it would still have its own weird artifacts when time ran out — is the background turning black on some scanlines really appreciably better than the foreground dropping out? Obviously games should be tuned so they never have dropouts, so what we do in this case probably doesn’t matter that much. But nobody can test every case, so we’re talking about what happens when something unexpected happens.

In old consoles, when there was “too much stuff” going on, they ran out of hardware sprites, so games just multiplexed the sprites, making each object flicker because it was only displayed every other frame. You can obviously do the same in software on Joaf, but the problem is knowing when you need to, since there aren’t static sprite limits.

Probably I’ll just do whatever’s easiest first — I think that will be making Whiz bail and swap line buffers and start the next scanline — and see what that’s like.

online vs. offline

Last time I described how the graphics synthesizer Whiz sends graphics to the display generator Iris one scanline at a time via three line buffers. But there’s another reason why 3 is the magic number.

That post described “online” mode, in which the scene is synthesized just-in-time by Whiz to feed Iris. (Iris is always “just-in-time”, since Iris is generating the actual display and cannot fall behind without breaking the display.) The awesome thing about classic 2d game consoles that is also true in online mode is that your game just always runs at 60 hertz, because it pretty much has to. (Well, it could take longer than a frame to build the display list that drives Whiz, but it seems unlikely. At a minimum you could always have the player and some objects move at 60 hz; similarly, scrolling is cheap, etc.)

Because (as I’ll discuss in a future post) Whiz supports rotated sprites, it’s actually possible to use Whiz to generate Doom-style graphics (for the technically inclined, the key concept is “affine spans”). However, for various DRAM/caching reasons, it seems unlikely that Whiz will be able to generate Doom-style scenes at 60fps, even for small screen sizes (e.g. 320×200, or whatever NTSC interlaced shakes out as).
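For those same technically inclined readers: an affine span just steps texture coordinates by a constant amount per pixel, which is what makes rotated sprites and Doom-style walls and floors fall out of the same machinery. A sketch, with a made-up 16.16 fixed-point format and a placeholder texel fetch:

    #include <stdint.h>

    extern uint32_t fetch_texel(int u, int v);   /* placeholder texture fetch */

    /* Affine span: constant per-pixel steps (du, dv) through texture space. */
    void draw_affine_span(uint32_t *linebuf, int x0, int x1,
                          int32_t u, int32_t v, int32_t du, int32_t dv)
    {
        for (int x = x0; x < x1; ++x) {
            linebuf[x] = fetch_texel(u >> 16, v >> 16);   /* integer part of (u, v) */
            u += du;
            v += dv;
        }
    }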

So one of my innovations (I think) is that Whiz supports two modes: “online mode” being the mode described previously, and “offline mode” being one that supports rendering scenes at less than full frame rate.

As mentioned, Iris needs to always display at full frame rate. So the way offline mode works is there is a traditional fullscreen framebuffer. In offline mode, an offline-only part of Whiz is responsible for fetching the framebuffer and feeding it to Iris. To keep complexity down, Iris reads from one scanline buffer while this offline-only Whiz task writes to a second scanline buffer. When Iris finishes a scanline, it switches to the second buffer, and Whiz starts filling the first. (It would be possible to synchronize them more closely instead, but this shares more logic with online mode and allows Whiz to drive the DRAM at max speed to fetch a full scanline.)

Meanwhile, all of the rest of the Whiz circuitry is still present and active, synthesizing a scanline at a time into yet another scanline buffer (a third one). When this scanline is finished, it is flushed out to memory, to a second framebuffer; there need to be two framebuffers, just as with ordinary double-buffered rendering. Because this third line buffer isn’t ever sent directly to Iris, Whiz can take its time generating it, and end up generating at lower frame rates (30, 20, 15, variable, whatever).

Possibly there should be a fourth line buffer so Whiz can start rendering the next scanline without having to wait for the previous one to be flushed out to memory.
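Putting the offline-mode plumbing together, the buffer roles look roughly like this (a structural sketch; all of the names are invented):

    /* Offline mode, per the description above:
         SCANOUT_A / SCANOUT_B : ping-pong pair -- Iris scans one out while the
                                 offline-only Whiz task DMAs the next framebuffer
                                 scanline into the other.
         SYNTH_C               : the normal Whiz pipeline renders into this at its
                                 own pace, then flushes it to the back framebuffer.
                                 (An optional SYNTH_D would let rendering of the
                                 next line start while C is still flushing.)       */
    enum linebuf { SCANOUT_A, SCANOUT_B, SYNTH_C /*, SYNTH_D */ };

    extern void iris_display_line(enum linebuf b);
    extern void whiz_fetch_framebuffer_line(enum linebuf b);   /* front FB -> line buffer */
    extern void swap_scanout_buffers(void);                    /* A <-> B */
    extern int  whiz_synthesis_line_done(enum linebuf b);
    extern void whiz_flush_line_to_back_framebuffer(enum linebuf b);

    void offline_scanline_step(void)
    {
        /* display side: locked to Iris timing, every scanline */
        iris_display_line(SCANOUT_A);
        whiz_fetch_framebuffer_line(SCANOUT_B);
        swap_scanout_buffers();

        /* synthesis side: finishes whenever it finishes (30/20/15 fps, variable) */
        if (whiz_synthesis_line_done(SYNTH_C))
            whiz_flush_line_to_back_framebuffer(SYNTH_C);
    }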

Regardless, one of the nice things about this design is that you can use exactly the same code to build a display list and feed it to Whiz whether you’re running offline or online. For example, you could have the game run at 60fps but in-engine cutscenes run at a lower frame rate, yet both use the same drawing technology (one is just more complex scenes). Or if you detect that you’re failing to hit 60fps you could switch over temporarily to offline mode.

Another consequence of this design is that as long as the screen size Whiz synthesizes in offline mode is specified by a separate set of registers from the ones that determine what Iris displays, Whiz becomes a “general purpose” blitter that the game can use arbitrarily to generate its display, or even for other tasks.
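Something like two independent register blocks, in other words (these names and fields are entirely hypothetical, just to make the idea concrete):

    #include <stdint.h>

    /* What Whiz renders in offline mode... */
    struct whiz_offline_regs {
        uint32_t dest_base;     /* surface Whiz writes synthesized scanlines to */
        uint16_t dest_width;    /* dimensions of that surface, not of the screen */
        uint16_t dest_height;
        uint16_t dest_stride;
    };

    /* ...is decoupled from what Iris displays. */
    struct iris_display_regs {
        uint32_t scanout_base;  /* the framebuffer Iris actually scans out */
        uint16_t screen_width;  /* the real display dimensions */
        uint16_t screen_height;
    };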