Since this is a reverse-chronological blog, here’s a list of the sAGAs posts so far, in correct order. I highly recommend reading these theory posts in order, not reverse order.
1. sAGAs – Milkymist
2. sAGAs – FPGA synthesizer
3. sAGAs – Synthesis
4. sAGAs – Oscillator design
The design for sAGAs is primarily limited by the number of “DSP” multipliers available on the FPGA. The design proceeds by trying to keep those multipliers in use every cycle, and adding whatever logic is needed to try to achieve that.
The core sAGAs unit is called the “oscillator”. The oscillator is capable of computing two interpolated-table-lookups each cycle and computing two more multiplies (in various configurations), and summing the results with a buffer stored in on-chip RAM. (It can also compute an IIR biquad filter, which requires five multiplies when using low-precision fixed-point multiplies, so we actually use five multiply units and waste one when we’re not doing biquads.)
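To make the multiplier budget concrete, here is a minimal software sketch of a single interpolated table lookup. It assumes linear interpolation (which fits the one-multiply-per-lookup budget) and a 16.16 fixed-point address; the name `interp_lookup`, the table contents, and the bit widths are illustrative assumptions, not the actual hardware's.

```python
FRAC_BITS = 16                     # assumed: 16 fractional bits in the address
FRAC_MASK = (1 << FRAC_BITS) - 1

def interp_lookup(table, phase):
    """Linearly interpolate `table` at a fixed-point `phase`, wrapping at the end.

    The only multiply is frac * (next - current); everything else is adds,
    shifts, and masks.
    """
    index = (phase >> FRAC_BITS) % len(table)
    frac = phase & FRAC_MASK
    current = table[index]
    following = table[(index + 1) % len(table)]
    return current + (((following - current) * frac) >> FRAC_BITS)

table = [0, 100, 200, 100]
one = 1 << FRAC_BITS
half = 1 << (FRAC_BITS - 1)
print(interp_lookup(table, one))          # exactly on entry 1
print(interp_lookup(table, one + half))   # halfway between entries 1 and 2
```

Two of these per cycle, plus the two extra multiplies, account for the four multipliers the synthesis modes use.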
Now, the system was originally designed primarily around the faux-analog band-limited synthesis problem, and it turns out to be pretty wasteful of multipliers for that. So in practice, the oscillator uses the multipliers as best it can (even though some configurations use none of them), and the rest of the system is designed to keep the core oscillator pipeline itself from ever stalling, even if the multipliers sometimes sit idle.
So, let’s look at how the oscillator design evolved as I added synthesis methods.
- MinBLEP – To do our “analog band-limited synthesis”, we need to be able to “draw straight lines” and accumulate them in a buffer, and then add “MinBLEP” and “MinBLAMP” fixups into the buffer as well. The fixups are implemented as interpolated table lookups times a constant. The interpolated table lookups require a DDA-style iterated fixed-point variable used as the address. For “draw straight lines”, we just need to use the iterated fixed-point variable itself as the value to put in the audio buffer, instead of the lookup.
- Granular – Granular synthesis requires grabbing sections of PCM data, applying a window to it, and adding it to a buffer. We can use the interpolated-table-lookup from above for the PCM data, and then we need another table-lookup to access the window. For flexibility, we should probably make that one an interpolated lookup as well, so a single window table can be resized. So now we require two interpolated lookups multiplied by each other–so one multiply per lookup, plus one to multiply them together. (The same hardware can also be used for ring modulation of PCM waveforms.)
- What about FM synthesis? To do FM synthesis, we need to feed the output of one interpolated table lookup into the input of the second one. To do this, it turns out we need a fourth multiplier (we need to scale the first table lookup before using it for addressing, and we need to scale the second table lookup for output to the buffer), but otherwise it doesn’t take much hardware. Unlike the other synthesis methods, I don’t know how to avoid aliasing here, so this one is speculative.
- For additive synthesis, we can use the table lookup for sine waves. Can we use both table lookups? We don’t want to multiply them or feed one into the other, we just want to add them. So each needs its own scalar–so we need four multiplies total, same as FM.
Also, we need to be able to control the amplitude of each sine wave independently, and they need to be continuous across audio frames, so we need to be able to ramp the amplitudes. That means the scalars we multiply by need to themselves be DDA walks.
(My original thought was that avoiding discontinuities as the amplitude of partials changed could be done by mixing, say, 16 samples at the old amplitude, running that through the mono-to-stereo mixer with a 1-to-0 ramp, then mixing the full frame of samples with the new amplitudes, then using the mono-to-stereo mixer with a 0-to-1 ramp for the first 16 samples, then just a multiply-by-1 for the rest of the samples. But I decided that seemed clumsy and wasteful compared to just making everything DDA stepped. Note also that I was originally thinking that since the on-chip buffers will have 256 samples, you would want to not waste them, and use something like 200 samples per frame, which meant that computing the 200-sample DDA step size would be expensive (a divide). But for latency you want shorter frames anyway, so now I’m assuming you’ll use 64-sample frames (or 128-sample frames if you’re oversampling), so now the DDA step size is just a shift, or possibly not even a shift: I may make the step size input be shifted by 6 bits internally, so the CPU feeding it can just subtract end-start.)
- Waveshaping can be done with the FM path (compute a waveform, then run that waveform through a table lookup).
- Because we can turn on and off whether each of the two table lookups happens, we can actually draw two lines simultaneously, one with each lookup. Adding these together is pointless (the result is still a line), but multiplying them produces a quadratic–but in an annoying form for specifying (you have to factor the quadratic). However, since we forced the other multiplication factors to ALSO be DDAs, we can actually express a more convenient quadratic by computing two lines multiplied together to only be the ax^2 term, and then another line to be the bx+c term. This isn’t necessarily useful at all, but it’s basically free at this point (we just need a few extra control signals in the hardware, no extra multipliers or adders).
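The DDA idea that runs through the list above can be sketched in a few lines of software. This is a rough model only, assuming 64-sample frames and 16 fractional bits for amplitudes; `Dda`, `frame_ramp`, `mix_scaled`, and `mix_quadratic` are hypothetical names, and the way the three lines are combined in `mix_quadratic` is one plausible reading of the description, not the actual data path.

```python
FRAME = 64        # samples per frame, as assumed in the text above
FRAME_SHIFT = 6   # log2(FRAME): dividing by the frame length is a shift
FRAC_BITS = 16    # assumed fraction width for amplitudes (1.0 == 1 << 16)

class Dda:
    """Fixed-point value that advances by a constant step each sample."""
    def __init__(self, start, step):
        self.value = start
        self.step = step

    def next(self):
        current = self.value
        self.value += self.step
        return current

def frame_ramp(start, end):
    """DDA walking from start to end across one frame; no divide needed."""
    return Dda(start, (end - start) >> FRAME_SHIFT)

def mix_scaled(buffer, samples, amp):
    """buffer[n] += samples[n] * amplitude, where the amplitude is itself a DDA."""
    for n in range(FRAME):
        buffer[n] += (samples[n] * amp.next()) >> FRAC_BITS

def mix_quadratic(buffer, a, b, c):
    """Lookups off: (x) * (a*x) gives a*x^2, and a third line adds b*x + c."""
    line_x = Dda(0, 1)
    line_ax = Dda(0, a)
    line_bx_c = Dda(c, b)
    for n in range(FRAME):
        buffer[n] += line_x.next() * line_ax.next() + line_bx_c.next()

# Fade a constant signal in from silence to full scale over one frame.
faded = [0] * FRAME
mix_scaled(faded, [32767] * FRAME, frame_ramp(0, 1 << FRAC_BITS))

# Draw 3x^2 + 2x + 5 using nothing but stepped lines and one product.
curve = [0] * FRAME
mix_quadratic(curve, 3, 2, 5)
print(faded[:3], curve[:4])
```

Because the frame length is a power of two, `frame_ramp` never divides, which is the point of choosing 64-sample frames over something like 200.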
To achieve all of this, the core oscillator unit is pipelined (around 10 stages deep), and it uses many separate RAM blocks to make sure it has all the memory bandwidth it needs to keep the spice data flowing.
This pipeline is what that circuit diagram I posted before is showing (and it includes the biquad pipeline, which shares the multipliers as well):

Control signals haven’t been pipelined yet.
If you look at it full-size, here’s the deal: the bottom-most register (on the left) controls whether the biquad pipe or the synthesis pipe is active. If the biquad pipe is active, all the multipliers compute A*B+C, and C is always fed in as the sum from the previous multiplier. This computes a Direct Form I biquad, which is best for low-precision multipliers (especially with higher-precision adds), as here where Spartan-6 gives 18×18 multiplies and 48-bit adds. (It’s best because in direct form, you only ever form products of your filter coefficients with input values and output values, never intermediate temp values. If the output values don’t clip (you normally don’t want them to clip!), this means you can use all the available precision for the multiplies.)
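As a rough software model of that chain, the sketch below computes a Direct Form I biquad as five A*B+C stages, each taking the previous stage's sum as C. The Q14 coefficient format, the stage order, and the names `mac` and `BiquadDF1` are assumptions for illustration; it ignores pipelining and does not enforce the 18-bit and 48-bit width limits.

```python
COEF_FRAC = 14   # assumed coefficient format: 14 fractional bits

def mac(a, b, c):
    """One multiplier unit in biquad mode: A*B + C at full accumulator width."""
    return a * b + c

def to_fixed(coef):
    """Convert a real-valued coefficient to the assumed fixed-point format."""
    return round(coef * (1 << COEF_FRAC))

class BiquadDF1:
    """y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""

    def __init__(self, b0, b1, b2, a1, a2):
        # a1/a2 are stored negated so every stage is a plain A*B+C.
        self.coefs = tuple(to_fixed(c) for c in (b0, b1, b2, -a1, -a2))
        self.x1 = self.x2 = self.y1 = self.y2 = 0

    def step(self, x):
        acc = 0
        # Only inputs and past outputs are multiplied, never intermediate sums.
        for coef, value in zip(self.coefs, (x, self.x1, self.x2, self.y1, self.y2)):
            acc = mac(coef, value, acc)
        y = acc >> COEF_FRAC          # single truncation, after the wide sum
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y

# y[n] = 0.5*x[n] + 0.5*y[n-1]: a simple smoothing filter fed a step input.
smoother = BiquadDF1(0.5, 0.0, 0.0, -0.5, 0.0)
step_response = [smoother.step(1000) for _ in range(4)]
print(step_response)
```

Note that the last two stages consume `y1` and `y2`, so in hardware the freshly computed output has to reach those multipliers in time for the next sample, which is the feedback path discussed next.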
The multiplier units have optional internal registers, and for consistency I’ve enabled them for the B and D inputs, which otherwise have the highest latency from input to output of the multipliers. Most people don’t try to pump FPGA biquads at one output per cycle, because you have to feed the output result back into the last multiplier (last two, actually), and that can become a critical path. So the recommendation is to emit them every other cycle (or such), and crank the clock rate up as high as possible. Since Milkymist is only clocked at 100 MHz, however, I don’t have the option to crank the rate. The latency of the A input to the multiplier output (register) is 6.40 ns, so I’m assuming that the FPGA can manage to route that output back to the input in 3 ns and meet the timing requirements (6.40 ns + 3 ns = 9.40 ns, inside the 10 ns period of a 100 MHz clock).
I may expand the above discussion and move it to a separate post, so don’t be surprised if it shows up again.
As I mentioned previously, one flaw of this design is that the MinBLEP path, which was the original motivation for this whole system, actually only uses two multipliers: one for the interpolated table lookup and one to rescale it to the right magnitude, exactly the same as is used for, say, additive sine-wave synthesis. But unlike additive, I can’t compute two of these at the same time and add them, because they’re offset in time. (It could be done if there was padding at each end of the table, and I use the extents of both, but that’s inefficient and kind of gross. I have some other ideas for things to try, but they’re probably too expensive for the CPU.) An alternative strategy would be to add in even more hardware so that the two scaled interpolated lookups can be added into two different positions in the buffer, or, since this is more likely to be efficiently implementable, two different buffers. Unfortunately, neither of these seems very feasible; the former requires RAM with two write ports and two read ports, which can’t be efficiently implemented with FPGA block RAM; the latter, by doubling the number of oscillator “voice buffers”, requires a massive increase in muxes for the other components that read/write those buffers, the details of which I’ll discuss more in the next installment.
3 comments ↓
“the former requires RAM with two write ports and two read ports, which can’t be efficiently implemented with FPGA block RAM;”
I ran into the exact same problem. What I ended up doing was pipelining the operations on the block RAM and double-clocking the block RAM. I found this to be much faster than the multiplexer approach when I was making my project. However, in my case I needed to do 8 reads and 4 writes per clock.
Makes sense. How hard was it to clock the block RAM higher than the rest of the chip? That seems like advanced stuff I want to stay away from for now.
I found it was surprisingly easy. You just use one of the PLLs to generate a clock signal that is twice the frequency of the main clock signal and feed it to the block RAM’s clock input. Then alternate between the two needed positions on every cycle of the faster clock, for example…
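wire clock, clock2;
wire [14:0] ram_read_addressA;
wire [14:0] ram_read_addressB;
reg [31:0] ram_read_dataA;   // assigned in the always block, so reg rather than wire
reg [31:0] ram_read_dataB;
reg [14:0] ram_address;
wire [31:0] ram_data;        // same width as the read-data registers
bit state;                   // SystemVerilog 2-state type; use reg in plain Verilog
//etc
pll p1(.in_clk(clock),.o_clk(clock2)); //Some xilinx specific pll
// On each edge of the 2x clock, present one address and capture the other port's data
always@(posedge clock2) begin
  state<=!state;
  if(state) begin
    ram_address<=ram_read_addressA;
    ram_read_dataB<=ram_data;
  end else begin
    ram_address<=ram_read_addressB;
    ram_read_dataA<=ram_data;
  end
end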