This article describes a technique that doesn't quite work: applying "listening direction" changes separately from the rest of 3D mixing, specifically after multiple sounds have already been mixed together. Even if this worked, it would be a solution in search of a problem. But it doesn't actually work.

It is not immediately obvious that this is actually useful. One scenario where it could be useful would be one in which all of the following are true:

- The cost of mixing is significant (e.g. you are mixing thousands of 3D-spatialized sounds together)
- Most mixed sounds' 3D doppler and distance attenuation effects are approximately constant or at least predictable (i.e. can ignore user input)
- Mixed sounds' 3D rotation (pan) effect must be updated dynamically (e.g. due to mouselook, but only if failing to do so is audible)
- You need to mix "ahead" (e.g. on a platform where the OS might disrupt the mixing thread, so you have to mix more than one tick's worth of audio, even though you're respatializing every tick, thus "wasting" lots of mixing work most of the time)

I assume mixing occurs in floating-point or otherwise with wide range, with a final clamping to the output range after mixing is performed.

In the mixing model I propose, sounds are classified as "predictable" 3D sounds, modifiable 3D sounds, or 2D sounds, and each class is mixed into a separate buffer (actually, the 2D and modifiable buffers can be combined). A final mix step mixes these buffers together before the final clamp.

In this article I describe a mechanism that allows mixing all of the "predictable" sounds together, then applying a "rotation" step during the final mixing of the buffers before clamping. In the mix-ahead low-latency-rotation scenario, the predictable sounds are mixed significantly far ahead, but are never remixed. Each new predictable sound that is created is added into the existing predictable mix buffer. The "final mix" step must be executed for the entire premixed buffer every tick, but the final mix step is independent of the number of sounds being played.

The remainder of this article ignores the existence of the 2D and modifiable sounds.

We also ignore the issues with avoiding clicks when changing parameter values at discrete points in time, since these are easily avoided by using fixups (crossfades or parameter sweeps) during the final rotation step.

A simple linear panning model, with pan position t ranging from -1 (hard left) to 1 (hard right), is:

Left attenuation  = clamp(1-t, 0,1)
Right attenuation = clamp(1+t, 0,1)

which means that when t=0 (center), Left = Right = 1.

The model I use here instead assumes t ranges from 0 (left) to 1 (right):

Left  = cos(t*pi/2)
Right = sin(t*pi/2)

This is the "equal power" model of panning (left^2 + right^2 is constant).
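As a sanity check, here is a small Python sketch (the function name is mine, purely illustrative) confirming the equal-power property:

```python
import math

def equal_power_pan(t):
    """Equal-power pan gains for t in [0, 1]: 0 = hard left, 1 = hard right."""
    return math.cos(t * math.pi / 2), math.sin(t * math.pi / 2)

# The defining property: left^2 + right^2 is constant (here, 1) for every t.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    left, right = equal_power_pan(t)
    assert abs(left * left + right * right - 1.0) < 1e-12
```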

Now let D be the direction to the sound relative to straight ahead, with D = -pi/2 hard left and D = pi/2 hard right, so that t = D/pi + 1/2. Substituting this into the equal power model gives us:

L = cos(D/2+pi/4)
R = sin(D/2+pi/4)

However, attenuation values should always be positive, and the above will result in negative attenuation values for directions behind the viewer. To avoid this, we take the absolute value:

L = |cos(D/2+pi/4)|
R = |sin(D/2+pi/4)|

Finally, we want to explicitly separate out the view direction so we can alter it on the fly. Let the direction to the sound be D, and let V be the direction of the viewer (listener); then:

L = |cos(pi/4 + D/2 - V/2)|
R = |sin(pi/4 + D/2 - V/2)|

This model applies only panning as orientation spatialization. There are other possible orientation spatialization effects, such as having sounds behind the listener be further attenuated or lowpass filtered. These effects can only be applied in a very coarsely approximate way, discussed later.

To start, consider the simple problem in which we ignore the sign issue and remove the absolute-value computations from our attenuation calculation:

L = cos(pi/4 + D/2 - V/2)
R = sin(pi/4 + D/2 - V/2)

We can then apply trigonometric sum identities:

L = cos(pi/4 + D/2) * cos(-V/2) - sin(pi/4 + D/2) * sin(-V/2)
R = sin(pi/4 + D/2) * cos(-V/2) + cos(pi/4 + D/2) * sin(-V/2)

This is just a simple 2D rotation by -V/2.
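This identity is easy to verify numerically; the following Python sketch (illustrative, with arbitrary test angles) checks that rotating the V-independent weights by -V/2 reproduces the view-dependent weights:

```python
import math

def rotate(x, y, angle):
    """Standard 2D rotation of the point (x, y) by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c

D, V = 0.7, 1.3  # arbitrary sound direction and view direction, in radians
# Weights mixed ahead of time, independent of V:
L0 = math.cos(math.pi / 4 + D / 2)
R0 = math.sin(math.pi / 4 + D / 2)
# A single rotation by -V/2 recovers the view-dependent weights:
L, R = rotate(L0, R0, -V / 2)
assert abs(L - math.cos(math.pi / 4 + D / 2 - V / 2)) < 1e-12
assert abs(R - math.sin(math.pi / 4 + D / 2 - V / 2)) < 1e-12
```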

This makes the overall algorithm simple: mix each sound into the "predictable mix buffer" using the weights:

L = cos(pi/4 + D/2)
R = sin(pi/4 + D/2)

Then, during the final mix, apply the rotation by -V/2 based on the current listening direction. Because the rotation is linear, and because sound mixing is linear, this will correctly position all of the already-mixed sounds to match the listening direction, except with incorrect negative attenuations.
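Here is a minimal single-sample sketch of the scheme (the sound list and values are hypothetical), showing that linearity makes rotate-after-mix equal rotate-per-sound:

```python
import math

def rotate(x, y, angle):
    """2D rotation of (x, y) by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c

# Hypothetical sounds: (direction D, sample value) pairs for a single sample.
sounds = [(0.3, 0.5), (-0.9, 0.25), (2.0, -0.125)]

# Mix once into the predictable mix buffer using V-independent weights.
mixL = sum(s * math.cos(math.pi / 4 + D / 2) for D, s in sounds)
mixR = sum(s * math.sin(math.pi / 4 + D / 2) for D, s in sounds)

# Final mix: one rotation by -V/2, independent of the number of sounds.
V = 0.8
outL, outR = rotate(mixL, mixR, -V / 2)

# Linearity of both mixing and rotation means this matches mixing each
# sound directly with the (signed) view-dependent weights:
refL = sum(s * math.cos(math.pi / 4 + D / 2 - V / 2) for D, s in sounds)
refR = sum(s * math.sin(math.pi / 4 + D / 2 - V / 2) for D, s in sounds)
assert abs(outL - refL) < 1e-12
assert abs(outR - refR) < 1e-12
```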

To correct this, we must force attenuations to be positive:

L = |cos(pi/4 + D/2 - V/2)|
R = |sin(pi/4 + D/2 - V/2)|

We can't achieve this with a simple rotation.

To address this, we'll attempt to split these into discrete cases. First, let's define the absolute values of trig functions assuming angles in the range 0 .. 2pi.

|cos(a)| = a > pi/2 && a < 3pi/2 ? -cos(a) : cos(a)
|sin(a)| = a > pi ? -sin(a) : sin(a)

To perform the substitution, we first imagine defining a substituted variable:

E' = pi/4 + D/2 - V/2

However, to simplify notation, we'll use 2x that:

E = pi/2 + D - V

However, we must reduce E/2 mod 2pi, so E must be reduced mod 4pi:

E = (pi/2 + D - V) mod 4pi

L = |cos(E/2)| = E/2 > pi/2 && E/2 < 3pi/2 ? -cos(E/2) : cos(E/2)
R = |sin(E/2)| = E/2 > pi ? -sin(E/2) : sin(E/2)

We now express the conditions in terms of quadrants so we can understand them geometrically:

Q=0 if      0 <= E/2 < pi/2
Q=1 if   pi/2 <= E/2 < pi
Q=2 if     pi <= E/2 < 3*pi/2
Q=3 if 3*pi/2 <= E/2 < 2*pi

L = Q==1 || Q==2 ? -cos(E/2) : cos(E/2)
R = Q==2 || Q==3 ? -sin(E/2) : sin(E/2)

The reason this is interesting is that we could potentially classify each sound by which quadrant it is in, and then create separate mix buffers, one per quadrant. This would allow the rotation step to correctly compute a "rotation" which produces non-negative attenuations for the given mix buffer, since all the sounds in that buffer are in the same quadrant, so their sin()s and cos()s are consistently positive or negative (and thus so is their sum).
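The quadrant-based sign fixups above can be checked numerically with a quick Python sweep (illustrative only):

```python
import math

def quadrant(half_e):
    """Quadrant index (0..3) of an angle in [0, 2*pi)."""
    return int(half_e // (math.pi / 2))

# Sweep E/2 across (0, 2*pi), avoiding exact quadrant boundaries, and check
# the sign-fixed cos/sin against the absolute values.
for i in range(1000):
    half_e = (i + 0.5) * 2 * math.pi / 1000
    q = quadrant(half_e)
    L = -math.cos(half_e) if q in (1, 2) else math.cos(half_e)
    R = -math.sin(half_e) if q in (2, 3) else math.sin(half_e)
    assert abs(L - abs(math.cos(half_e))) < 1e-12
    assert abs(R - abs(math.sin(half_e))) < 1e-12
```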

We can't actually do this, though, because the quadrants of the final mixed sound depend on both the source direction and the listening direction.

So now we need to simplify the equation so we can classify based on the sound and listening direction as simply as possible.

We modify E as follows: first, fold the pi/2 into D:

D' = D + pi/2
E = (D' - V) mod 4pi

Substitute D' with D' mod 2pi, which won't affect the sound:

D'' = (D + pi/2) mod 2pi

Add another 2pi to D'', which won't affect the sound:

F = ((D + pi/2) mod 2pi) + 2pi

And now define E as:

E = (F - V) mod 4pi

Since F now ranges from 2pi..4pi, and V ranges from 0..2pi, F - V already lies in 0..4pi, so we can drop the mod:

E = F - V

Q=0 if    0 <= F-V < pi
Q=1 if   pi <= F-V < 2*pi
Q=2 if 2*pi <= F-V < 3*pi
Q=3 if 3*pi <= F-V < 4*pi

Here F is a value that depends only on D, the source direction, which we assume is constant.

The problem is that we can go no further. For any given value of F, there is some V for which a small change in V changes which quadrant (F-V)/2 is in, thus requiring an extra negation. For two values of F very near each other, which have been mixed into a shared buffer, the V at which this transition occurs is different; but since they're already mixed and we must choose a single negation, there's no way to avoid one or the other ending up with an inverted phase.
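The failure is easy to exhibit concretely. In this Python sketch (illustrative; the function names are mine), two sounds with nearly identical directions land in different quadrants for some view directions, so no single sign choice for a shared buffer can be right for both:

```python
import math

TWO_PI = 2 * math.pi

def F(D):
    """Per-sound constant from the derivation: depends only on direction D."""
    return (D + math.pi / 2) % TWO_PI + TWO_PI

def quadrant(D, V):
    """Quadrant (0..3) of E/2 = (F(D) - V)/2, for V in [0, 2*pi)."""
    return int((F(D) - V) // math.pi)

D1, D2 = 0.50, 0.52  # two sounds, nearly the same direction
# Sweep V over [0, 2*pi): for some V the two sounds disagree on quadrant.
vs = [k / 1000 * TWO_PI for k in range(1000)]
assert any(quadrant(D1, v) != quadrant(D2, v) for v in vs)
```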

What I was hoping was that you could classify the sounds into quadrants or octants or some such, or even more finely. Although storing a separate mix buffer for each quadrant/octant would use more memory and require more mixing work, it would still be independent of the total number of sounds. It would also allow you to coarsely apply other effects like attenuation and low-pass filtering, but only on a per-quadrant/octant basis.

It's possible that if you divided sounds into octants, the amount of "inverted phase" you would get would be small enough that it wouldn't be objectionable, but I have no reason to think this is true and so I assume instead this technique doesn't work. (Clearly, if you divided directions into small enough sectors, eventually there'd be some division where it was unobjectionable, but that might involve so many mix buffers that it is slower than the naive implementation.)

Actually, I guess it's easier to set aside the math and understand it this way: there's a plane through the listener. Sounds that transition from the front halfspace to the back halfspace must flip the sign of one of the two attenuations. Thus when the listener rotates, any sounds which cross that plane cannot work with this mechanism. It is possible you could "unmix" them from the premixed buffer and remix them on the new side, but then you still pay a big cost on rapid listener orientation changes (though not as big as the naive implementation, since you only need to update each sound every 180 degrees of rotation).

The plain rotation for the "simple problem" introduces phase inversions which are unacceptable for 3D spatialization. However, it is an interesting effect in the environment of music effects processing.

For example, given a mono signal, a stereo delay line can be set up with a rotation in the feedback path, and the echoes generated by the delay will rotate around in stereo. As the echoes pass through the quadrants, some of the time an echo will have one channel with inverted phase, producing a distinctive but not objectionable stereo effect. Alternatively, only one channel of the delay can be output, producing a delay whose echoes fade in and out over time (some will be inverted, but this is inaudible in mono).
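As a sketch of the music-effect idea (parameter values are arbitrary, and this is offline sample-by-sample Python rather than a real-time implementation):

```python
import math

def rotate(x, y, angle):
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c

def rotating_delay(mono_in, delay_samples, feedback, angle):
    """Stereo delay line with a 2D rotation by `angle` in the feedback path.

    Each trip through the delay rotates the echo's stereo position, so
    successive echoes walk around the (signed) stereo field; in some
    quadrants one channel comes back phase-inverted.
    """
    n = len(mono_in)
    out_l = [0.0] * n
    out_r = [0.0] * n
    buf_l = [0.0] * delay_samples  # circular delay buffers
    buf_r = [0.0] * delay_samples
    for i in range(n):
        j = i % delay_samples
        out_l[i] = mono_in[i] + buf_l[j]
        out_r[i] = mono_in[i] + buf_r[j]
        # Attenuate by the feedback amount, then rotate before the signal
        # re-enters the delay line.
        buf_l[j], buf_r[j] = rotate(out_l[i] * feedback,
                                    out_r[i] * feedback, angle)
    return out_l, out_r

# Feed in an impulse: the echoes at samples 10, 20, 30, ... rotate in stereo.
left, right = rotating_delay([1.0] + [0.0] * 99, 10, 0.8, math.pi / 3)
```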