A Failed Idea: Post-Mix Panning For 3D View Direction Changes

Sean Barrett, July 2012

Abstract

This article describes a technique that doesn't quite work for applying "listening direction" changes separately from other 3D mixing, specifically after mixing together multiple audio sounds. Even if this worked, it would be a solution in search of a problem. But it doesn't actually work.

Overview

In this article, I present a close-but-no-cigar failed technique that attempts to allow mixing 3D-positioned audio voices together, then performing additional 3D rotations of the mixed buffer.

It is not immediately obvious that this is actually useful. One scenario where it could be useful would be one in which all of the following are true:

The cost of mixing is significant (e.g. you are mixing thousands of 3D-spatialized sounds together)
Most mixed sounds' 3D doppler and distance attenuation effects are approximately constant or at least predictable (i.e. can ignore user input)
Mixed sounds' 3D rotation (pan) effect must be updated dynamically (e.g. due to mouselook, but only if failing to do so is audible)
You need to mix "ahead" (e.g. on a platform where the OS might disrupt the mixing thread, so you have to mix more than one tick's worth of audio, even though you're respatializing every tick, thus "wasting" lots of mixing work most of the time)

Mixing model

I assume there are three kinds of sounds: "2D" sounds which are not spatialized relative to a moving listener; "modifiable" sounds which must have their distance attenuation, doppler, or other non-pan effects updated regularly, or which can stop playing under interactive control; and "predictable" sounds, which have their distance attenuation, doppler, etc. set once when the sound begins playing, and which once started, cannot be stopped/canceled or otherwise modified.

I assume mixing occurs in floating-point or otherwise with wide range, with a final clamping to the output range after mixing is performed.

In the mixing model I propose, each class of sounds is mixed into separate buffers (actually, 2D and modifiable can be combined). A final mix step mixes these buffers together before the final clamp.

In this article I describe a mechanism that allows mixing all of the "predictable" sounds together, then applying a "rotation" step during the final mixing of the buffers before clamping. In the mix-ahead low-latency-rotation scenario, the predictable sounds are mixed significantly far ahead, but are never remixed. Each new predictable sound that is created is added into the existing predictable mix buffer. The "final mix" step must be executed for the entire premixed buffer every tick, but the final mix step is independent of the number of sounds being played.
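The buffer arrangement above might be sketched as follows. This is only a minimal mono illustration: the names `add_predictable` and `final_mix` are hypothetical, panning and the rotation step are left out, and sounds that run past the end of the mix-ahead buffer are simply truncated.

```python
def add_predictable(mix_ahead, start, samples, gain):
    """Add one predictable sound into the existing mix-ahead buffer.

    The gain is fixed when the sound starts; the buffer is never remixed.
    """
    for i, x in enumerate(samples):
        if start + i < len(mix_ahead):
            mix_ahead[start + i] += gain * x


def final_mix(predictable, other):
    """Sum the per-class buffers, then clamp to the output range [-1, 1].

    The cost depends on the buffer length, not on how many sounds were mixed.
    """
    return [max(-1.0, min(1.0, p + o)) for p, o in zip(predictable, other)]


mix_ahead = [0.0] * 8
add_predictable(mix_ahead, 0, [0.5, 0.5, 0.5, 0.5], gain=1.0)
add_predictable(mix_ahead, 2, [1.0, 1.0, 1.0], gain=0.8)  # a sound created later
other = [0.0] * 8  # stands in for the combined 2D + modifiable buffer
print(final_mix(mix_ahead, other))
```

The point of the sketch is that `final_mix` runs over the whole premixed buffer every tick, but never looks at individual sounds.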

The remainder of this article ignores the existence of the 2D and modifiable sounds.

We also ignore the issue of avoiding clicks when changing parameter values at discrete points in time, since these are easily avoided by using fixups (crossfades or parameter sweeps) during the final rotation step.

Spatialization model

I assume that the non-directional aspects of spatialization (such as distance attenuation, doppler, and reverb) are independent of listener orientation, and so can be applied while mixing sounds together. I also assume the effect of the viewpoint tilting up/down is ignored; equivalently, after mixing, all sounds are assumed to lie in a single plane around the viewer.

Panning model

Given a variable t which ranges from -1 (left) to 1 (right), one classic model is:

Left  attenuation = clamp( 1-t, 0,1)
Right attenuation = clamp( 1+t, 0,1)

which means that when t=0, Left = Right = 1.
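As a quick check of this classic model, here is a small sketch (the function names are my own for illustration). It takes the right attenuation as clamp(1+t, 0, 1), which is the form that gives Left = Right = 1 at t=0.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def linear_pan(t):
    """Gains for pan position t in [-1, 1] (-1 = left, 1 = right)."""
    left = clamp(1 - t, 0, 1)
    right = clamp(1 + t, 0, 1)
    return left, right


for t in (-1, -0.5, 0, 0.5, 1):
    print(t, linear_pan(t))
```

Each channel stays at full volume on its own side and only fades out as the sound moves to the opposite side.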

The model I use here assumes t ranges from 0 (left) to 1 (right):

Left  = cos(t*pi/2)
Right = sin(t*pi/2)

This is the "equal power" model of panning (left^2 + right^2 is constant).
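A minimal sketch of the equal power model, with a check that the summed power stays constant across the pan range (`equal_power_pan` is just an illustrative name):

```python
import math


def equal_power_pan(t):
    """Return (left, right) gains for pan position t in [0, 1]."""
    angle = t * math.pi / 2
    return math.cos(angle), math.sin(angle)


for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    left, right = equal_power_pan(t)
    power = left * left + right * right
    print(f"t={t:.2f}  L={left:.4f}  R={right:.4f}  power={power:.4f}")
```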

3D spatialization

Given a sound in direction D expressed in radians, we can then map that direction to the t value of the above equation. Assuming D=0 is centered, then we want D=-pi/2 to be t=0, and D=pi/2 to be t=1, so t = (D+pi/2)/pi.

Substituting this into the equal power model gives us:

   L = cos(D/2+pi/4)
   R = sin(D/2+pi/4)

However, attenuation values should never be negative, and the above will result in a negative attenuation value for directions behind the viewer. To avoid this, we take the absolute value:

   L = |cos(D/2+pi/4)|
   R = |sin(D/2+pi/4)|

Finally, we want to explicitly separate out view direction so we can alter it on the fly. Let the direction to the sound be D, and let V be direction of the viewer (listener), then:

   L = |cos(pi/4 + D/2 - V/2)|
   R = |sin(pi/4 + D/2 - V/2)|
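In code, this pan law might look like the sketch below. The function name `pan_gains` is hypothetical; angles are in radians, with D=0 straight ahead and positive angles to the right, as assumed above.

```python
import math


def pan_gains(d, v=0.0):
    """(left, right) gains for source direction d and view direction v."""
    a = math.pi / 4 + d / 2 - v / 2
    return abs(math.cos(a)), abs(math.sin(a))


for name, d in (("left", -math.pi / 2), ("front", 0.0),
                ("right", math.pi / 2), ("behind", math.pi)):
    left, right = pan_gains(d)
    print(f"{name:6s}  L={left:.4f}  R={right:.4f}")
```

Only the difference D - V matters, so turning the listener by some angle gives the same gains as moving the source by the opposite angle.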

This model applies only panning as orientation spatialization. There are other possible orientation spatialization effects, such as having sounds behind the listener be further attenuated or lowpass filtered. These effects can only be applied in a very coarsely approximate way, discussed later.

Simplified problem

With an unacceptable approximation

Suppose we consider the phase inversions produced by negative attenuation values to be acceptable (they're not).

We could then remove the absolute-value computations from our attenuation calculation:

   L = cos(pi/4 + D/2 - V/2)
   R = sin(pi/4 + D/2 - V/2)

We can then apply trigonometric sum identities:

   L = cos(pi/4 + D/2) * cos(-V/2) - sin(pi/4 + D/2) * sin(-V/2)
   R = sin(pi/4 + D/2) * cos(-V/2) + cos(pi/4 + D/2) * sin(-V/2)

This is just a simple 2D rotation by -V/2.

This makes the overall algorithm simple: mix each sound into the "predictable mix buffer" using the weights:

   L = cos(pi/4 + D/2)
   R = sin(pi/4 + D/2)

Then, during final mix, apply the rotation by -V/2 based on the current listening direction. Because the rotation is linear, and because mixing sounds is linear, this will correctly position all of the already-mixed sounds to match the viewer listening direction, except with incorrect negative attenuations.
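The following sketch checks the linearity claim numerically: it premixes some random test signals with the V=0 weights, rotates the stereo buffer by -V/2, and compares the result against remixing every sound with the signed (no absolute value) gains for V. All names and the random test data are my own assumptions for illustration.

```python
import math
import random


def signed_gains(d, v=0.0):
    a = math.pi / 4 + d / 2 - v / 2
    return math.cos(a), math.sin(a)


def mix(sounds, n, v=0.0):
    """Mix (direction, samples) pairs into a stereo buffer, signed gains."""
    left, right = [0.0] * n, [0.0] * n
    for d, samples in sounds:
        gl, gr = signed_gains(d, v)
        for i in range(n):
            left[i] += gl * samples[i]
            right[i] += gr * samples[i]
    return left, right


def rotate(left, right, v):
    """Final-mix step: 2D rotation of each stereo sample by -v/2."""
    c, s = math.cos(-v / 2), math.sin(-v / 2)
    out_l = [c * l - s * r for l, r in zip(left, right)]
    out_r = [s * l + c * r for l, r in zip(left, right)]
    return out_l, out_r


random.seed(1)
n = 64
sounds = [(random.uniform(-math.pi, math.pi),
           [random.uniform(-1, 1) for _ in range(n)]) for _ in range(10)]

premixed = mix(sounds, n)               # done once, ahead of time
v = math.radians(70)
fast_l, fast_r = rotate(*premixed, v)   # cost independent of sound count
slow_l, slow_r = mix(sounds, n, v)      # reference: remix everything
error = max(abs(a - b) for a, b in zip(fast_l + fast_r, slow_l + slow_r))
print("max difference:", error)
```

The two results agree to within rounding error, but both still contain the negative gains for sounds behind the listener.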

Full problem

For which there does not appear to be a solution

The above solution does not work correctly because sometimes the attenuations are negative, and the phases of the sounds invert.

To correct this, we must force attenuations to be positive:

   L = |cos(pi/4 + D/2 - V/2)|
   R = |sin(pi/4 + D/2 - V/2)|

We can't achieve this with a simple rotation.

To address this, we'll attempt to split these into discrete cases. First, let's define the absolute values of trig functions assuming angles in the range 0 .. 2pi.

   |cos(a)| = a > pi/2 && a < 3pi/2 ? -cos(a) : cos(a)
   |sin(a)| = a > pi                ? -sin(a) : sin(a)

To perform the substitution, we first imagine defining a substituted variable:

   E' = pi/4 + D/2 - V/2

However, to simplify notation, we'll use 2x that:

   E = pi/2 + D - V

However, we must wrap E/2 to the range 0 .. 2pi, so E must wrap to 0 .. 4pi:

   E = (pi/2 + D - V) mod 4pi

   L = |cos(E/2)| = E/2 > pi/2 && E/2 < 3pi/2 ? -cos(E/2) : cos(E/2)
   R = |sin(E/2)| = E/2 > pi                ? -sin(E/2) : sin(E/2)

We now express the conditions in terms of quadrants so we can understand it geometrically:

   Q=0 if    0   <= E/2 <   pi/2
   Q=1 if   pi/2 <= E/2 <   pi
   Q=2 if   pi   <= E/2 < 3*pi/2
   Q=3 if 3*pi/2 <= E/2 < 2*pi

   L = Q==1 || Q == 2 ? -cos(E/2) : cos(E/2)
   R = Q==2 || Q == 3 ? -sin(E/2) : sin(E/2)
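The quadrant rule can be checked against the absolute-value form with a short sketch. The helper names are hypothetical, and the `min(..., 3)` guard is my own addition to protect against floating-point rounding at the top of the range.

```python
import math


def wrapped_e(d, v):
    """E = (pi/2 + D - V) mod 4pi."""
    return (math.pi / 2 + d - v) % (4 * math.pi)


def quadrant(e):
    """Quadrant index (0..3) of E/2."""
    return min(int((e / 2) // (math.pi / 2)), 3)


def pan_gains(d, v):
    """Gains via the per-quadrant sign flips rather than abs()."""
    e = wrapped_e(d, v)
    q = quadrant(e)
    c, s = math.cos(e / 2), math.sin(e / 2)
    left = -c if q in (1, 2) else c
    right = -s if q in (2, 3) else s
    return left, right


worst = 0.0
for d_deg in range(-180, 181, 5):
    for v_deg in range(0, 360, 5):
        d, v = math.radians(d_deg), math.radians(v_deg)
        a = math.pi / 4 + d / 2 - v / 2
        left, right = pan_gains(d, v)
        worst = max(worst, abs(left - abs(math.cos(a))),
                    abs(right - abs(math.sin(a))))
print("max difference from abs() form:", worst)
```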

The reason this is interesting is because we can potentially separately classify each sound into which quadrant it is, and then create separate mix buffers, one per quadrant. This would then allow the rotation step to correctly compute a "rotation" which produces a non-negative attenuation for the given mix buffer, since all the sounds in the mix buffer are in the same quadrant, and thus their sin()s and cos()s are consistently positive or negative (and thus so is their sum).

We can't actually do this, though, because the quadrants of the final mixed sound depend on both the source direction and the listening direction.

So now we need to simplify the equation so we can classify based on the sound and listening direction as simply as possible.

We modify E as follows: combine the pi/2 into D:

   D' = D + pi/2
   E = (D' - V) mod 4pi

Substitute D' with D' mod 2pi, which won't affect the sound:

   D'' = (D + pi/2) mod 2pi

Add another 2pi to D'', which won't affect the sound:

   F = ((D + pi/2) mod 2pi) + 2pi

And now define E as:

   E = (F - V) mod 4pi

Since F now ranges from 2pi..4pi, and V ranges from 0..2pi, F - V always lies within 0..4pi and we can drop the mod:

   E = F - V

   Q=0 if    0   <= F-V <   pi
   Q=1 if   pi   <= F-V < 2*pi
   Q=2 if 2*pi   <= F-V < 3*pi
   Q=3 if 3*pi   <= F-V < 4*pi

Here F is a value that depends only on D, the source direction, which we assume is constant.
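A sketch of this reformulation, assuming V has already been reduced to the range 0..2pi. The function names are illustrative only; the check at the end confirms that classifying by F - V gives the same gains as the absolute-value formula.

```python
import math

TWO_PI = 2 * math.pi


def source_term(d):
    """F: depends only on the source direction d; lies in 2pi..4pi."""
    return ((d + math.pi / 2) % TWO_PI) + TWO_PI


def quadrant(d, v):
    """Quadrant of E/2 where E = F - V, for v in 0..2pi."""
    e = source_term(d) - v
    return min(int(e // math.pi), 3)  # guard against rounding at 4pi


def gains(d, v):
    e = source_term(d) - v
    q = quadrant(d, v)
    c, s = math.cos(e / 2), math.sin(e / 2)
    return (-c if q in (1, 2) else c), (-s if q in (2, 3) else s)


worst = 0.0
for d_deg in range(-180, 181, 5):
    for v_deg in range(0, 360, 5):
        d, v = math.radians(d_deg), math.radians(v_deg)
        a = math.pi / 4 + d / 2 - v / 2
        left, right = gains(d, v)
        worst = max(worst, abs(left - abs(math.cos(a))),
                    abs(right - abs(math.sin(a))))
print("max difference from abs() form:", worst)
```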

The problem is that we can go no further. For any given value of F, there is some V at which an arbitrarily small change in V changes which quadrant (F-V)/2 is in, thus requiring an extra negation. For two F's very near each other, which have been mixed into a shared buffer, the V at which this transition occurs is different, but since they're already mixed and we must choose a single negation for the whole buffer, there's no way to avoid one or the other ending up with a negative phase.
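To make the conflict concrete, here is a small sketch with two sources 20 degrees apart (the directions, view angles, and function names are arbitrary choices of mine). It compares the signs of the two sources' left gains after the rotation; whenever they disagree, no single sign flip applied to the shared buffer can make both positive.

```python
import math


def signed_gains(d, v):
    a = math.pi / 4 + d / 2 - v / 2
    return math.cos(a), math.sin(a)


def left_signs_agree(d1, d2, v):
    """True if one sign choice for the left channel suits both sources."""
    l1, _ = signed_gains(d1, v)
    l2, _ = signed_gains(d2, v)
    return (l1 > 0) == (l2 > 0)


d1, d2 = math.radians(80), math.radians(100)
for v_deg in (345, 0, 15):
    v = math.radians(v_deg)
    print(f"V={v_deg:3d} deg  left signs agree: {left_signs_agree(d1, d2, v)}")
```

With V=0 one source is just in front of the listener's side-to-side plane and the other just behind it, so their left gains have opposite signs.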

What I was hoping was that you could classify the sounds into quadrants or octants or some such, even more finely. Although storing a separate mix buffer for each quadrant/octant would use more memory and require more mixing work, it would still be independent of the total number of sounds. It would also allow you to coarsely apply other effects like attenuation and low-pass filtering, but only on a per-quadrant/octant basis.

It's possible that if you divided sounds into octants, the amount of "inverted phase" you would get would be small enough that it wouldn't be objectionable, but I have no reason to think this is true and so I assume instead this technique doesn't work. (Clearly, if you divided directions into small enough sectors, eventually there'd be some division where it was unobjectionable, but that might involve so many mix buffers that it is slower than the naive implementation.)

Actually, I guess it's easier to set aside the math and understand it this way: there's a plane through the listener. Sounds that transition from being in front of the listener (in the front halfspace) to being behind the listener (in the back halfspace) must flip the sign of one of the two attenuations. Thus when the listener rotates, any objects which cross that plane cannot work with this mechanism. It is possible you could "unmix" them from the premixed buffer and remix them on the new side, but then you still pay a big cost on rapid listener orientation changes (but not as big as the naive implementation, since you only need to update sounds every 180 degrees).
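A rough sketch of the halfspace test follows. It relies on the identity cos(a)*sin(a) = cos(D-V)/2 for a = pi/4 + D/2 - V/2, so the two signed gains have the same sign exactly when the sound is in front of the listener. The names `in_front` and `needs_remix` are hypothetical, and this only identifies which sounds would have to be unmixed and remixed; it does not do the remixing.

```python
import math


def in_front(d, v):
    """True if source direction d is in the listener's front halfspace."""
    return math.cos(d - v) > 0


def needs_remix(d, v_mixed, v_now):
    """True if the sound has crossed the plane since it was mixed."""
    return in_front(d, v_mixed) != in_front(d, v_now)


directions = [math.radians(x) for x in (10, 80, 100, 170, -60)]
v_old, v_new = math.radians(0), math.radians(30)
for d in directions:
    print(f"D={math.degrees(d):6.1f}  remix: {needs_remix(d, v_old, v_new)}")
```

In this example only the source at 100 degrees crosses the plane when the listener turns from 0 to 30 degrees.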

Conclusion

Oh well, this doesn't work.

The plain rotation for the "simple problem" introduces phase inversions which are unacceptable for 3D spatialization. However, it is an interesting effect in the environment of music effects processing.

For example, given a mono signal, a stereo delay line can be set up with a rotation in the feedback path, and the echoes generated by the delay will rotate around in stereo. As the echoes go through the quadrants, some of the time the echoes will have one channel with inverted phase, producing a distinctive but not objectionable stereo effect. Alternatively, only one channel of the delay can be output, producing a delay whose echoes fade in and out over time (some will be inverted, but this is inaudible in mono).
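A minimal sketch of such a delay is below. The details are my own assumptions rather than a reference design: the mono input is fed into the left channel only, the delay length is a whole number of samples, and the output includes the dry signal.

```python
import math


def rotating_delay(mono, delay, angle, feedback=0.7):
    """Stereo feedback delay whose feedback path rotates the stereo pair.

    Echo k of an impulse ends up at angle k * `angle`, scaled by
    feedback ** k.
    """
    n = len(mono)
    out_l, out_r = [0.0] * n, [0.0] * n
    c, s = math.cos(angle), math.sin(angle)
    for i in range(n):
        # Read the stereo output from `delay` samples ago.
        dl = out_l[i - delay] if i >= delay else 0.0
        dr = out_r[i - delay] if i >= delay else 0.0
        # Rotate the delayed pair, scale it, and add the new input.
        out_l[i] = mono[i] + feedback * (c * dl - s * dr)
        out_r[i] = feedback * (s * dl + c * dr)
    return out_l, out_r


impulse = [1.0] + [0.0] * 15
left, right = rotating_delay(impulse, delay=4, angle=math.radians(60))
for k in range(4):
    print(k, round(left[4 * k], 4), round(right[4 * k], 4))
```

With a 60 degree rotation per echo, the second echo already has a negative left gain, which is the inverted-phase channel described above. Listening to `left` alone gives the mono variant whose echoes fade in and out.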