The Compositing Tree
How To Make A 2D Renderer With A Totally Different Performance Profile Than A 3D Renderer
Sean Barrett, April 2014
Although most discussion of GPUs is focused on 3D applications,
the reality is that they are also used for 2D: 2D indie games,
2D industry games, 2D user interface in 3D games, and even
operating systems, e.g. Windows Aero.
There doesn't seem to be much discussion of concerns that
are unique to 2D amongst people who worry about making GPUs
fast and making code that uses the GPU fast, so I thought I'd
change that. Given the number of people who
say "don't do that! why would you do that?" when I raise
some issue with why my 2D UI software uses the GPU in unexpected
ways, it's clear there are at least some surprises here.
Back-to-Front
The most important thing about 2D graphics is that people's expectation
of image quality is different.
We've tolerated poor anti-aliasing in 3D
graphics since the early days of
I, Robot and
Ultima Underworld because there is so much visual pleasure in seeing those
non-anti-aliased edges and texels move around in 3D. Of course, our tolerance for
this is shifting now as GPU power and rendering algorithms have made
the performance of real-time anti-aliasing more reasonable.
However, for a much longer time, we haven't tolerated aliased edges in 2D graphics,
especially in 2D graphics rendered on GPUs.
That's because we can easily use the alpha channel to represent pixel
coverage, and use alpha-blended compositing to simulate anti-aliasing.
And GPUs are good at alpha-blending, so it's just assumed that
if you're doing 2D graphics on the GPU, you'll do alpha-composited
anti-aliasing.
How does that work? Here's how we normally think about it: we have
a framebuffer f and an input color c with an alpha
value α, and we update the framebuffer with
f' = f (1 - α) + c α
We have to perform a sequence of these operations, one per object being
drawn. Each one appears "on top of" the previous ones. When we do this
in 3D graphics, we call this "back-to-front rendering", since we draw
the objects in the
reverse order of their depth.
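In code, one application of that update per object might look like the following minimal CPU-side sketch (hypothetical names, straight non-premultiplied alpha); on the GPU it is just a sequence of draws with source-over blending enabled.

// Minimal sketch of back-to-front alpha compositing with straight
// (non-premultiplied) alpha. Types and names here are hypothetical.
typedef struct { float r, g, b, a; } Color;

// one blend step: f' = f*(1 - alpha) + c*alpha
static Color blend_onto(Color f, Color c)
{
    Color out;
    out.r = f.r * (1.0f - c.a) + c.r * c.a;
    out.g = f.g * (1.0f - c.a) + c.g * c.a;
    out.b = f.b * (1.0f - c.a) + c.b * c.a;
    out.a = 1.0f;   // treat the framebuffer as opaque
    return out;
}

// composite n objects into one framebuffer pixel, most-occluded first
Color composite_pixel(Color framebuffer, const Color *objects, int n)
{
    for (int i = 0; i < n; i++)
        framebuffer = blend_onto(framebuffer, objects[i]);
    return framebuffer;
}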
Layers in 2D
In 2D, we don't necessarily really have objects with depth. Typically
we think of layers, like the layers in Photoshop— for a physical
analogy, we can think of the layers we might see in something
like a hand-drawn Disney cartoon, which used multiple physically-separated
layers to achieve parallax and focus effects.
So to draw these layers we draw them from "back to front"—
from most-occluded to least-occluded.
We could assign these layers depth values and attempt to draw
them with a z-buffer anyway, but until hardware is capable of
rendering "order-independent transparency", this is not viable;
the alpha-blending happens in the order that we draw, not in
the z-order, and this means we still need to draw everything
in the "back-to-front" ordering.
This means we can't easily apply certain optimizations, like
sorting draw calls based on shaders or textures; instead we
have a fixed order in which we have to draw everything. Depending
on what you're doing, this isn't necessarily a problem; in
Jonathan Blow's Braid,
most of the rendering of objects used the same shader and a small number
of texture atlases, so the code just runs through the objects
in "layer" order and creates quads in a pending vertex
buffer using some texture atlas; if the next object to draw
needs a different texture (or shader), the code issues the
pending draw and starts a new vertex buffer using the new
texture.
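That batching loop is roughly something like this sketch (hypothetical types and helpers; not Braid's actual code):

// Layer-order quad batching: quads that share a texture atlas accumulate
// in a pending vertex buffer; a change of atlas (or shader) flushes the
// pending draw and starts a new batch. flush_pending_quads() is assumed
// to be a no-op when the buffer is empty.
void draw_layers_in_order(Object **objects, int count)
{
    Texture      *current_atlas = NULL;
    VertexBuffer  pending = {0};

    for (int i = 0; i < count; i++) {          // objects already in layer order
        Object *o = objects[i];
        if (o->atlas != current_atlas) {
            flush_pending_quads(&pending, current_atlas);  // issue the draw call
            current_atlas = o->atlas;
        }
        append_quad(&pending, o);              // add 4 vertices / 6 indices
    }
    flush_pending_quads(&pending, current_atlas);          // final draw
}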
Instead of reordering the draw order, the main optimization
opportunity is in trying to reduce the number of times a new
vertex buffer is started, which is mainly about minimizing
the number of atlases and choosing which atlas to place
sprites in wisely.
This is a useful example of how 2D graphics are different.
Sometimes graphics people will advise you about some technique,
"don't do that, it's bad for performance", but what they actually
mean is "don't do that, you'll break the GPU's early-depth-test
system, which will slow down culling pixels that aren't visible
due to depth-testing", but since these 2D graphics aren't using
depth tests—every pixel we attempt to draw is drawn[1]—their
originally-expressed concern isn't relevant.
A good rule-of-thumb when doing regular 3D graphics may just
not apply in 2D.
[1] It's actually possible to draw
transparent objects front-to-back, and so it's theoretically
possible hardware could do some kind of hierarchical rejection
based on destination alpha and actually accelerate this case,
saving pixel shading, ROP, and bandwidth in cases with significant
opaque depth complexity, but as far as I know
no hardware does, nor am I saying this would be "worth it".
UI Draw Order
In a 2D game, it may be straightforward to assign each object
to some Disney-style compositing layer, but this is not how
2D user interfaces are normally structured. Typically, 2D
user interfaces involve "widgets" which "contain" other widgets;
for example, a window might contain a scrollbar, and the scrollbar
contains several buttons and a draggable "thumb".
Our normal way of understanding this is as a hierarchy:
UI widgets can be nested deeper: inside regions, inside dialog
boxes, scrollrects inside scrollrects, etc.
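A minimal way to picture that containment in code (a hypothetical sketch, not any particular toolkit's types):

// Hypothetical widget tree: each widget owns an ordered list of children,
// drawn background-first within their parent.
typedef struct Widget Widget;
struct Widget {
    float    x, y, w, h;     // bounds relative to the parent
    Widget **children;
    int      num_children;
};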
Most UI systems are designed so they do not need very
much layering control by making sure that objects never overlap within
a single container; in the above example, the only overlap
is the thumb over the track. This means that in some UI systems,
most of the time, the ordering constraints are minimal, just
that the background elements must be drawn before the foreground
elements—though of course the text or icon on a button
means there's at least a few layers.
Even in such systems, though, if you can pop up a dialog box
in front of other content, there needs to be some more
sophisticated draw order control than just a set of small, fixed
layers.
Hierarchical Draw Order
Suppose we have an arbitrary hierarchy of 2D objects that we need
to draw in some order:
One obvious "solution" to this is to just draw in some tree-traversal
order. If we were lucky enough that the levels of the hierarchy
represented the layers, a breadth-first traversal would be
appropriate. But assuming that's not the case, the only other
convenient option is a depth-first traversal. (If we assume
graphics only appear in leaves, we can ignore "preorder" vs
"postorder", as the results are the same either way.)
Unlike classic UI systems, the UI software I work on allows
designers to author content in Flash, and that means my
rendering model must match Flash's rendering model. Flash constructs
a full object hierarchy like that shown (it's explicitly
addressable and manipulable from Actionscript), but with hierarchical
2D transformations at every step, so all objects can overlap
other objects. The draw order is well-defined, and is a depth-first
traversal.
Even with the draw-order constrained by Flash to a linear
order for every object, it is actually still possible
to change the order objects are drawn in: if a given
object doesn't overlap its draw-order neighbors then it
can be shifted in the draw order. If we can shift two similar
objects far enough in the order that they are adjacent, then we can
reduce shader/texture changes or even combine them into a
single draw call.
Regardless, my code definitely just does a depth-first traversal.
Because this is what the Flash draw order is defined as.
Or is it?!?
The OVER operator
There's another way of defining how alpha blending works; instead
of defining it as working on a framebuffer, we can define it as
an "operator" that takes two inputs and outputs a third.
x over y = y (1 - xα) + x    (where xα is the alpha of x)
(Here I've switched to using premultiplied alpha, because the
over operator takes an alpha-channel from both its inputs and
outputs a new alpha channel—and that output color & alpha is
already premultiplied, so it makes the inputs and outputs be the
same "type", so we can feed the output of one to the input of another.)
As an operator, over can now be nested in more complicated
expressions; just like we can write
x + (y+3) * z
we can write
x + (y * w) over z
although it's more normal to use other compositing operators,
and, not having explained any, I'll just use
x over ((y over w) over z)
as a clear example.
Expression Trees and Compositing Trees
Expressions can be thought of as implicitly hierarchical, and
this is very familiar to compiler writers; for example, the expression
"x + (y+3) * z" can be encoded as:
and similarly "x over ((y over w) over z)" can
be encoded as
If we draw a series of objects y1..yn
into the framebuffer f in linear order,
we can clumsily represent this as computing
... y5 over (y4 over (y3 over (y2 over (y1 over f))))
which we can represent with the similarly clumsy (nearly left-skewed)
binary tree
which we can call a "compositing tree".
To give a more concrete example, here's an example display hierarchy,
and the compositing tree needed to render it,
assuming a depth-first traversal:
Do you suspect where this is heading now? Why I said "Or is it?"
A Useful Property of Over
A useful property of over is that it is associative;
that is,
(x over y) over z = x over (y over z)
This also means we can write "x over y over z" and it's unambiguous
what we mean.
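To spell out why (my expansion, using the premultiplied definition above and writing xα for x's alpha): since (x over y)α = yα (1 - xα) + xα, we have 1 - (x over y)α = (1 - xα)(1 - yα), so both groupings expand to the same expression:

(x over y) over z = z (1 - (x over y)α) + (x over y) = z (1 - xα)(1 - yα) + y (1 - xα) + x
x over (y over z) = (y over z)(1 - xα) + x = z (1 - yα)(1 - xα) + y (1 - xα) + x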
If we look at this as expression trees, this means
We can massage any tree of "overs" by a succession of rotations like that shown
above, and as a result both of the following trees are the same
which in turn means that another way to express how to turn that sample Flash display
hierarchy into a compositing tree would be:
although I've cheated here by switching from a depth-first traversal that's left-first
to a depth-first traversal that's right-first (to use the original left-first depth-first
traversal and keep the same tree shape, you'd have to use the under operator
instead of the over operator).
And it turns out that this is the definition of the Flash rendering model. Dun Dun Dun!
As an optimization, we rotate the nodes of the tree into the
left-skewed form, but that's because what GPUs are efficient at
computing is
new-framebuffer = transformed-but-essentially-static-single-object over old-framebuffer
If we were to try to compute directly from the more complicated tree shape,
we would need somewhere to store the intermediate results, because the framebuffer
implementation of over can't directly accept a complicated compositing subtree as
its left input (whereas the right input is naturally the result of a complicated
compositing subtree which was previously output to the framebuffer). Concretely,
we would need additional rendertargets to write those results
into, and this would be less efficient and use more memory.
This is a familiar
problem from compiler optimization,
especially code-generation, where you may want to minimize the
number of temporary registers needed to compute an expression.
(By the way, I've glossed over the handling of nodes with more
than two children. That's because while the over operator
is binary, due to associativity it's unambiguous what
"x over y over z" means, and it's still clear
how to render a node with three children. In practice, you'll
have to turn that into a little two-over structure favoring
one direction or the other, but which you choose doesn't
matter for this discussion.)
The Flash Rendering Model
The reason why we can transform from the original-hierarchy tree into
the left-skew tree is because the tree only contains the over
operator, and over is associative. If we were to put something
into the tree which wasn't associative with over, we wouldn't
be able to "rotate" the tree at that node.
Flash provides compositing operations that are not associative
with over.
And this is why I say that the hierarchical compositing model
is the Flash rendering model, rather than "depth-first-traversal
framebuffer blending". Without those extra operations, the two
models would be mathematically equivalent, and there'd be no good
reason to favor one over the other.
Here's an example in Flash that seems to favor the depth-first model: if
you assign an alpha value to an object, it affects that object and
all the descendents of that object. Each one is drawn separately with
that alpha value. If you put an opaque object in front of another
opaque object so the second one is hidden, and set the alpha value
on an ancestor of both objects, they will both receive that alpha
value, and both be drawn independently with that alpha value, and
the second object will become visible through the first object.
This is probably not what you really want if you try to
fade out an object, but it is the default meaning (which turns
out to be convenient for GPU efficiency).
However, you can set a special mode on the ancestor object (the
filter mode "Layer") which tells Flash that it needs to apply the
alpha to the object's descendents "all at once". If you do this, the first opaque
object will become transparent without exposing the second object.
In GPU terms, we draw the entire subtree of that ancestor to a
rendertarget, then we composite the result to the framebuffer with an
extra alpha value multiplied in.
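As a small worked example in the notation used above (mine, not Flash's): let a be the ancestor's alpha, let c1 and c2 be two overlapping opaque premultiplied children with c2 in front, and write a·x for x with both its color and alpha scaled by a. The default mode composites each scaled child separately,

f' = (a·c2) over ((a·c1) over f)

and since a·c2 now has alpha a < 1, some of a·c1 (and of f) shows through it. The "Layer" mode composites the subtree first and scales the result,

f' = (a·(c2 over c1)) over f = (a·c2) over f

because c2 is opaque, so c2 over c1 = c2, and the hidden child stays hidden.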
Flash Filter Effects and Blend Modes
Flash has a number of "filter" effects, raster effects that may
be familiar from Photoshop, such as drop-shadows, blurs, faux-bevel
effects, etc, which my code implements using GPU shaders. Each of
these affects the entire subtree of the object it's applied to;
in other words, each of these intrudes on the tree as a non-associative
operator—that is,
each of these requires rendertargets.
(Of course we provide a separate fast path for rendering drop-shadows
under text without needing to use GPU shaders, but for
non-text objects the GPU shader is the only available method.)
Additionally, Flash has a number of blend modes, also borrowed
from Photoshop. These are also not associative with over
and so similarly prevent tree rotation.
(Because most of the hardware we target doesn't support programmable
blending, we have to handle this by getting the target framebuffer
data into a texture. We actually do this the "dumb" way, by copying
the framebuffer into a texture just-in-time, rather than the "smart" way, by
having already redirected the entire right side of the operator
into a texture, because if we did the latter we'd have to fill
(composite) the entire framebuffer since the output target would
be empty. The "smart" way is more efficient for large blends, but
the "dumb" way is actually more efficient for small blends.)
Rendering Non-Associative Compositing Trees on the GPU
The above shows a hypothetical display/compositing tree with some
of the nodes colored black if they have non-associative operators,
and a roughly corresponding left-skew compositing tree where all
of the over operators have been rotated into the preferred form
of blending into a "framebuffer". (I'm only bothering to show the
operators and not the leaves with the graphics on them, and I didn't
bother making the converted tree super-accurate.)
Real trees produced in Flash do not normally have that density
of non-associative compositing operators. However, they may have
500, 1000, even 2000 nodes in the tree. As a result, some people
using this software will have many more total non-associative
operators than the example above.
Each non-associative node requires an extra rendertarget into
which to compute and temporarily hold onto the value of its left
subtree.
Rendertarget Usage Patterns
Normal use of rendertargets on the GPU is to draw each rendertarget
sequentially to completion, e.g. something like this:
- Select rendertarget 1, clear it
- Draw into it
- Select rendertarget 2, clear it
- Draw into it
- Select rendertarget 3, clear it
- Draw into it, using targets 1 and 2 as textures
This is the approach for, for example, drawing shadow depth
buffers before drawing objects; for drawing into g-buffers
and then doing deferred shading; or for applying post-processing
effects.
Here is a less common model:
- Select rendertarget 1, clear it
- Draw into it
- Select rendertarget 2
- Draw into it using target 1 as texture
- Select rendertarget 1, clear it
- Draw into it
- Select rendertarget 2
- Draw into it using target 1 as texture
- repeat
Note that while we clear rendertarget 1, we're slowly building
up an image in rendertarget 2 without clearing it. An example
of doing this in 3D rendering is creating shadow depth buffers
for deferred shading; each light computes a new depth buffer,
but we can compute them sequentially in the same rendertarget,
applying them successively to accumulate the final result in
the main "framebuffer".
Non-Associative Compositing Trees Rendertarget Usage
In our UI system, we could give every instance of
a non-over operator in the tree its own rendertarget,
and proceed as in the first approach above; this would
minimize the number of rendertarget changes, and especially
the rendertarget changes without clears (there would be
none). However, this could
require an unpredictably large amount of rendertarget memory,
and allocating the variably-sized rendertargets would be
tricky and probably a source of performance problems (consider
the changing bounding-box of a rotating object, and keep
in mind that we can't predict how objects will move, given
that they are controlled by Actionscript).
We instead use the second approach. Much as compilers can try
to minimize the number of registers needed when generating the
code to evaluate an expression, we can try to minimize the
number of rendertargets needed to composite the scene. We
do this in a naive, greedy way, as (a) this whole system was
designed before I understood this compositing tree model,
and (b) we can't afford to analyze a 1000-node tree explicitly anyway.
The greedy algorithm we use basically ends up needing a number of
rendertargets equal to the max count of non-over nodes from
any leaf to the root in the original tree. This also results in worst-case
twice as many rendertarget changes as the first approach
would have used (in the typical case, we have to switch
back to the framebuffer after we finish drawing to a rendertarget).
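Roughly, the recursion looks like this sketch (hypothetical types and helpers; the real code also tracks transforms and bounding boxes). Plain over nodes keep blending into whatever target their parent was using, while a non-over node grabs the next fullscreen rendertarget from a pool, renders its subtree there, then applies its operator while compositing that result back into the parent's target:

// Greedy rendering of a compositing tree. The pool index only advances
// when we pass a non-over node, so the number of rendertargets needed is
// the maximum count of non-over nodes on any root-to-leaf path.
typedef struct Node Node;
struct Node {
    int    is_plain_over;   // 1 for ordinary over, 0 for filter/blend/Layer nodes
    Node **children;
    int    num_children;
};

void render_node(Node *n, RenderTarget *dest, int next_free_rt)
{
    if (n->num_children == 0) {
        draw_leaf(n, dest);                              // hypothetical
        return;
    }
    if (n->is_plain_over) {
        for (int i = 0; i < n->num_children; i++)
            render_node(n->children[i], dest, next_free_rt);
    } else {
        RenderTarget *tmp = rendertarget_pool[next_free_rt];
        clear_bounding_box(tmp, n);                      // clear only what we touch
        for (int i = 0; i < n->num_children; i++)
            render_node(n->children[i], tmp, next_free_rt + 1);
        apply_operator_and_composite(n, tmp, dest);      // filter/blend into dest
    }
}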
We might need 1-2 rendertargets to composite most of the
Flash rendering hierarchies we see, since non-over
operators are rarely nested. (While I call these a small
number of rendertargets, they are not an insubstantial
amount of memory on last-generation consoles; we need an extra
rendertarget for a few other things like blurred dropshadows,
and 4 total 32-bit 1080p rendertargets is 32MB.)
To avoid problems with variable-sizing, we simply require each
of the rendertargets to be fullscreen-sized, so they are reusable at any
needed size. Of course in the worst case this could require
more rendertarget memory than is needed for the other approach
(perhaps hundreds of small rendertargets could all fit in the
space of two large ones), but in other cases it requires less,
and it's much easier for users of the software to predict how
much memory will be needed.
When rendering, the code tracks bounding boxes and when reusing
old rendertargets it only needs to clear the bounding box. (We do
use the zbuffer for some special trickery for drawing Flash vector
graphics, although we've set it up so that we don't need to clear
it when changing rendertargets; however, on OpenGL, which doesn't
allow reusing the default framebuffer's depth buffer as a rendertarget's
depth buffer, we actually have to clear both color and depth
when starting up a "new" (reused) rendertarget. Thus we may issue
an unexpectedly large (by 3D renderer standards) number of depth-and-color
clears in OpenGL.)
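The bounding-box clear itself is just a scissored clear, something like this fragment (hypothetical bbox):

// Clear only the region of the reused rendertarget we are about to draw into;
// on OpenGL the depth buffer has to be cleared along with the color.
glEnable(GL_SCISSOR_TEST);
glScissor(bbox.x, bbox.y, bbox.w, bbox.h);
glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
glDisable(GL_SCISSOR_TEST);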
Rendertargets and Mobile GPU Bandwidth
The upshot of the above is that, for some Flash files, we may
switch rendertargets dozens of times. In the common case,
every alternate time we switch, we switch back to the 'main'
framebuffer and we don't clear it, as described above.
This is not a workload GPU creators have generally been expecting.
For example, this performs terribly on most mobile hardware.
Alpha-blended rendering requires a lot of bandwidth, as the values
in the framebuffer must be read and written for every pixel.
While desktop hardware solves this by using very fast GDDR-based
memory and simply providing all the bandwidth we need, most mobile
hardware addresses this by putting the framebuffer
or current rendertarget in a special RAM buffer on-chip; the
on-chip-ness means the bandwidth is essentially free. Because there
is a limit to how much RAM they can squeeze into the chip, they treat
that RAM as a small "tile" of the framebuffer/rendertarget. They
cache all of the draw commands for a frame, and play them back
repeatedly for each tile of the display.
Since what's drawn in rendertarget A can be used as an input
when drawing rendertarget B, it's a little more subtle: to
a first approximation, all the draw calls between a pair of
rendertarget changes are buffered, and then played back for
each tile of that rendertarget, and then this is repeated
up to the next rendertarget change, etc. At the start of each tile,
it's necessary to load the previous data for the rendertarget
from memory (unless it begins with a clear), and when finished
with the tile, write it back to memory. These require main
memory bandwidth, but none of the draw calls themselves require
main memory bandwidth for blending to the rendertarget.
However, in the worst case of the Flash model, we might change
rendertargets after every individual draw call. This means
that we would have to read and write every (relevant) tile every
time we draw, and that puts us back to using the same amount
of bandwidth as the desktop GPU does (or possibly more due to
reading/writing whole tiles).
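To put rough numbers on that (my arithmetic, not measured data):

1920 * 1080 pixels * 4 bytes ≈ 8.3 MB per fullscreen 32-bit rendertarget
tile load + tile store ≈ 16.6 MB of main-memory traffic per rendertarget switch
30 switches/frame * 60 frames/sec ≈ 30 GB/s of tile traffic alone

which is already a large fraction of (or more than) the total memory bandwidth of a typical mobile part, before counting any texture fetches.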
Of course, we don't ever even come close to that worst-case;
over predominates. But we do see incredibly
divergent performance between desktop parts and mobile parts; the
more artists use non-over operators, the greater the
divergence.
(A new feature of some mobile tiled GPUs is some extra
per-pixel storage that lives in the same tile RAM. We could potentially use
this as our "register" temporary storage for our non-over blend
modes, because these always map input & output pixels to the same spot
in the rendertargets, i.e. we
don't address the rendertargets arbitrarily. However, for filter effects
that do offset pixel sampling, like dropshadows and blurs, this will
not work.)
Our current solution is just to say "avoid using filter effects
and blend modes on mobile platforms". And that works as a process;
it's a method by which we can live with the technology difference
between the GPUs. But just because we have a process workaround
doesn't mean it's not a problem! It's still a technology divergence;
a rendering
approach well-supported on one class of GPUs is terribly supported
on another class, and it's a problem
that makes me nervous when I hear people discuss possibly ending
up with tile-based GPUs on the desktop.
Conclusion
So yeah. Some 2D renderers can have very different performance profiles
from what people who are used to 3D renderers might imagine.