The Frustration of Seeking in Ogg Vorbis

I'm not qualified to say whether the Ogg specification qualifies as "amateur hour". I am qualified to say that the facilities it provides for seeking are woefully insufficient for a reasonable implementation, and that it would not cost much space to do better.

Ogg Vorbis audio streams are divided into audio "frames", which are typically about 250 or 1000 samples long (each stream uses frames of two different lengths, intermixed arbitrarily). The entire stream is divided up into these frames. Complicating things, the frames overlap each other, normally by 50%, so that any given output audio sample is actually a weighted combination of samples from two frames. (The exception is when a short frame and a long frame are adjacent; in some sense the short frame is extended with 0s, but in another sense there's just a bunch of samples from the long frame which aren't overlapped.)

Frames can encode to a fairly small amount of data; at a 10:1 compression ratio, 250 stereo 16-bit samples (1000 bytes raw) compress to only 100 bytes. So Ogg Vorbis doesn't provide seeking information at the frame level. Instead, collections of frames are organized into "pages". The file is divided into pages, and pages are divided into frames. Each page is variable width and there's no index of pages, so instead each page begins with a 4-byte signature, and includes a CRC32 for its contents (so that you can verify that a 4-byte signature is actually the beginning of a page and not just compressed audio data that happens to be those 4 bytes).

Pages have explicit timing info. (To determine the timing for a given frame, you have to get the time for the page and then extrapolate the time for the frame.) Specifically, each Ogg page stores an 8-byte integer which is the "time"; in Ogg Vorbis, this time is measured in audio samples.

So it's pretty straightforward to take an Ogg Vorbis file, seek around in it, find a page, and extract the time from it. Then you can do a binary search (or better yet an interpolation search) and locate a page near the sample you're attempting to seek to. You can then start playing from there, and with a typical page size (4KB), you've just seeked to within a quarter second of your goal.
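(A minimal sketch of that basic operation in C, assuming the whole file is in memory; the function name is mine, not from any library, and this skips the CRC32 check that real code needs in order to reject false-positive signatures. The "OggS" capture pattern and the field offsets are straight from the Ogg spec:)

    #include <stdint.h>
    #include <string.h>

    /* Scan forward from 'pos' for the next "OggS" capture pattern and
       return the byte offset of the page and its granule position
       (the "time", in samples). Returns 0 if no page is found. */
    static int vorbis_find_page(const uint8_t *file, size_t len, size_t pos,
                                size_t *page_start, int64_t *granule)
    {
       for (; pos + 27 <= len; ++pos) {
          if (memcmp(file + pos, "OggS", 4) != 0)
             continue;                  /* not a page signature */
          /* granule position: bytes 6..13 of the header, little-endian */
          uint64_t g = 0;
          for (int i = 0; i < 8; ++i)
             g |= (uint64_t) file[pos + 6 + i] << (8 * i);
          *page_start = pos;
          *granule    = (int64_t) g;
          return 1;
       }
       return 0;
    }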
But ideally what you'd like is sample-accurate seeking. It would be nice to be given a target sample number N, to seek, and guarantee that the very next sample you decode is sample number N. You'd think that would be easy, right?

Well, if you're making a media player, and you're assuming a user is seeking, you can probably afford to do some extra decoding. So it's not hard; rewind to a page or two before the target sample, start decoding, and throw away all the data before the target sample. Voila!

But what if you don't want to waste processing power decoding unnecessary data? Maybe you're the Ogg Vorbis library for a game which is trying to play music backwards. Or you've encoded a bunch of separate sounds in one file and you want to seek to and decode each of them, and you need to minimize overhead since you're doing it a lot. Then you start running into problems.

1. The "sample location" in each page (which Ogg calls the "granule
   position") is defined as the sample number of the last-decoded
   sample on the page.[*]

2. Except it's not on the last page of the file; it has a special duty
   there. (You can tell you have the last page from a flag in the page
   header.)

3. And it's also not on the first page of the file; it has a different
   special meaning there. (You can tell you have the first page from
   another flag in the page header.)[**]

[*] I'm not even positive what this means yet (I don't have sample-accurate streaming working), but it's ambiguous because, remember, every output sample is summed from two overlapped frames. Thus the last sample from a given frame is not really a 'decoded sample', since it must be summed with a sample from the next frame before it can be played back.

[**] If your Ogg Vorbis stream has only one page, the granule position now has two conflicting meanings. Whoops! Apparently if you use the special meaning on the first page, you're supposed to have a minimum of two pages.

EFFICIENT IMPLEMENTATION #1

Given a page, and the "last decoded sample" in it, you can work out the "first decoded sample" in it by grovelling through each of the frames in the page to see how long it is, and using the rules for overlapping windows and such to decipher how long the page is overall. With this information, while you're binary searching, you can tell if a given sample falls within a page and stop the binary search, because you know everything you need to know about that page.

Except this doesn't work for the last page, where you _can't_ figure out the first decoded sample from the last decoded sample, because of the special meaning: the granule position on the last page is defined as the length of the whole file in samples. In other words, the raw decoded data must end up being a multiple of the frame size (well, a sum of multiples of the two frame sizes, and I'm ignoring the overlapped window again). To allow the decoded stream to not be such a multiple (e.g. for gapless playback), it's necessary to pad the end of the file to a frame multiple but store how much padding there was. Ogg Vorbis does this implicitly, by asking the decoder to work out the sample number of the first sample of the last page by some other means, then discard samples at the end to match the specified file length. What other means? Perhaps by looking at the previous page's granule position.

The upshot of this is that this method is perfectly viable if you're willing to first gather some information about the last page in a separate pass and cache it. If not, then you'll have a binary search that knows the first and last sample number of many of the pages that it probes, but not of the last page. This leads to a messy boundary case in the binary search, and lots of extra code, so I punted it. If I have to infer the first decoded sample of the last page from the last decoded sample of the previous page, I might as well just do that all the time, and the binary search becomes simpler since I'm just comparing against one number in a regular way:

REASONABLE IMPLEMENTATION #2

Ok, let's bail on that. Let's just binary search using the end location of each page. What do we do once we get there? We end up with two adjacent pages, and we know the target sample falls between the end of the first page and the end of the second page... that means it's on the second page, so we just decode that. The details are a little subtle because of the overlapping, but look closely and it's really true: if the last sample (plus one, technically) of a page is the last sample that can be decoded from that page, then the very next sample must be fully decodable from the following page. So, yeah, you bail on the first page and start with the second page.
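(A sketch of that search in C, reusing the hypothetical vorbis_find_page() from earlier. It assumes the target isn't on the very first page, and--foreshadowing--it assumes every page's granule position is a usable end-sample number:)

    /* Bisect over byte offsets. Invariant: the first page found at or
       after 'lo' ends before the target sample; the first page found
       at or after 'hi' ends at or after it (or doesn't exist). On
       convergence, 'lo' is the start of the page just before the one
       containing the target, so the page at/after 'hi' is the one to
       decode, pre-rolling frames as described above. */
    static int vorbis_seek_page(const uint8_t *file, size_t len,
                                int64_t target, size_t *result)
    {
       size_t lo = 0, hi = len;
       while (hi - lo > 1) {
          size_t mid = lo + (hi - lo) / 2, start;
          int64_t g;
          if (vorbis_find_page(file, len, mid, &start, &g) && g < target)
             lo = mid;          /* page ends before target: go right */
          else
             hi = mid;          /* page ends at/after it: go left    */
       }
       int64_t g;
       return vorbis_find_page(file, len, hi, result, &g);
    }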
Now you have to figure out how many frames to skip, and how to start things up so the decoder recovers, but it should basically just work, with a bunch of tedious partial decoding to figure out the frame lengths without actually fully decoding the audio data. (If you didn't care about efficiency, you'd just go ahead and decode it all and suppress it, naturally.)

Oh yeah, except I forgot to mention one more detail. There's another flag in the page header, which indicates whether a frame is continued from the previous page. Yeah, what, I never mentioned that before? Right, sorry. So yeah, for no explicable reason, the Ogg spec allows the last packet of data on a page (the packet being the data that decodes one frame) to be split onto the next page. Each page tells you when it's happening--whether its last packet is continued onto the next page, and whether its first packet continues from the previous one--so detecting it is no problem. The problem is that this means it's not actually true that the first decoded sample of a page is given by the granule position of the previous page. The granule position is the last sample decoded _by_ a given page, which means _not_ counting any packet continued onto the next page. So:

   page with 6 packets

   +------+---+---+-----+--+--+--+
   |header|   |   |     |  |  |  |
   +------+---+---+-----+--+--+--+
                              ^ ^^
                              | ||
                              | |+-- normal granule pos
                              | +--- last-page granule pos
                              +----- incomplete-last-packet granule pos

(I didn't bother drawing in the special-case granule pos for the first page, since in this case it can be interpreted identically to the normal case.)

This leads to...

DUMB IMPLEMENTATION #3

Find the pair of pages that bracket the target sample number, as above. Now skim through all of the first page except any continued last packet. Then start the process from #2, but with this last pseudo-packet. Note that if you have to parse through the page twice (like I do in my code--once to find which frame has the right sample, and another time to start decoding one frame _before_ that), you now have to reparse the first page twice too. (If you want to be efficient, of course, you can have separate cases for whether the first packet is continued or not, but that's a significant complexity increase.)

This works. (I think; I haven't implemented it yet.) Oh yeah, except for one case I left out... you can have a page on which no decoded samples finish! (I guess because the only packet is a continued one... I'm not sure, if you have two packets, and the first continues from the previous page and the second continues onto the next page, whether that first one counts as finishing on this page... I assume so.) In that case, the granule pos is a special magic value (-1), and my binary search logic will totally barf. Which means we actually need yet another implementation...

No, on second thought, fuck you, ogg vorbis seeking.
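(For the record, here's a sketch of the header bits this whole saga hinges on. The flag values are straight from the Ogg spec--byte 5 of the 27-byte page header is a bitfield--but the case-analysis function is just my summary of the above:)

    /* byte 5 of the page header: */
    #define PAGE_FLAG_CONTINUED  0x01  /* first packet continues a packet
                                          from the previous page          */
    #define PAGE_FLAG_FIRST      0x02  /* first page of the stream        */
    #define PAGE_FLAG_LAST       0x04  /* last page of the stream         */

    /* What does this page's granule position actually mean? */
    static const char *granule_meaning(uint8_t flags, int64_t granule)
    {
       if (granule == -1)
          return "no packet finishes on this page; binary search barfs";
       if (flags & PAGE_FLAG_LAST)
          return "total length of the stream in samples";
       if (flags & PAGE_FLAG_FIRST)
          return "special first-page meaning (conflicts if also last!)";
       return "last sample decoded by this page, not counting any packet"
              " continued onto the next page";
    }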
WHY THIS IS SO STUPID

You could, you know, use the flag on the last page to indicate whether or not a 2-byte value appears which gives the number of samples to discard on the last page, without complicating the regularity of the meaning of the granule position. Similarly for the first page. 2 bytes is plenty for Vorbis, but if we pretend that, ok, Ogg is supposed to be more general purpose, then 4 bytes each. I think we can afford an extra 8 bytes _in the entire file_. (The granule position itself is 8 bytes per page.) Similarly, the file could actually have two positions per page--one for the starting sample and one for the ending sample. It could even store one relative to the other to keep the size down.

And we could make up for this space cost by storing the Vorbis packet sizes in a saner way. Ogg uses a bizarre strategy that costs floor(size/255)+1 bytes to store each size. The logic is that if your packets are usually < 255 bytes, this is optimal if the sizes aren't going to be bitpacked. On the other hand, you have to parse out these as-many-as-255 bytes encoding all the sizes in each page just to figure out how big the page is, rather than getting 2 bytes encoding that. (See the segment-table sketch below for how the current encoding works.) Maybe the packet sizes could have been huffman coded or something, which would actually be _more_ efficient for both <255 AND >255-byte packets, and the space saved used for things like more positioning information and the total page size.

Finally, the continued-packet thing is extremely lame. Maybe the intent was really to allow packets larger than a single page (which is limited to 64KB), I'm not sure. libvorbis generally generates 4K pages and still goes ahead and splits up packets at page boundaries sometimes, inexplicably. There's no win to having a packet split across pages, though. You should just extend the previous page to be a little longer, and shorten the next one. Being able to seek into the page a little earlier (into the "middle" of a packet) doesn't actually let you decode any sooner, since now you're in the middle of a packet you don't know how to decode. All it does is force you to rewind by a _whole page_ at times.

And of course, one really useful tactic is opportunistic playback: you seek somewhere, find a page, and start playing it back... it would be nice when you did that if you knew what sample number you were playing. But you can't figure that out until you get to the _end_ of the page (unless you look ahead and work back, which isn't what I mean by opportunistic). For that you really want the granule position to be the first sample of the page, not the last. I already pointed out above that it wouldn't be hard to store both, but in fact just storing the first instead of the last makes a lot more sense in general. (I think the only reason Ogg stores the last is to make the last-page hack sensible--for a savings of 2 bytes per file, hooray!)
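(The segment-table sketch promised above: here's roughly how a reader has to grovel through the lacing values just to learn the packet sizes and the page extent. The encoding is from the Ogg spec; the function is just an illustration, picking up the 'page' pointer from the earlier fragments:)

    #include <stdio.h>

    /* The page header ends with a segment count (byte 26) followed by
       that many "lacing values". Each packet's size is the sum of a run
       of lacing values; a value < 255 ends the packet, and a trailing
       255 means the packet continues onto the next page. */
    static void read_packet_sizes(const uint8_t *page)
    {
       int nsegs = page[26];
       const uint8_t *lacing = page + 27;
       int size = 0;
       for (int i = 0; i < nsegs; ++i) {
          size += lacing[i];
          if (lacing[i] < 255) {
             printf("packet of %d bytes\n", size);
             size = 0;
          }
       }
       if (size > 0)  /* run ended with a 255 */
          printf("packet of %d+ bytes, continued on next page\n", size);
       /* total page size = 27 + nsegs + sum of all lacing values;
          there is no 2-byte field that would have told you that. */
    }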
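(And, purely hypothetically--this is nothing Ogg actually does, and every field name here is made up--the rant's proposal boils down to something like this per page, with the two discard counts appearing only on the pages whose flags call for them:)

    /* Hypothetical page timing info along the lines proposed above. */
    typedef struct {
       uint64_t first_sample;   /* granule = first decodable sample on the
                                   page: known before you decode the page */
       uint16_t page_bytes;     /* total page size, stored directly       */
       /* present only if the first-page flag is set: */
       uint16_t initial_skip;   /* samples to discard at stream start     */
       /* present only if the last-page flag is set: */
       uint16_t final_discard;  /* padding samples to discard at the end  */
    } proposed_page_info;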