sBOX File Format Specification v1.0

Current homepage: http://nothings.org/sbox/sbox.html

Abstract

This document describes and defines sBOX, a meta-file format for creating file formats whose internal contents are indexed by name. sBOX is a simple and carefully engineered file structure that provides a base layer upon which other file formats can build.

1. Introduction
2. Data Representation
3. File Structure
4. Directory Structure
5. Extensions
- 5.1. Copyable Form
- 5.2. Canonical Form
6. Limitations
7. Appendix: Summary of format
8. Appendix: Examples
9. Appendix: Rationale
10. Appendix: Prototypical C language sBOX directory locator
11. Appendix: Resources
- 11.1. C language source code
12. Appendix: Sample File Format
13. Appendix: Version History & Credits
- 13.1. Version History
- 13.2. Credits

1. Introduction

The sBOX file format is a simple, lightweight, carefully defined and engineered meta-file format. It allows creation of various sorts of tagged or indexed file formats layered atop the core meta-format.

sBOX is designed to provide:

A collection of data organized as a series of <name, value> pairs, and an ordering of those pairs.
Fast performance for locating a value given a name, assuming that names are relatively small compared to values.

sBOX does not provide:

A mechanism for validating the integrity of contents (e.g. CRC).
An efficient disk-based indexing scheme (e.g. B-trees)--it is assumed the entire index will be stored in memory, or if a disk-based index is needed, it will be generated at run-time, rather than being part of the file format.

sBOX is designed to be used in write-once read-many applications, where data must be accessed in a random order. sBOX is designed to supply the file structure, while the sBOX client defines the actual content of the file format.

sBOX is less useful for a sequential-read data file format, where a simple linear sequence of names and values will suffice. It can be used for write-many file formats, but it was not designed to favor that approach.

sBOX is somewhat reminiscent of RIFF (Resource Interchange File Format), defined by Microsoft. However, sBOX is engineered to solve somewhat different problems; for example, RIFF is a sequential format. [The widely deployed WAV audio sample file format is layered on top of RIFF.]

The main part of this specification gives the definition of the file format. An appendix summarizes the file format in a simple table.

Another appendix provides examples of how to use the meta-file-format to construct other file-formats. A further appendix gives the rationale for many design decisions. Although these appendices are not part of the formal specification, reading them can help users understand the design and how it should be used.

See rationale: Why a meta-file format? Why a new file format?

2. Data Representation

All data in an sBOX file consists of either uninterpreted sequences of bytes or 4-byte integers. Integers are stored in "little-endian" order: the least significant byte first, then each of the more significant bytes, in order.
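As an illustration of the integer encoding, here is a minimal sketch in C. The helper names sbox_get_u32 and sbox_put_u32 are hypothetical, and the sketch assumes the integers are unsigned, which the text above does not state explicitly.

```c
#include <stdint.h>

/* Decode a 4-byte little-endian integer: least significant byte first. */
uint32_t sbox_get_u32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Encode v into p[0..3] in little-endian order. */
void sbox_put_u32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xFFu);
    p[1] = (unsigned char)((v >> 8) & 0xFFu);
    p[2] = (unsigned char)((v >> 16) & 0xFFu);
    p[3] = (unsigned char)((v >> 24) & 0xFFu);
}
```

Because the bytes are assembled with shifts, the same code works regardless of the byte order of the machine it runs on.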

See rationale: Byte order

3. File Structure

An sBOX file always contains a header, a tail, a directory, and zero or more data blocks.

3.1. sBOX header

The first twenty-four bytes of an sBOX file constitute the sBOX header. The first sixteen bytes are undefined; any set of values in the first sixteen bytes can still indicate a valid sBOX file.

The following four bytes (the seventeenth through twentieth) contain the sBOX signature, and consist of the following decimal values:

  115 98 48 88

The last four bytes of the sBOX header (the twenty-first through twenty-fourth bytes of the file) are interpreted as an integer value; this value is referred to as Diroff in the remainder of this specification.

See rationale: Why sixteen undefined bytes? sBOX file signature

3.2. sBOX tail

The last eight bytes of an sBOX file constitute the sBOX tail. The location of the tail as an offset from the beginning of the file must be a multiple of four.

The second four bytes (the last four bytes of the file) of the tail must contain the sBOX signature:

  115 98 48 88

If (and only if) the value of Diroff found in the header is 0, then the first four bytes of the tail are treated as an integer and Diroff is understood to be this value, rather than the 0 value found in the header.
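The header and tail rules can be sketched in C as follows. This is an illustration only, not the reference implementation: the function name sbox_resolve_diroff is hypothetical, the whole file is assumed to be in memory, integers are assumed unsigned, and the signature bytes are the decimal values listed in sections 3.1 and 3.2. The 36-byte minimum is the size of the smallest file shown in the examples appendix.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Signature bytes as listed in sections 3.1 and 3.2. */
static const unsigned char SBOX_SIG[4] = {115, 98, 48, 88};

static uint32_t get_u32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Returns 1 and stores the effective Diroff in *diroff when the header
 * and tail signatures are present; returns 0 otherwise.
 * file points at the n bytes of the whole file.
 */
int sbox_resolve_diroff(const unsigned char *file, size_t n, uint32_t *diroff)
{
    uint32_t d;

    /* The tail starts at n-8, which must be a multiple of four. */
    if (n < 36 || n % 4 != 0)
        return 0;
    if (memcmp(file + 16, SBOX_SIG, 4) != 0)     /* header signature */
        return 0;
    if (memcmp(file + n - 4, SBOX_SIG, 4) != 0)  /* tail signature */
        return 0;

    d = get_u32(file + 20);                      /* Diroff in the header */
    if (d == 0)
        d = get_u32(file + n - 8);               /* consulted only when the header holds 0 */
    *diroff = d;
    return 1;
}
```

The value returned still has to be checked against the directory constraints before it is used.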

See rationale: Why a tail? Why two Diroffs? Is the tail really eight bytes?

3.3. sBOX directory header

The value of Diroff (which is defined in either the header or the tail) is understood to be the file offset of the directory. The value must be a multiple of four.

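Additionally, Diroff must be greater than or equal to twenty-four (the header occupies the first twenty-four bytes of the file and cannot overlap the directory), and must be less than or equal to twelve less than the length of the file.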

The first four bytes found at the location Diroff must contain the sBOX signature:

  115 98 48 88

The directory proper contains the names and provides the location of the values. This is described in the next chapter, 4. Directory Structure.
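A sketch of these checks in C follows. The function name sbox_diroff_ok is hypothetical and the file is assumed to be entirely in memory. The lower bound of 24 used here comes from the rule in section 3.5 that the 24-byte header cannot overlap the directory; it also satisfies the lower bound stated above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Signature bytes as listed in section 3.3. */
static const unsigned char SBOX_SIG[4] = {115, 98, 48, 88};

/*
 * Returns 1 if diroff is a plausible directory location in a file of
 * n bytes, 0 otherwise. Dirsize is not examined here.
 */
int sbox_diroff_ok(const unsigned char *file, size_t n, uint32_t diroff)
{
    if (n < 36)                /* smallest possible sBOX file */
        return 0;
    if (diroff % 4 != 0)       /* must be a multiple of four */
        return 0;
    if (diroff < 24)           /* would overlap the 24-byte header */
        return 0;
    if (diroff > n - 12)       /* at most twelve less than the file length */
        return 0;
    return memcmp(file + diroff, SBOX_SIG, 4) == 0;  /* directory signature */
}
```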

3.4. Data Blocks

Note: This section does not actually specify anything; it merely provides some context.

The value fields of each <name, value> pair are stored in data blocks which can be located anywhere in the file. The directory specifies the location and length of these blocks.

3.5. Miscellaneous File Layout Information

The sBOX file format is specified in terms of the data required to parse it. No particular constraints are placed on the general layout of data; e.g. data blocks can overlap or contain each other; a data block can overlap the directory, the header, or the tail. There can be space in the file which does not belong to the header, tail, directory, or any data block. The only explicit constraints on file layout, besides the precise location of the header and tail, are:

The header cannot overlap the directory.
The directory cannot overlap the defined values in the tail.

The extra terminology about the directory-tail interaction refers to the fact that the Diroff value in the tail isn't actually used if the Diroff in the header is non-zero.

See rationale: Why allow overlapping blocks?

4. Directory Structure

The directory header consists of eight bytes starting at the location Diroff. The first four bytes are the directory signature. The fifth through eighth bytes are interpreted as an integer value called Dirsize; it must be a multiple of four.

The directory proper begins at location Diroff+8. It is exactly Dirsize bytes long. Each <name, value> pair in the file has a single entry in the directory. The number of items in the file can be inferred from the directory, but is not stored explicitly.

The directory consists of a sequence of directory entries, each stored consecutively.

A directory entry consists of four fields, plus padding:

- value location: The location of the value in the file.
- value size: The length of the value.
- name size: The length of the name.
- name data: The bytes of the name itself.
- padding: 0 to 3 padding bytes (value must be 0) which pad the length of the directory entry to be a multiple of four.

The first three fields are integers. Thus, within each directory entry, the offset of the value location is 0, the offset of the value size is 4, the offset of the name size is 8, and the offset of the name data is 12.

The next directory entry appears immediately after the padding, in other words, at a relative offset of 12+Namesize+padding-length. See the summary table for an explicit representation of the length of the padding.

The final directory entry must end exactly at the end of the directory (that is, Dirsize bytes after the beginning).

Names can contain any sequence of byte values. Names need not be unique. Names need not be in any particular order.
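The directory walk described in this chapter can be sketched in C as below. This is an illustration under stated assumptions, not the reference code: the sbox_entry struct and sbox_find function are hypothetical names, the directory proper (the Dirsize bytes starting at Diroff+8) is assumed to be in memory already, and integers are assumed unsigned. Because names need not be unique, the sketch returns the first match in directory order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t get_u32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* One decoded directory entry (hypothetical struct, for illustration). */
typedef struct {
    uint32_t value_offset;      /* location of the value in the file */
    uint32_t value_size;        /* length of the value */
    uint32_t name_size;         /* length of the name */
    const unsigned char *name;  /* points into the directory buffer */
} sbox_entry;

/*
 * Look for the first entry named name[0..name_size).
 * Returns 1 and fills *out on a match, 0 if no entry matches, and -1 if
 * a malformed entry is met before a match is found.
 */
int sbox_find(const unsigned char *dir, uint32_t dirsize,
              const void *name, uint32_t name_size, sbox_entry *out)
{
    uint32_t pos = 0;

    if (dirsize % 4 != 0)
        return -1;
    while (pos < dirsize) {
        uint32_t remaining = dirsize - pos;
        uint32_t nsize, pad;

        if (remaining < 12)                 /* no room for the three integers */
            return -1;
        nsize = get_u32(dir + pos + 8);
        pad = (0u - nsize) & 3u;            /* 0..3 padding bytes */
        if (nsize > remaining - 12 || pad > remaining - 12 - nsize)
            return -1;                      /* entry runs past the directory */
        if (nsize == name_size &&
            (nsize == 0 || memcmp(dir + pos + 12, name, nsize) == 0)) {
            out->value_offset = get_u32(dir + pos);
            out->value_size = get_u32(dir + pos + 4);
            out->name_size = nsize;
            out->name = dir + pos + 12;
            return 1;
        }
        pos += 12 + nsize + pad;            /* next entry follows the padding */
    }
    return 0;
}
```

A reader that needs the item count can increment a counter in the same loop, since the count is not stored in the file.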

See rationale: Why an explicit name size? Why padding bytes? Why a directory size instead of a directory count? Why aren't names constrained?

This completes the specification of the core sBOX file format.

5. Extensions

This section defines several possible formal properties of sBOX files. These definitions may be useful in defining derived file formats.

5.1. Copyable Form

An sBOX file is said to be in copyable form if it can be safely copied by a generic sBOX copier. Whether a file is in copyable form is determined both by its physical layout, and by certain semantic qualities. Without knowledge of the semantics, it is impossible to say whether a given file is copyable or not.

All blocks must be non-overlapping.
The file should not contain any semantically meaningful data in the "dead areas" of the file (that is, in the portions of the file that are neither the header, tail, directory, nor one of the data blocks defined by the directory).
The file should not contain any references to absolute locations within the file, except those defined by the sBOX format. (References within a block, by offset from the beginning of the block, are allowed.)

Essentially, a file in copyable form will still "contain the same information" if it is copied by a file copier which only copies the data exposed via the sBOX interface.

It is strongly recommended that derived file types require the file format be copyable.

5.2. Canonical Form

An sBOX file is said to be in canonical form if it obeys the following list of constraints.

The file is copyable.
Diroff is defined in the header.
The value of Diroff is 24.
The data blocks appear in the same order in the file that they are referenced in the directory. In other words, the directory entries are sorted in order of their value's file offset.
The first data block appears immediately after the end of the directory (at offset 32+Dirsize)
After each data block there are 0..3 bytes, zero-valued, sufficient to "pad the data block" to a 4-aligned address (an address which is a multiple of four). The next block (or the tail) starts at this 4-aligned address.

The "canonical form" for a given sBOX file is unique; any and all sBOX writers/copiers should produce an identical canonical-form file given the same ordered <name, value> pairs.
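The canonical layout can be computed before any bytes are written. The sketch below is one way to do so in C; the function name sbox_canonical_layout is hypothetical. It assumes that in canonical form the tail that follows the last block is the 4-byte signature alone (Diroff being in the header), which matches the file sizes of the canonical examples in the examples appendix, and it ignores overflow past the 4-byte integer range.

```c
#include <stddef.h>
#include <stdint.h>

/* Round x up to the next multiple of four. */
static uint32_t align4(uint32_t x)
{
    return (x + 3u) & ~3u;
}

/*
 * Given the name and value sizes of count items, in directory order,
 * store Dirsize in *dirsize_out, store each value's file offset in
 * value_offset[i], and return the total length of the canonical file.
 */
uint32_t sbox_canonical_layout(const uint32_t *name_size,
                               const uint32_t *value_size,
                               size_t count,
                               uint32_t *value_offset,
                               uint32_t *dirsize_out)
{
    uint32_t dirsize = 0;
    uint32_t pos;
    size_t i;

    for (i = 0; i < count; i++)
        dirsize += 12 + align4(name_size[i]);  /* three integers + padded name */

    pos = 32 + dirsize;        /* 24-byte header, 8-byte directory header, directory */
    for (i = 0; i < count; i++) {
        value_offset[i] = pos;
        pos += align4(value_size[i]);          /* zero padding to a 4-aligned address */
    }

    *dirsize_out = dirsize;
    return pos + 4;            /* tail signature */
}
```

For a single item with a 4-byte name and a 1-byte value this gives a Dirsize of 16, a value offset of 48, and a file length of 56.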

It is recommended that if a derived file format wishes to require a single fixed format (e.g. because it is desired that file-compares indicate whether file "contents" are identical), then the canonical form should be used.

See rationale: Canonical form

6. Limitations

The sBOX format is limited to 4G files. (A 64-bit version of sBOX which uses 8-byte integers and 8-byte alignment would be easy to specify.)

The sBOX format only provides gross structuring mechanisms. The content of data blocks is left entirely to the handling of clients/applications. For example, clients must deal with byte ordering issues of the content of the name and data blocks.

The copyable format (and hence the canonical format) puts obvious and relatively intuitive constraints on the sorts of data that can appear in a file. However, this may be at odds with other constraints. For example, a file format which wants to be robust in the face of imperfect transmission might want to provide redundant offset information which sBOX does not allow. It might want to escape certain byte sequences to guarantee they only happen in controlled situations. sBOX does not and cannot allow these sorts of restrictions. Most of the time, however, such a file format will want to be a streamable, sequential format anyway, in which case sBOX is a poor match in the first place.

7. Appendix: Summary of Format

Integers are stored in little-endian form.

sBOX FORMAT
offset	length	value
0	16	any sixteen bytes
16	4	signature: "sb0X" in ASCII
20	4	0 or Diroff (offset of directory)
Diroff	4	signature : "sb0X" in ASCII
Diroff+4	4	Dirsize (size of directory)
Diroff+8	Dirsize	directory values (see table below)
n-8	4	Diroff (optional)
n-4	4	signature: "sb0X" in ASCII

Here n is the length of the file. n must be a multiple of 4.

The second to last entry, a 4-byte Diroff, is only required if the value at offset 20 is 0. If it is non-zero, then the n-8 optional-Diroff should never be tested by a file reader, and whether it is present or not is irrelevant.

The directory consists of sequential variable-length items, each in the following format:

Directory Entry
offset	length	value
0	4	offset of item's data
4	4	size of item's data
8	4	Namesize (size of item's name)
12	Namesize	item's name
12+Namesize	(-Namesize)&3	0..3 padding bytes (all 0)

Here (-Namesize)&3 is the C code to compute the following function which pads the name to a multiple of four:

Let b1b0 be the bottom two bits of the binary representation of Namesize.

Namesize mod 4	b1b0	(-Namesize)&3
0	00	0
1	01	3
2	10	2
3	11	1
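The padding rule in the table can be checked with a few lines of C. The function names below are hypothetical. The sketch uses unsigned arithmetic, writing 0u - namesize rather than a unary minus so that the result is well defined modulo 2^32 and compilers do not warn about negating an unsigned value.

```c
#include <stdint.h>

/* Number of zero bytes that follow a name of the given size (0..3). */
uint32_t sbox_name_padding(uint32_t namesize)
{
    return (0u - namesize) & 3u;
}

/* Total size of a directory entry: three integers, the name, the padding. */
uint32_t sbox_entry_size(uint32_t namesize)
{
    return 12u + namesize + sbox_name_padding(namesize);
}
```

The entry size is always a multiple of four, so the next entry's integer fields stay 4-byte aligned relative to the start of the directory.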

8. Appendix: Examples

8.1. Minimal sBOX File

The smallest possible sBOX file is the following:

       ??  ??  ??  ??
       ??  ??  ??  ??
       ??  ??  ??  ??
       ??  ??  ??  ??
      115  66  48  88        header signature
       24   0   0   0        Diroff
      115  66  48  88        directory signature
        0   0   0   0        Dirsize
      115  66  48  88        tail signature

This file happens to be in canonical form.

The minimal sBOX file with Diroff in the tail is:

       ??  ??  ??  ??
       ??  ??  ??  ??
       ??  ??  ??  ??
       ??  ??  ??  ??
      115  66  48  88        header signature
        0   0   0   0        Diroff
      115  66  48  88        directory signature
        0   0   0   0        Dirsize
       24   0   0   0        Diroff
      115  66  48  88        tail signature

A minimal sBOX file with a single directory entry is:

       ??  ??  ??  ??
       ??  ??  ??  ??
       ??  ??  ??  ??
       ??  ??  ??  ??
      115  66  48  88        header signature
       24   0   0   0        Diroff
      115  66  48  88        directory signature
       12   0   0   0        Dirsize
        0   0   0   0        location of first value
        0   0   0   0        length of first value
        0   0   0   0        length of first name
      115  66  48  88        tail signature

Note that, consistent with the padding rules, a 0-byte name has 0 bytes of namedata and 0 bytes of padding.

A minimal sBOX file with a non-trivial <name, value> pair is:

       ??  ??  ??  ??
       ??  ??  ??  ??
       ??  ??  ??  ??
       ??  ??  ??  ??
      115  66  48  88        header signature
       24   0   0   0        Diroff

      115  66  48  88        directory signature
       16   0   0   0        Dirsize
       48   0   0   0        location of first value
        1   0   0   0        length of first value
        4   0   0   0        length of first name
       65  66  67  68        name: ABCD

      255                    value: a one-byte value 255
            0   0   0        padding to align the block

      115  66  48  88        tail signature

This file is the canonical form for the file containing the single <name,value> pair that could be notated in C as { "ABCD", "\377" }.

The above example has been divided into four sections: the header, the directory, a data block, and the tail. As you can see, there are no markers or dividers indicating the boundaries between sections of the file; it is all implicitly described by the directory and Diroff. (The signature '115 66 48 88' appears at the head of three of the sections, but it can appear anywhere in data as well, so does not provide a criterion for detecting boundaries, only for validation.)

8.2. Adding a signature to a derived file format

Many file formats need their own signature so that a file reader can be sure it is reading the correct file type, and not some other sBOX file format.

A derived file format simply defines a fixed set of values for the first sixteen bytes of the sBOX file. Derived file formats should not put any file-specific data in this header--flag bits, size information, and any such should all appear in data, to maximize compatibility with generic sBOX applications.
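As a small illustration in C, a derived format might test and write its own signature as follows. The format name, the sixteen byte values, and the function names are all invented for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 16-byte signature of an imaginary derived format. */
static const unsigned char MYFMT_SIG[16] = {
    'M', 'Y', 'F', 'O', 'R', 'M', 'A', 'T', 0, 0, 0, 0, 0, 0, 0, 0
};

/* Returns 1 if the file begins with the derived format's signature. */
int myfmt_check(const unsigned char *file, size_t n)
{
    return n >= 16 && memcmp(file, MYFMT_SIG, 16) == 0;
}

/* Writes the signature into the sixteen bytes that sBOX leaves undefined. */
void myfmt_stamp(unsigned char *file)
{
    memcpy(file, MYFMT_SIG, 16);
}
```

A reader for the derived format would run this check in addition to the sBOX signature checks, which do not depend on these sixteen bytes.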

8.3. Using More than Two Fields

Some file formats might want to have more than just <name, value> pairs; they might want a triplet of values, or more.

This is easy to do; it just requires reconceptualizing name and value. Because the name and the value are both uninterpreted byte streams from the point of view of generic sBOX code, there is no reason they can't each represent more than one field, i.e. by just concatenating such fields together.

Simply let name be all the information you want access to "instantly", and have value incorporate all information that need not be available after only loading the directory.

Alternately, let the name include any information which you might want to search or index on.

Alternately, let the name include only the smallest pieces of data.

The name (and the value) can easily encode any number of fixed-length fields and a single variable-length field by placing the fixed-length fields first. It can also encode several variable-length fields if those variable-length fields have their lengths explicitly encoded. With multiple variable length fields, you may need a mini-directory of them to accelerate access. (You could even make each name or value be an sBOX file itself!)
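A sketch of the "fixed-length fields first" layout in C follows. The layout is invented for illustration: a 4-byte type id followed by a variable-length label. The function names are hypothetical, and the little-endian encoding of the type id is the client's own choice here, not something sBOX imposes on name contents. The length of the label needs no explicit encoding because it is implied by the sBOX name size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t get_u32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void put_u32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xFFu);
    p[1] = (unsigned char)((v >> 8) & 0xFFu);
    p[2] = (unsigned char)((v >> 16) & 0xFFu);
    p[3] = (unsigned char)((v >> 24) & 0xFFu);
}

/*
 * Build a name holding a fixed-length type id followed by a label.
 * Returns the name size, or 0 if out (cap bytes) is too small.
 */
size_t pack_name(unsigned char *out, size_t cap, uint32_t type_id,
                 const char *label, size_t label_len)
{
    if (cap < 4 || label_len > cap - 4)
        return 0;
    put_u32(out, type_id);
    if (label_len != 0)
        memcpy(out + 4, label, label_len);
    return 4 + label_len;
}

/* Split a name back into its two fields. Returns 0 if it is too short. */
int unpack_name(const unsigned char *name, size_t name_size,
                uint32_t *type_id, const unsigned char **label,
                size_t *label_len)
{
    if (name_size < 4)
        return 0;
    *type_id = get_u32(name);
    *label = name + 4;
    *label_len = name_size - 4;   /* implied by the sBOX name size */
    return 1;
}
```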

The content of name and value are entirely defined by the client of sBOX, and anything desired can be done with them.

8.4. Checksums / CRC

Some applications might desire to provide checksums, for example a 32-bit CRC, so as to make it possible to validate the integrity of the file.

There are a number of possible approaches.

A checksum can be prefixed "outside" the sBOX proper, wrapping the entire sBOX inside another file. However, this is not recommended, as it simply increases the code necessary to use the file format, instead of leveraging the structure provided by sBOX. It also means generic sBOX-file handling code cannot process the file.

Each data item could compute a separate checksum, and that could be incorporated as a field in the name of the object (see the multi-field examples). A separate checksum of the directory might still be wanted, but this is the recommended approach, as it is robust with generic sBOX operations (e.g. deleting a <name,value> will still let other blocks checksum correctly).

The entire file could be checksummed. That checksum could be stored in a data block, with a special name denoting it. However, one would probably not want to include that data block in the checksum, as it would make generating the checksum difficult.

Instead of checksumming the entire file, only the "sBOX contents" of the file could be checksummed--i.e., the contents as exposed through the sBOX interface. This has the advantage that the file would be copyable without invalidating the checksum.

Instead of storing the checksum contents in a data block, they could be stored as a name, e.g. with a prefix indicating it is a checksum.

A checksum of each of the data items could be made and stored as an array of checksums in yet another data item.
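The recommended approach above, a per-item checksum carried as a field in the name, might look like the following in C. Everything about the convention is an assumption made for this sketch: the first four bytes of each name hold the little-endian CRC-32 of the item's value, and the function names are hypothetical. The CRC routine is the common reflected CRC-32 with polynomial 0xEDB88320, computed bit by bit for brevity rather than speed.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t get_u32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
uint32_t crc32_ieee(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int k;

    for (i = 0; i < len; i++) {
        crc ^= data[i];
        for (k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Returns 1 if the checksum stored in the first four bytes of the name
 * matches the value, 0 otherwise (including names shorter than four bytes).
 */
int item_checksum_ok(const unsigned char *name, uint32_t name_size,
                     const unsigned char *value, uint32_t value_size)
{
    if (name_size < 4)
        return 0;
    return get_u32(name) == crc32_ieee(value, value_size);
}
```

Since each item is verified on its own, deleting or reordering <name, value> pairs with generic sBOX tools leaves the remaining checksums valid.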

8.5. Avoiding Namespace Collisions

Some file formats might not want to use a fixed set of names to refer to data; they might be generated dynamically, or even chosen by the user. Additionally, they might want to store fixed-name data, without the possibility of the dynamically-generated or user-chosen names causing confusion with the fixed names.

There are several approaches which can be taken to implement this.

Explicit Namespaces

One approach is to prepend characters to the name so as to indicate a namespace. A verbose version of this would be to prefix internal names with the string "internal/" and external names with the string "external/"; thus if the user attempted to create a data block called "head", it would actually be called "external/head"; meanwhile, the file format could manage some general-purpose header data in a value named by "internal/head".

In its most concise form, a single character suffices to distinguish both. For example, user-generated names could always be prepended with "_", while internal fixed names can be chosen to never start with "_" (or explicitly prefixed with some other character if so desired).

The above approaches might seem wasteful (although the concise form costs only one byte per name), or perhaps clumsy. They have the "advantage" of allowing fixed names and generated names to be intermixed in a directory without there being any possible confusion between them. An alternate approach is to disallow that possibility, and make use of the structure of the directory.

Directory Partitioning

If a file format always has a certain number of fixed names that will appear in it, the file format can say that those names always appear first, and then all the remaining names in the directory are "generated" names. Then, when searching for such generated names, only the relevant part of the directory should be searched.

If the number of such names isn't known in advance, then the location of the first generated-name in the directory can be explicitly noted, either by storing its index number within the directory in a data block (or in a name), or by using a special name to indicate the split point.

For example, I might have a file format in which any of the following standard predefined names can appear:

     header
     author
     summary
     creation date

I can then require that all such internal names appear first. After them appears the name "general data", and all names after it are interpreted as being in the "external" namespace.

Using such an approach requires writing special directory parsing code that keeps track of the split point.
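Such parsing code might be sketched in C as follows. The function name find_external is hypothetical, the directory proper is assumed to be in memory, and the marker name is passed in by the caller (for the example above it would be "general data"). Only entries that come after the marker are considered, so an internal name and an external name with the same bytes cannot be confused.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t get_u32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static int same_name(const unsigned char *a, uint32_t alen,
                     const void *b, uint32_t blen)
{
    return alen == blen && (alen == 0 || memcmp(a, b, alen) == 0);
}

/*
 * Look up an external name, i.e. one that appears after the marker entry.
 * Returns 1 and stores the value's offset and size on a match, 0 if the
 * name (or the marker) is absent, -1 if a malformed entry is met first.
 */
int find_external(const unsigned char *dir, uint32_t dirsize,
                  const void *marker, uint32_t marker_len,
                  const void *name, uint32_t name_len,
                  uint32_t *value_offset, uint32_t *value_size)
{
    uint32_t pos = 0;
    int past_marker = 0;

    while (pos < dirsize) {
        uint32_t remaining = dirsize - pos;
        uint32_t nsize, pad;
        const unsigned char *ename;

        if (remaining < 12)
            return -1;
        nsize = get_u32(dir + pos + 8);
        pad = (0u - nsize) & 3u;
        if (nsize > remaining - 12 || pad > remaining - 12 - nsize)
            return -1;
        ename = dir + pos + 12;

        if (past_marker) {
            if (same_name(ename, nsize, name, name_len)) {
                *value_offset = get_u32(dir + pos);
                *value_size = get_u32(dir + pos + 4);
                return 1;
            }
        } else if (same_name(ename, nsize, marker, marker_len)) {
            past_marker = 1;    /* the split point: external names follow */
        }
        pos += 12 + nsize + pad;
    }
    return 0;
}
```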

File Substructuring

Rather than write such partitioning code, if you are willing to pay an extra 36 bytes, you can simply nest sBOX files. The reference sBOX reader implementation makes this very simple to read. (However, the reference implementation doesn't support writing them directly.)

One way to do this would be to build an sBOX file which has two <name, value> pairs, one named "internal" and one named "external". The value of the one named "internal" would be an sBOX "file" which has the pre-defined fields; the one named "external" would have the generated or user-defined fields.

On the other hand, it might be slightly simpler, and would save a little space, to instead have the main sBOX file able to store all of the pre-defined fields; then the name "external" or "general data" could be used to label a data block which is an sBOX file whose contents are the user-defined names and their associated values.

Note that sBOX files with nested contents can still be both copyable and canonical. On the other hand, a generic sBOX reader doesn't have any way of knowing that a given value denotes an sBOX sub-file, and will not automatically handle it; the generic sBOX-processing tools will not do the "right thing" when operating on substructured files. For example, if you use an sBOX file with only two pairs, one named "internal" and the other "external", and you attempt to "concatenate" two such files, you'll end up with an sBOX file with four names, two each of "internal" and "external". You have to write a smart merger that knows about the substructures and concatenates each of them separately.

8.6. In-Place Modification

sBOX is designed to favor write-once, read-many applications. Nonetheless, it is possible to make in-place modifications to an sBOX file.

Data which changes without changing size can simply be rewritten in place.

If a data block needs to get bigger, append a new copy of the data block at the end of the file, and abandon the old one; update the location of this block in the directory.

If the directory needs to grow (either due to a new item, or an existing item being renamed to a longer name), write a new copy of the directory at the end of the file, and update Diroff.

Append-only Modification

It's possible to define a writeable sBOX file format in which all modifications occur by appending new data; no data in the file ever need be rewritten.

When initially creating such a file, make sure the value of Diroff in the header is 0, so the value in the tail will always be used by readers to determine where the directory is.

When data changes, append a new block of data at the end of the file (after the existing directory and tail). When all changes have been made to the file, write out a new copy of the directory with the appropriate changes, and a new tail with the address of this new directory. Old data, the old directory, and the old tail all become "dead areas" in the file.

[Recopying this file with a generic sBOX copy program provides an easy way to remove the dead storage.]
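The final step of an append-only update, writing the new directory and the new tail, can be sketched in C as below. The function name sbox_append_directory is hypothetical and the sketch works on an in-memory image for simplicity; a real writer would append to the file instead. It assumes the header Diroff is 0, that the caller has already appended any new data blocks and built the new directory proper, and that all sizes fit in 4-byte integers. The signature bytes are those listed in section 3.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Signature bytes as listed in section 3. */
static const unsigned char SBOX_SIG[4] = {115, 98, 48, 88};

static void put_u32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xFFu);
    p[1] = (unsigned char)((v >> 8) & 0xFFu);
    p[2] = (unsigned char)((v >> 16) & 0xFFu);
    p[3] = (unsigned char)((v >> 24) & 0xFFu);
}

/*
 * buf holds len bytes of existing content and has room for cap bytes.
 * dir is the new directory proper (dirsize bytes, a multiple of four).
 * Appends alignment bytes, the directory header, the directory, and a
 * new 8-byte tail. Returns the new file length, or 0 on failure.
 */
size_t sbox_append_directory(unsigned char *buf, size_t len, size_t cap,
                             const unsigned char *dir, uint32_t dirsize)
{
    size_t diroff = (len + 3u) & ~(size_t)3u;      /* Diroff must be a multiple of four */
    size_t end = diroff + 8 + (size_t)dirsize + 8; /* directory header + directory + tail */

    if (dirsize % 4 != 0 || end > cap)
        return 0;

    memset(buf + len, 0, diroff - len);            /* dead bytes up to the aligned offset */
    memcpy(buf + diroff, SBOX_SIG, 4);             /* directory signature */
    put_u32(buf + diroff + 4, dirsize);            /* Dirsize */
    if (dirsize != 0)
        memcpy(buf + diroff + 8, dir, dirsize);
    put_u32(buf + diroff + 8 + dirsize, (uint32_t)diroff);  /* Diroff in the new tail */
    memcpy(buf + diroff + 12 + dirsize, SBOX_SIG, 4);       /* tail signature */
    return end;
}
```

Nothing before the old end of the file is touched, so the old directory and old tail simply become dead areas.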

9. Appendix: Rationale

9.1. Why a meta-file format?

A lot of file formats have been written during the relatively brief history of modern computers. sBOX attempts to provide a single underlying structure which could be shared by numerous unrelated file formats. It allows a file format to separate out its notion of structure (using a generic, reusable structuring methodology defined by sBOX) from its notion of content. This makes implementation of new file formats somewhat simpler--design time is reduced, and code can be reused. Additionally, for many derived file types, generic sBOX writing code will provide all necessary file-writing functionality.

Furthermore, it is possible to make a file reader that can read any sBOX file; this is useful for writing validators, canonicalizers, and other tools that will work on any such files. (This latter notion was inspired by SGML.)

9.2. Why a new file format?

sBOX was designed to meet a set of constraints which I needed met, but for which I couldn't find any existing file format. I then tried to generalize sBOX as much as possible, within those constraints.

Most of the other rationales describe the constraints which sBOX was engineered to work within. Additionally, sBOX is designed to reasonably minimize storage overhead--overhead is 20 bytes in addition to the sixteen undefined header bytes (36 bytes in all, the size of the minimal file), plus 12 per item (plus alignment padding).

9.3. Byte order

A single byte order needed to be selected for sBOX. While there are good arguments for favoring "big endian", I chose "little endian", primarily because the sBOX file signature, when viewed as a big-endian number, is a multiple of four, and thus could be misinterpreted as a valid directory location or directory size.
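The arithmetic behind this can be verified with a short C sketch. The function names are hypothetical and the signature bytes are the decimal values listed in section 3.1. Only the last signature byte (88) decides the big-endian case and only the first (115) decides the little-endian case.

```c
#include <stdint.h>

/* Signature bytes as listed in section 3.1. */
static const unsigned char SBOX_SIG[4] = {115, 98, 48, 88};

/* Interpret four bytes as a little-endian integer (first byte least significant). */
uint32_t u32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Interpret four bytes as a big-endian integer (first byte most significant). */
uint32_t u32_be(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 1 if the signature, read in the given order, is a multiple of four. */
int sig_is_multiple_of_four(int big_endian)
{
    uint32_t v = big_endian ? u32_be(SBOX_SIG) : u32_le(SBOX_SIG);
    return v % 4 == 0;
}
```

Read as big-endian the low byte is 88, a multiple of four; read as little-endian the low byte is 115, so the signature can never pass for a Diroff or Dirsize, both of which must be multiples of four.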

The byte order is entirely hidden from client code. A file format built on top of sBOX can use any byte ordering of its data that it wants; the sBOX library code will deal with sBOX's byte ordering, and leave the application to cope with the content's byte ordering itself.

9.4. sBOX file signature

The sBOX file signature was chosen to be readable, yet unlikely to appear in a file accidentally. (The 'O' in sBOX is encoded as a '0'.) The reappearance of the signature in the directory makes it unlikely that a random file might be misinterpreted as an sBOX file.

Since derived file formats will probably define their own signatures anyway, other constraints that such file formats would want a signature to meet can be placed on their own signature. Adding a file format signature is discussed in the examples.

9.5. Why a tail?

The tail of an sBOX file serves two purposes. The first is to provide an easy way of detecting if the file has been accidentally truncated, which is the most likely way for a file to become corrupted.

The second purpose is described in the next rationale.

9.6. Why two Diroffs?

The existence of two locations that can store Diroff may seem like it provides an unnecessary complication for sBOX readers. Each of the possible locations has its own advantage:

Putting Diroff in the header has the advantage that it can be read at the same time the initial file signature is read, thus minimizing the amount of time necessary to find the directory and thus access the data.
Putting Diroff in the tail allows a file writer to write an sBOX file in a single pass without rewinding, which is useful if the writer is outputting to a streaming interface which doesn't allow rewinding. It also makes it possible to define a file format in which a file can be modified in-place using only append operations.

The upshot is this: it's easy to code a reader to handle both. If you want the performance benefit of having the offset in the header, simply make sure you use a writer which places it there. When defining a new file format using sBOX as a base, you can always require that it have a valid Diroff in the header.

The only case where it's not easy for a reader to handle both is with a "streaming" reader (one which can't access the file randomly), since it can't get at the location in the tail without reading the entire file. However, sBOX is explicitly not designed to support single-pass reading (after all, using a directory just makes single-pass reading harder anyway), so this concern is irrelevant. Use a different file format (or restrict your use to the canonical form) if you want single-pass reads.

9.7. Is the tail really eight bytes?

No, not really. If the header contains a non-zero Diroff, then the tail is really only the last four bytes. However, phrasing it as if the tail is always eight bytes simplifies the description (except for the non-overlapping property). It also simplifies the code in the reference implementation.

9.8. Why allow overlapping blocks?

Overlapping blocks are allowed because some derived file formats might wish to allow overlapping blocks. The canonical format disallows overlapping blocks, and the "generic sBOX" tools do not correctly preserve overlapping blocks either. But there's little to be gained by explicitly prohibiting them, especially since the reference reader can read them fine. Since sBOX isn't intended to be a read/write file format, the issue of "what happens when you write to a portion of the file that's shared between multiple data blocks" is irrelevant.

The extended copyable format disallows overlapping blocks, because "naive" file rewriting will not preserve overlapping blocks.

9.9. Why an explicit name size?

Some other file formats use a terminating character to indicate the length. Other formats use fixed length names, such as 11 or even 4 characters. Some formats restrict the character set allowed in names.

A file format derived from sBOX is welcome to make such restrictions on its names if it so desires. The overhead of using an explicit name size is small, and allows other file formats which have arbitrary length names, or names which include all possible characters.

It would be rather silly to restrict the maximum length name or pick a fixed-length size, since there are definitely file formats that can be built on sBOX which require long names (e.g. file archivers or applications which wish to pack multiple fields within the sBOX 'name').

9.10. Why padding bytes?

Padding bytes after each directory entry guarantee that if the entire directory is read into memory as a single block, each of the integer fields in the directory will be 4-byte aligned; this is required on some processors and a performance gain on others.

Similarly, padding bytes in the canonical format guarantee that if the entire file is read into memory, each data block will be 4-byte aligned.

9.11. Why a directory size instead of a directory count?

The use of a directory size allows the entire directory to be read into memory in a single pass. Since this is a likely behavior, and the cost is small (4 bytes and the need for the file writer to store it), it is included.

The directory count can be computed while scanning the names, and it is unlikely that there is an application that would need the count but would never scan the names.

An alternate design would include the directory size with the directory offset, e.g. at either the beginning or ending of the file. That way, an sBOX writer wouldn't need to know the directory size before it writes the directory.

I decided that the extra complexity and lack of modularity of this approach wasn't worth the simplification for file writers. An sBOX file writer must write the directory somewhere, and the directory is variable length. Since it can't build the directory on disk at the same time that it builds the data on disk, I didn't see a reasonable scenario under which it wouldn't know (or couldn't easily compute) the entire directory size when it wrote the directory.
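
As an illustration of the single-pass read that a stored directory size makes possible, the following sketch reads a directory of known size with one fread call. The function name read_directory_block is hypothetical, and how the size is obtained from the file is defined by the directory structure, not shown here.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads dirsize bytes starting at the current file position into a
   freshly allocated buffer.  Returns NULL on allocation or read
   failure (or if dirsize is 0).  The caller frees the result. */
unsigned char *read_directory_block(FILE *f, size_t dirsize)
{
   unsigned char *dir;
   if (dirsize == 0) return NULL;
   dir = malloc(dirsize);
   if (dir == NULL) return NULL;
   if (fread(dir, 1, dirsize, f) != dirsize) {
      free(dir);          /* short read: the file is truncated or corrupt */
      return NULL;
   }
   return dir;
}
```

A reader would typically seek to the directory first and then call this once; the entry count can then be computed by scanning the buffer in memory.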

9.12. Why aren't names constrained?

Names are not constrained to be unique or in any particular sort order in an sBOX file.

Such constraints are left to derived file formats, which can impose any such constraints desired. No such constraints are included in the base format for reasons of generality--there are definitely file formats in which sorting is inappropriate, and there are applications in which repeated names are allowed.

9.13. Canonical form

This section provides rationale for why the particular canonical form was chosen, not a rationale for the existence of a canonical form.

Diroff

The canonical form places the directory offset in the header to allow the directory to be read as quickly as possible by readers which don't know the file is in canonical form. (In this case, readers can avoid reading the tail entirely, although at the risk of not immediately detecting file corruption.)

Directory location

A file reader which only handles canonical files (e.g. because a derived file format specifies it) can seek to a fixed location in the file to read the directory--if the directory were last, this would not be the case. In fact, such a file reader can simply read the first 16 bytes to read the header and the directory header all at once.

Placing the directory at the front of the file is advantageous for limited readers. A non-rewinding file reader can still parse the file. Furthermore, if the file is streamed, it is possible to access data without having seen the entire file.

While the sBOX file format has been explicitly designed not to favor no-rewind or streaming formats, it is always possible that a file format derived from sBOX will end up being transmitted over a network or used in some other such environment. In the general case, applications can help these uses by preferring to produce sBOX files with the directory stored first. If a derived file format wishes to enforce a single fixed layout, however, applications do not have any choice; and so the canonical form (which is, after all, merely a recommendation) suggests putting the directory first.

Data alignment

Aligning data blocks allows the entire file to be read into memory without causing alignment problems accessing the data. The argument for this behavior is similar to that in the previous paragraph; it eases the burden on file formats which might desire this property, while causing only minimal overhead on those which do not need it.

Miscellaneous

The remaining properties of the canonical form are either inherently necessary for a canonical form (e.g. non-overlapping blocks), or arbitrarily chosen according to my opinion of what is simplest or most natural (e.g. blocks appearing in the same order as directory order).

9.14. Why sixteen undefined bytes?

The first sixteen bytes of an sBOX file are not defined by the sBOX file format. The first four or sixteen bytes are often used by operating systems as "magic numbers" which indicate the file type. Since many file types may be built on sBOX, it does not make sense to have a single "sBOX file type".

Some applications may be able to operate on any sBOX file type (i.e. generic sBOX processors); however, such programs probably do not need OS-based "magic number" dispatching, and can verify that a file is an sBOX file manually.
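
A derived format can claim the sixteen leading bytes as its own magic number. The sketch below checks them against a 16-byte signature. The function name has_signature is hypothetical; the sample signature in the usage note is the one used by the SBI image format example in this document (15 characters plus the terminating null).

```c
#include <stdio.h>
#include <string.h>

#define SIGNATURE_LEN 16

/* Returns 1 if the first 16 bytes of f equal the 16 bytes at sig,
   else 0 (including when the file is shorter than 16 bytes). */
int has_signature(FILE *f, const void *sig)
{
   unsigned char buf[SIGNATURE_LEN];
   if (fseek(f, 0, SEEK_SET) != 0) return 0;
   if (fread(buf, SIGNATURE_LEN, 1, f) != 1) return 0;
   return memcmp(buf, sig, SIGNATURE_LEN) == 0;
}
```

A reader for the image format might call has_signature(f, "sB0xImageFormat") before looking for the sBOX header that follows at offset 16.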

10. Appendix: Prototypical C language sBOX directory locator

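#include <stdio.h>
#include <string.h>

#define COOKIE_LEN          4
unsigned char cookie[] = "sb0X";

#define test_cookie(str) (!memcmp(str, cookie, COOKIE_LEN))
/* assemble a 4-byte little-endian value; the cast keeps the arithmetic unsigned */
#define little_int(str)  ((((unsigned long)(str)[3]*256+(str)[2])*256+(str)[1])*256+(str)[0])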

enum SBOX_ERROR
{
   SBOX_OK=0,
   SBOX_MISSING_HEADER,
   SBOX_MISSING_TAIL,
   SBOX_BAD_DIRECTORY_OFFSET
};

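int sBOX_verify_and_seek_directory(FILE *f)
{
   unsigned long diroff;
   unsigned char buffer[8];

   /* the header follows the 16-byte signature: cookie, then directory offset */
   if (fseek(f, 16, SEEK_SET))      return SBOX_MISSING_HEADER;
   if (fread(buffer, 8, 1, f) != 1) return SBOX_MISSING_HEADER;
   if (!test_cookie(buffer))        return SBOX_MISSING_HEADER;

   diroff = little_int(buffer+4);

   /* the tail is the last 8 bytes: directory offset, then cookie */
   if (fseek(f, -8, SEEK_END))      return SBOX_MISSING_TAIL;
   if (fread(buffer, 8, 1, f) != 1) return SBOX_MISSING_TAIL;
   if (!test_cookie(buffer+4))      return SBOX_MISSING_TAIL;

   /* a zero offset in the header means the tail holds the real one */
   if (diroff == 0)
      diroff = little_int(buffer);

   if (fseek(f, (long) diroff, SEEK_SET)) return SBOX_BAD_DIRECTORY_OFFSET;
   return SBOX_OK;
}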

11. Appendix: Resources

11.1. C language source code

The author of the sBOX specification has written a "reference" implementation of sBOX reading and writing code in the C language.

All source code is provided free of charge, and may be freely used, modified, and redistributed for any purpose.

Available sources are:

File	Requires	Description
sboxlib (27K)		All of the files below
sboxread		A reference sBOX reader which reads from a FILE *
sboxwrit		A reference sBOX writer which manages a directory in memory and writes it out last.
sboxkit	sboxread, sboxwrit	A utility toolkit layered over sboxread and sboxwrit
sboxlib	sboxread, sboxwrit, and sboxkit	A combined header file which can be used if sboxlib is compiled into a library.
box	sboxlib	Prints out the directory information from an sBOX file, creates a new file, adds an entry to an sBOX file using another file as the data value, renames an entry, or deletes an entry. For maximal instructive purposes, does not use sboxkit.
sboxcan	sboxread	Reads in an sBOX and writes out an sBOX in canonical form.
sboxcat	sboxlib	"Appends" two or more sBOX files while copying them.
sboxsort	sboxlib	Sorts the names in an sBOX file while copying it.
sboxuniq	sboxlib	Deletes non-uniquely-named entries from an sBOX file while copying it.
readme.txt (36K)		Library documentation

sboxcan, sboxcat, sboxsort, sboxuniq are not yet written.

All sample programs copy sBOX files when "editing" them, rather than operating in place. Programs which allow the user to specify names are restricted to names which can be entered by the user on a command line, and cannot include the null character; however, the underlying core code they use is more general.

sboxread is production-quality code. It does complete error checking and reporting. It also supports operating on smaller blocks of a file as if it were the entire file, which is useful for nesting sBOX files within each other, or for prepending additional file signatures. It further supports processing files whose directories are too big to be stored in memory (e.g. because names are actually very large, or because there's a very large number of names); however, it doesn't support efficient processing of such files.

The other programs provide "tool-user" as opposed to "end-user" quality code. For example, whereas sboxread can read any sBOX file, sboxwrit only outputs a single format (and it's not the canonical one), and sboxwrit requires the entire directory fit in memory.

12. SBI: A Simple Image File Format

The following code illustrates an extremely simple image format which can store 24-bit and 32-bit images, and a short program that converts between the two formats, built on top of the reference implementation of sBOX code. Of course this image format is trivial because it is uncompressed, but it shows clearly how using sBOX allows a file format to separate out its notion of content from structure.

On the other hand, I was very lazy coding this. The "right way" to do it would be to store the values read from the header separately, so there's a <"height", value> and a <"width", value> entry, rather than a single 8-byte structure that you have to manually parse--since that's the whole point of sBOX. (I didn't do it that way because the significant overhead for just a 4-byte value bugged me, but there are few enough such values that separate entries are probably the better approach in this case.)

#include <stdio.h>
#include <stdlib.h>
#include "sboxlib.h"

#define HEADER_RGB_888    "sbi 888 RGB image"
#define HEADER_RGBA_8888  "sbi 8888 RGBA image"
#define PIXEL_DATA        "image"

/*
 *  This data structure is used for internal processing,
 *  but is not part of the file format
 */
typedef struct
{
   int w,h;
   unsigned char *pixel_data;
   int number_of_channels;
} ImageData;

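int loadBitmap(char *filename, ImageData *i)
{
   int result;
   int *buf, n;
   // reader handle; the type name SboxReadHandle is assumed here by
   // analogy with SboxWriteHandle, so check it against sboxread
   SboxReadHandle *sbox;
   if (SboxReadOpenFilename(&sbox, filename, "sB0xImageFormat"))
      return 0;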

   // look for an image header with one of the above names, 8 bytes long
   if (SboxkitGetByString(&buf, sbox, HEADER_RGB_888) == 8)
      i->number_of_channels = 3;
   else if (SboxkitGetByString(&buf, sbox, HEADER_RGBA_8888) == 8)
      i->number_of_channels = 4;
   else { SboxReadClose(&sbox); return 0; }

   i->w = buf[0];
   i->h = buf[1];

   // read image data, check that it's as long as expected
   n = i->w * i->h * i->number_of_channels;
   result = (SboxkitGetByString(&i->pixel_data, sbox, PIXEL_DATA) == n);

   SboxReadClose(&sbox);
   return result;
}

// rather than a single function that can write both formats,
// here are two separate functions, one each, to show just how
// simple the code can be.  I made them write out in different
// orders just to make clear that in this application, the order
// isn't relevant (the reader works regardless)

int saveBitmap_RGBA_8888(char *filename, unsigned char *pixels, int w, int h)
{
   int sz[2] = { w,h };
   SboxWriteHandle *s;

   if (SboxWriteOpenFilename(&s, filename, "sB0xImageFormat")) return 0;

   if (SboxkitStringPut(s, HEADER_RGBA_8888, sz, 8    )) goto write_error;
   if (SboxkitStringPut(s, PIXEL_DATA,   pixels, 4*w*h)) goto write_error;
   if (SboxWriteClose(s)) return 0;
   return 1;

  write_error:
   SboxWriteClose(s);
   return 0;
}

int saveBitmap_RGB_888(char *filename, unsigned char *pixels, int w, int h)
{
   int sz[2] = { w,h };
   SboxWriteHandle *s;

   if (SboxWriteOpenFilename(&s, filename, "sB0xImageFormat")) return 0;

   if (SboxkitStringPut(s, PIXEL_DATA,   pixels, 3*w*h)) goto write_error;
   if (SboxkitStringPut(s, HEADER_RGB_888,   sz, 8    )) goto write_error;
   if (SboxWriteClose(s)) return 0;
   return 1;

  write_error:
   SboxWriteClose(s);
   return 0;
}

int main(int argc, char **argv)
{
   ImageData i;
   if (argc != 3) {
      printf("Usage: %s infile outfile\n", argv[0]);
      return 1;
   }

   if (!loadBitmap(argv[1], &i)) {
      printf("'%s' did not exist or is not an SBI file.\n", argv[1]);
      return 2;
   }

   if (i.number_of_channels == 4) {
      int j,n;
      printf("Converting from 32-bit RGBA to 24-bit RGB\n");
      n = i.w * i.h;
      // compact in place; safe because the write index never passes the read index
      for (j=0; j < n; ++j) {
         i.pixel_data[j*3+0] = i.pixel_data[j*4+0];
         i.pixel_data[j*3+1] = i.pixel_data[j*4+1];
         i.pixel_data[j*3+2] = i.pixel_data[j*4+2];
         // trim out the alpha values
      }
      if (!saveBitmap_RGB_888(argv[2], i.pixel_data, i.w, i.h)) {
         printf("Unable to write '%s'.\n", argv[2]);
         return 3;
      }
   } else {
      unsigned char *out;
      int j,n;
      printf("Converting from 24-bit RGB to 32-bit RGBA\n");
      n = i.w * i.h;
      out = malloc(n * 4);
      if (out == NULL) {
         printf("Out of memory.\n");
         return 3;
      }
      for (j=0; j < n; ++j) {
         out[j*4+0] = i.pixel_data[j*3+0];
         out[j*4+1] = i.pixel_data[j*3+1];
         out[j*4+2] = i.pixel_data[j*3+2];
         out[j*4+3] = 255; // opaque
      }
      if (!saveBitmap_RGBA_8888(argv[2], out, i.w, i.h)) {
         printf("Unable to write '%s'.\n", argv[2]);
         free(out);
         return 3;
      }
      free(out);
   }
   return 0;
}
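
The listing above stores the width and height as two raw ints, so the 8-byte header depends on the writer's int size and byte order. One way to avoid that, sketched here under stated assumptions, is to encode each value explicitly as four little-endian bytes, the same byte order the directory locator in section 10 uses for sBOX's own integers. The helpers put_le32 and get_le32 are hypothetical and are not part of sboxlib.

```c
#include <stdint.h>

/* Store v as four bytes, least significant byte first. */
void put_le32(unsigned char *out, uint32_t v)
{
   out[0] = (unsigned char)(v & 0xff);
   out[1] = (unsigned char)((v >> 8) & 0xff);
   out[2] = (unsigned char)((v >> 16) & 0xff);
   out[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Read four bytes, least significant byte first, back into a value. */
uint32_t get_le32(const unsigned char *in)
{
   return (uint32_t)in[0]
        | ((uint32_t)in[1] << 8)
        | ((uint32_t)in[2] << 16)
        | ((uint32_t)in[3] << 24);
}
```

A writer would fill an 8-byte buffer with put_le32(hdr, w) and put_le32(hdr+4, h) and store that buffer as the header value; a reader would recover the two values with get_le32 regardless of the host's byte order.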

13. Appendix: Version History & Credits

13.1. Version History

1999-03-29: Version 0.1 released
2000-08-18: Version 1.0 - introduced the leading 16-byte file signature for derived file formats, and the specification is hopefully final

13.2. Credits

The sBOX File Format Specification was written by Sean Barrett.

The format and some of the language of the sBOX File Format Specification were stolen from the PNG (Portable Network Graphics) Specification, edited by Thomas Boutell.

The End