[TYPES] Type theory of file formats

Jimmy Koppel jkoppel at mit.edu
Sat Aug 20 00:07:38 EDT 2022


There’s a beautiful fact, not often discussed, that the theory of subtyping
explains forwards compatibility between file formats.


For example here’s a simple file format for images:


type header_length_v1 = {x : uint32 | x >= 2}

type width = uint32

type height = uint32

type header_v1(n) =  width * height * uint32[n-2]

type image_data = uint32[]


type image_file_v1 = {n: header_length_v1} * header_v1(n) * image_data



This type allows for arbitrary extensions to the header, so that any
extension to this format which adds more header fields is a subtype. A
reader for this file format is thus forwards-compatible with any format
extension which adds more header fields.


However, even though I wrote it them above as products, the components of a
binary file format are not products. In particular, they do not allow
nesting, e.g.: (a*b*c)*d is not compatible with (a*b)*d.  Compatibility
between two binary formats is determined  by the same data being in “the
same place.”  Even though, classically, uint32 * uint8 is a subtype of
uint32, one could not replace  width in the type above with (uint32 * uint8),
as that would move the location of the height from bytes 8-11 to bytes
9-12, breaking existing code.


For my class, I made up a “flat sequence type” to describe file formats,
which effectively functions as a product type which must be
right-associated. I would instead write the header(n) type as width ++
height ++ uint32[n-2]. I have not worked out all the theory, but I’ve
noticed fewer students making mistakes about binary compatibly since
introducing this idea.


Can anyone point me to related work?


Yours,

Jimmy Koppel

MIT CSAIL


More information about the Types-list mailing list