[TYPES] Type theory of file formats

Mon Aug 22 01:03:24 EDT 2022

Hi Jimmy,

I’ve used a similar observation in this PL/compiler that I made for Morgan Stanley a few years back:
https://urldefense.com/v3/__https://github.com/morganstanley/hobbes__;!!IBzWLUs!Xn2Ulm5-C0403pcAOa7dzz_fxK-N1RaBDiDdLrQw0Gu9jezqnv25nAd6bchH-X7ZWeQD3y-fuvItutbVnbCfS5s6VVK1$  

… although I didn’t require strict field placement but just convertibility as determined by structural subtyping.  So e.g. ‘{x:int,y:int,z:int}’ is a subtype of ‘{z:long,x:double}’ as witnessed by an inferred function that drops “y”, swaps “x” and “z” while widening them into their respective types.  As a special case it optimizes for ’trivial convertibility’ (where the supertype is a prefix of the subtype) and the identity function is inferred.  There’s a flipped rule for sum types, where it’s always safe to convert into a large subsuming sum (hence the type 0 as the subtype for all types and 1 as the supertype for all types).

In this PL, I defined those rules as a system of type classes (ala Haskell), with the additional feature that constraints on qualified types are always eliminated at compile-time (since where this is used at Morgan Stanley, runtime performance penalties have to be minimized).

As well as being used to decide compatibility between file formats, this method is also useful to decide compatibility between independent message passing processes.  Previously at Morgan Stanley, we had a lot of software that kept “version numbers” and it could be very difficult to figure out if two “versions” of a message format could be used together (or on a larger scale, if a whole system was even coherent and ready to trade the next day).  Structural subtyping makes that problem much simpler, and also gives you a straightforward explanation for why e.g. two programs can’t communicate.

Also on the subject of file formats, beyond just “flat headers”, I found dependent sum types very useful to make “file references” as e.g. “T at f” is an offset in the file “f” (dynamically determined) where a “T” value can be found (which, if you mmap a large file, you still need to resolve outside of the file contents since different programs will map at different base addresses).  With this it’s easy to build regular recursive types and give type assignments to e.g. btrees and skip lists, decide subtyping on indexes this way, etc.

I also added probability distributions to structured types, so again requiring inference of a subtyping witness between convertible structures with different expectations, and used those expectations to encode structures down to their entropy limit — so that we can minimally store billions of ticks split out by symbol in a skip list and efficiently read them out of mapped files concurrently, automatically converting into a few different types for reporting tools and ad-hoc queries.

Hope this helps!

> On Aug 20, 2022, at 12:07 AM, Jimmy Koppel <jkoppel at mit.edu> wrote:
> 
> [ The Types Forum, http://lists.seas.upenn.edu/mailman/listinfo/types-list ]
> 
> There’s a beautiful fact, not often discussed, that the theory of subtyping
> explains forwards compatibility between file formats.
> 
> 
> For example here’s a simple file format for images:
> 
> 
> type header_length_v1 = {x : uint32 | x >= 2}
> 
> type width = uint32
> 
> type height = uint32
> 
> type header_v1(n) =  width * height * uint32[n-2]
> 
> type image_data = uint32[]
> 
> 
> type image_file_v1 = {n: header_length_v1} * header_v1(n) * image_data
> 
> 
> 
> This type allows for arbitrary extensions to the header, so that any
> extension to this format which adds more header fields is a subtype. A
> reader for this file format is thus forwards-compatible with any format
> extension which adds more header fields.
> 
> 
> However, even though I wrote it them above as products, the components of a
> binary file format are not products. In particular, they do not allow
> nesting, e.g.: (a*b*c)*d is not compatible with (a*b)*d.  Compatibility
> between two binary formats is determined  by the same data being in “the
> same place.”  Even though, classically, uint32 * uint8 is a subtype of
> uint32, one could not replace  width in the type above with (uint32 * uint8),
> as that would move the location of the height from bytes 8-11 to bytes
> 9-12, breaking existing code.
> 
> 
> For my class, I made up a “flat sequence type” to describe file formats,
> which effectively functions as a product type which must be
> right-associated. I would instead write the header(n) type as width ++
> height ++ uint32[n-2]. I have not worked out all the theory, but I’ve
> noticed fewer students making mistakes about binary compatibly since
> introducing this idea.
> 
> 
> Can anyone point me to related work?
> 
> 
> Yours,
> 
> Jimmy Koppel
> 
> MIT CSAIL