[TYPES] Type theory of file formats

Jimmy Koppel jkoppel at mit.edu
Wed Aug 24 03:21:14 EDT 2022


Thanks everyone for the links! This is great stuff, and it will take me a
while to look through everything. I'm slightly ashamed for not thinking of
Narcissus, having visited Benjamin at Purdue while it was in submission.

I'll in particular follow up with Henry and Frank privately, as their work
seems especially relevant.


On Tue, Aug 23, 2022 at 6:49 PM Kalani Thielen <kthielen at gmail.com> wrote:

> Hi Jimmy,
>
> Sure, there are rules for “safe casting and convertibility” here,
> specified as type class instances for a function “convert” that is the
> (overloaded) witness I mentioned:
>
> https://urldefense.com/v3/__https://github.com/morganstanley/hobbes/blob/main/lib/hobbes/boot/convert.hob*L117__;Iw!!IBzWLUs!XlDKvMF0Rtlm8BY6LF70llbprMX_mzi-SWRHtZeLm1Hecw_ZL4m5moYgxhcK81vMnvy4TlxIOZhS961qp68yDS_fn-xq$  
>
> The syntax is a little unusual, I took some inspiration from Mark Jones’s
> TREX (as well, inspiration for the implementation of qualified types here
> from his dissertation on the topic, though as mentioned before not the
> “dictionary passing” interpretation but a “compile-time rewriting” instead).
>
> e.g. the highlighted definition above looks like:
>
> instance (r={lbl:h*t}, s/lbl :: x, ConvertInto x h, ConvertInto s t) =>
> ConvertInto s r where […]
>
>
> The constraint “r={lbl:h*t}” says in English, “the type ‘r’ is a non-empty
> record type whose first field has a name ‘lbl’ of type ‘h’, with the rest
> of the record as ’t’”, then “s/lbl :: x” says “the type ’s’ can be indexed
> by ‘lbl’ to yield a type ‘x’” and blah blah blah, so you wind up deriving
> the function described before to select each of the necessary fields from
> the source into the destination type.
>
> There are also rules for “castability” as a preferred condition for
> “convertibility” as e.g.:
>
> https://urldefense.com/v3/__https://github.com/morganstanley/hobbes/blob/main/lib/hobbes/boot/convert.hob*L68__;Iw!!IBzWLUs!XlDKvMF0Rtlm8BY6LF70llbprMX_mzi-SWRHtZeLm1Hecw_ZL4m5moYgxhcK81vMnvy4TlxIOZhS961qp68yDekh7iGa$  
>
> instance (ar={lbl:ah*at}, br={lbl:bh*()}, Castable ah bh) => Castable ar
> br where […]
>
>
> Where the “from” type ‘ar’ and the “to” type ‘br’ share an exact common
> prefix up to the final field, which merely has to be castable.  The unit
> type is written “()” here like in Haskell and is the terminal case for
> generically splitting record types into heads/tails this way.
>
> If you’re interested, there’s a shell that’s part of the project here and
> you can interact with this overloaded function, which is maybe easier than
> reading the ‘boot’ code here:
>
> $ hi -s
> > :t {a={x=1,y=2,z=3}, b={x=1,z=3}, c={y=2}}
> { a:{ x:int, y:int, z:int }, b:{ x:int, z:int }, c:{ y:int } }
>
> > :t convert({a={x=1,y=2,z=3}, b={x=1,z=3}, c={y=2}}) :: {b:{z:double},
> a:{z:double,x:long}}
> { b:{ z:double }, a:{ z:double, x:long } }
>
> > convert({a={x=1,y=2,z=3}, b={x=1,z=3}, c={y=2}}) :: {b:{z:double},
> a:{z:double,x:long}}
> {b={z=3}, a={z=3, x=1}}
>
>
> The code for entropy coding “biased structured types” (types as
> probabilities?) is here, it’s a header-only C++ lib but it writes in this
> file format I mentioned where we also do this kind of structural subtyping
> (compression levels wind up being about the same or better than LZMA for
> our high volume tick data, but much more practical for our data sets given
> how much faster this method is):
>
> https://urldefense.com/v3/__https://github.com/morganstanley/hobbes/blob/main/include/hobbes/cfregion.H__;!!IBzWLUs!XlDKvMF0Rtlm8BY6LF70llbprMX_mzi-SWRHtZeLm1Hecw_ZL4m5moYgxhcK81vMnvy4TlxIOZhS961qp68yDUwMvzp3$  
>
> (That code is actually a little out of date from where it has grown inside
> of Morgan Stanley, but it does work and the basic idea is still the same.)
>
> HTH
>
>
>
> On Aug 22, 2022, at 4:01 AM, Jimmy Koppel <jkoppel at mit.edu> wrote:
>
> This is cool! Can you point me to relevant parts of the repo with examples?
>
> On Mon, Aug 22, 2022 at 1:03 AM Kalani Thielen <kthielen at gmail.com> wrote:
>
>> Hi Jimmy,
>>
>> I’ve used a similar observation in this PL/compiler that I made for
>> Morgan Stanley a few years back:
>> https://urldefense.com/v3/__https://github.com/morganstanley/hobbes__;!!IBzWLUs!XlDKvMF0Rtlm8BY6LF70llbprMX_mzi-SWRHtZeLm1Hecw_ZL4m5moYgxhcK81vMnvy4TlxIOZhS961qp68yDcXor8qL$  
>>
>> … although I didn’t require strict field placement but just
>> convertibility as determined by structural subtyping.  So e.g.
>> ‘{x:int,y:int,z:int}’ is a subtype of ‘{z:long,x:double}’ as witnessed by
>> an inferred function that drops “y”, swaps “x” and “z” while widening them
>> into their respective types.  As a special case it optimizes for ’trivial
>> convertibility’ (where the supertype is a prefix of the subtype) and the
>> identity function is inferred.  There’s a flipped rule for sum types, where
>> it’s always safe to convert into a large subsuming sum (hence the type 0 as
>> the subtype for all types and 1 as the supertype for all types).
>>
>> In this PL, I defined those rules as a system of type classes (ala
>> Haskell), with the additional feature that constraints on qualified types
>> are always eliminated at compile-time (since where this is used at Morgan
>> Stanley, runtime performance penalties have to be minimized).
>>
>> As well as being used to decide compatibility between file formats, this
>> method is also useful to decide compatibility between independent message
>> passing processes.  Previously at Morgan Stanley, we had a lot of software
>> that kept “version numbers” and it could be very difficult to figure out if
>> two “versions” of a message format could be used together (or on a larger
>> scale, if a whole system was even coherent and ready to trade the next
>> day).  Structural subtyping makes that problem much simpler, and also gives
>> you a straightforward explanation for why e.g. two programs can’t
>> communicate.
>>
>> Also on the subject of file formats, beyond just “flat headers”, I found
>> dependent sum types very useful to make “file references” as e.g. “T at f>> is an offset in the file “f” (dynamically determined) where a “T” value can
>> be found (which, if you mmap a large file, you still need to resolve
>> outside of the file contents since different programs will map at different
>> base addresses).  With this it’s easy to build regular recursive types and
>> give type assignments to e.g. btrees and skip lists, decide subtyping on
>> indexes this way, etc.
>>
>> I also added probability distributions to structured types, so again
>> requiring inference of a subtyping witness between convertible structures
>> with different expectations, and used those expectations to encode
>> structures down to their entropy limit — so that we can minimally store
>> billions of ticks split out by symbol in a skip list and efficiently read
>> them out of mapped files concurrently, automatically converting into a few
>> different types for reporting tools and ad-hoc queries.
>>
>> Hope this helps!
>>
>>
>>
>> > On Aug 20, 2022, at 12:07 AM, Jimmy Koppel <jkoppel at mit.edu> wrote:
>> >
>> > [ The Types Forum,
>> http://lists.seas.upenn.edu/mailman/listinfo/types-list ]
>> >
>> > There’s a beautiful fact, not often discussed, that the theory of
>> subtyping
>> > explains forwards compatibility between file formats.
>> >
>> >
>> > For example here’s a simple file format for images:
>> >
>> >
>> > type header_length_v1 = {x : uint32 | x >= 2}
>> >
>> > type width = uint32
>> >
>> > type height = uint32
>> >
>> > type header_v1(n) =  width * height * uint32[n-2]
>> >
>> > type image_data = uint32[]
>> >
>> >
>> > type image_file_v1 = {n: header_length_v1} * header_v1(n) * image_data
>> >
>> >
>> >
>> > This type allows for arbitrary extensions to the header, so that any
>> > extension to this format which adds more header fields is a subtype. A
>> > reader for this file format is thus forwards-compatible with any format
>> > extension which adds more header fields.
>> >
>> >
>> > However, even though I wrote it them above as products, the components
>> of a
>> > binary file format are not products. In particular, they do not allow
>> > nesting, e.g.: (a*b*c)*d is not compatible with (a*b)*d.  Compatibility
>> > between two binary formats is determined  by the same data being in “the
>> > same place.”  Even though, classically, uint32 * uint8 is a subtype of
>> > uint32, one could not replace  width in the type above with (uint32 *
>> uint8),
>> > as that would move the location of the height from bytes 8-11 to bytes
>> > 9-12, breaking existing code.
>> >
>> >
>> > For my class, I made up a “flat sequence type” to describe file formats,
>> > which effectively functions as a product type which must be
>> > right-associated. I would instead write the header(n) type as width ++
>> > height ++ uint32[n-2]. I have not worked out all the theory, but I’ve
>> > noticed fewer students making mistakes about binary compatibly since
>> > introducing this idea.
>> >
>> >
>> > Can anyone point me to related work?
>> >
>> >
>> > Yours,
>> >
>> > Jimmy Koppel
>> >
>> > MIT CSAIL
>>
>>
>
> --
> Jimmy Koppel
> twitter.com/jimmykoppel
> https://urldefense.com/v3/__http://www.jameskoppelcoaching.com__;!!IBzWLUs!XlDKvMF0Rtlm8BY6LF70llbprMX_mzi-SWRHtZeLm1Hecw_ZL4m5moYgxhcK81vMnvy4TlxIOZhS961qp68yDfN7cHWT$  
>
>
>

-- 
Jimmy Koppel
twitter.com/jimmykoppel
https://urldefense.com/v3/__http://www.jameskoppelcoaching.com__;!!IBzWLUs!XlDKvMF0Rtlm8BY6LF70llbprMX_mzi-SWRHtZeLm1Hecw_ZL4m5moYgxhcK81vMnvy4TlxIOZhS961qp68yDfN7cHWT$  


More information about the Types-list mailing list