[TYPES] Type theory of file formats

Jimmy Koppel jkoppel at mit.edu
Mon Aug 22 04:01:30 EDT 2022


This is cool! Can you point me to relevant parts of the repo with examples?

On Mon, Aug 22, 2022 at 1:03 AM Kalani Thielen <kthielen at gmail.com> wrote:

> Hi Jimmy,
>
> I’ve used a similar observation in this PL/compiler that I made for Morgan
> Stanley a few years back:
> https://urldefense.com/v3/__https://github.com/morganstanley/hobbes__;!!IBzWLUs!WAhCjaIxpVfUqykQo-tm_T17e9eXap8gH814S1Bs3-XuAVaYnKFOUf-oUA4tNwSkQq5IKR7_9ZrTBo2NETIYGQI-xJ68$  
>
> … although I didn’t require strict field placement but just convertibility
> as determined by structural subtyping.  So e.g. ‘{x:int,y:int,z:int}’ is a
> subtype of ‘{z:long,x:double}’ as witnessed by an inferred function that
> drops “y”, swaps “x” and “z” while widening them into their respective
> types.  As a special case it optimizes for ’trivial convertibility’ (where
> the supertype is a prefix of the subtype) and the identity function is
> inferred.  There’s a flipped rule for sum types, where it’s always safe to
> convert into a large subsuming sum (hence the type 0 as the subtype for all
> types and 1 as the supertype for all types).
>
> In this PL, I defined those rules as a system of type classes (ala
> Haskell), with the additional feature that constraints on qualified types
> are always eliminated at compile-time (since where this is used at Morgan
> Stanley, runtime performance penalties have to be minimized).
>
> As well as being used to decide compatibility between file formats, this
> method is also useful to decide compatibility between independent message
> passing processes.  Previously at Morgan Stanley, we had a lot of software
> that kept “version numbers” and it could be very difficult to figure out if
> two “versions” of a message format could be used together (or on a larger
> scale, if a whole system was even coherent and ready to trade the next
> day).  Structural subtyping makes that problem much simpler, and also gives
> you a straightforward explanation for why e.g. two programs can’t
> communicate.
>
> Also on the subject of file formats, beyond just “flat headers”, I found
> dependent sum types very useful to make “file references” as e.g. “T at f> is an offset in the file “f” (dynamically determined) where a “T” value can
> be found (which, if you mmap a large file, you still need to resolve
> outside of the file contents since different programs will map at different
> base addresses).  With this it’s easy to build regular recursive types and
> give type assignments to e.g. btrees and skip lists, decide subtyping on
> indexes this way, etc.
>
> I also added probability distributions to structured types, so again
> requiring inference of a subtyping witness between convertible structures
> with different expectations, and used those expectations to encode
> structures down to their entropy limit — so that we can minimally store
> billions of ticks split out by symbol in a skip list and efficiently read
> them out of mapped files concurrently, automatically converting into a few
> different types for reporting tools and ad-hoc queries.
>
> Hope this helps!
>
>
>
> > On Aug 20, 2022, at 12:07 AM, Jimmy Koppel <jkoppel at mit.edu> wrote:
> >
> > [ The Types Forum,
> http://lists.seas.upenn.edu/mailman/listinfo/types-list ]
> >
> > There’s a beautiful fact, not often discussed, that the theory of
> subtyping
> > explains forwards compatibility between file formats.
> >
> >
> > For example here’s a simple file format for images:
> >
> >
> > type header_length_v1 = {x : uint32 | x >= 2}
> >
> > type width = uint32
> >
> > type height = uint32
> >
> > type header_v1(n) =  width * height * uint32[n-2]
> >
> > type image_data = uint32[]
> >
> >
> > type image_file_v1 = {n: header_length_v1} * header_v1(n) * image_data
> >
> >
> >
> > This type allows for arbitrary extensions to the header, so that any
> > extension to this format which adds more header fields is a subtype. A
> > reader for this file format is thus forwards-compatible with any format
> > extension which adds more header fields.
> >
> >
> > However, even though I wrote it them above as products, the components
> of a
> > binary file format are not products. In particular, they do not allow
> > nesting, e.g.: (a*b*c)*d is not compatible with (a*b)*d.  Compatibility
> > between two binary formats is determined  by the same data being in “the
> > same place.”  Even though, classically, uint32 * uint8 is a subtype of
> > uint32, one could not replace  width in the type above with (uint32 *
> uint8),
> > as that would move the location of the height from bytes 8-11 to bytes
> > 9-12, breaking existing code.
> >
> >
> > For my class, I made up a “flat sequence type” to describe file formats,
> > which effectively functions as a product type which must be
> > right-associated. I would instead write the header(n) type as width ++
> > height ++ uint32[n-2]. I have not worked out all the theory, but I’ve
> > noticed fewer students making mistakes about binary compatibly since
> > introducing this idea.
> >
> >
> > Can anyone point me to related work?
> >
> >
> > Yours,
> >
> > Jimmy Koppel
> >
> > MIT CSAIL
>
>

-- 
Jimmy Koppel
twitter.com/jimmykoppel
https://urldefense.com/v3/__http://www.jameskoppelcoaching.com__;!!IBzWLUs!WAhCjaIxpVfUqykQo-tm_T17e9eXap8gH814S1Bs3-XuAVaYnKFOUf-oUA4tNwSkQq5IKR7_9ZrTBo2NETIYGZRG-er8$  


More information about the Types-list mailing list