[Unison-hackers] Version compatibility

Tõivo Leedjärv toivol at gmail.com
Thu May 6 15:19:22 EDT 2021


I would like to open a discussion for a solution to the issue of version
compatibility: https://github.com/bcpierce00/unison/issues/407

This is a very long email. I hope you find some time to read it and really
think it through.


PART I - General discussion


To set the scene, this is how I see the problem and the solution approach.
This is a complete view, intended to prepare a solution covering all aspects
of compatibility.

**Must-have.** The design of client-server interaction must be both
backwards-compatible and extensible.

**Nice-to-have.** Ability to read archive files (and fingerprint cache
files) in previous format versions.

*An entirely separate discussion, which I will not touch upon here, even
though we may naturally reach this state regardless.*
The entire communication protocol (incl. wire format, RPC protocol, the
API) could have a neutral specification so that other (non-OCaml)
implementations can exist (in theory, at least). The same goes for
archive file format, which could be open for reading by other tools.

## Definitions

Definitions, in the context of this discussion only:
- **Backwards compatibility** means the ability of an older version to
  communicate with a newer version without the former having to be changed.
  This does not prohibit breaking changes -- but in that case the newer
  version must include a compatibility implementation for interacting with
  older versions.
- **Extensibility** (also "forwards compatibility"?) means the ability to
  extend a specification or a usage in a backwards-compatible way --
  that is, without breaking previous versions -- while the extended usage
  becomes available to newer versions.

### Example

As an artificial example of the definitions above, let's consider text
encoding in a fictional data exchange scenario. Initially, the encoding is
ASCII and text is not allowed to contain any null characters (zero bytes).

First, we may want to start exchanging binary data. Within the original
constraints (ASCII only and no null bytes), this can be done by leveraging
Base64 encoding. The original specification is not broken; its usage has
been extended in a backwards-compatible manner.

Next, we want to add Unicode support. Backwards compatibility is easily
achieved by the new version converting to and from ASCII, but this alone
does not yield any of the benefits of Unicode. To extend the data exchange
scenario and support Unicode without any backwards-incompatible changes,
we can use any encoding that technically abides by the original constraints
(and this includes the aforementioned Base64). Luckily for us, there is
a much more suitable encoding already widely used and partly designed with
this specific compatibility in mind: UTF-8. New versions can use Unicode
without breaking ASCII-only old versions. On the other hand, using UTF-16
or UTF-32 is not compatible with the original specification (unless
wrapped in Base64 or similar).

## Layers and status

For the discussion, I want to go through the following layers (from bottom
to top):
1. Wire format (data encoding)
2. RPC protocol
3. Application data encoding (currently only for the streaming protocol?)
4. Application/API

- **Wire format (data encoding)**
  - Extensibility here means that the format can encode different-from-
    current data structures used at the RPC and application layers.
  - The currently implemented wire format (OCaml's internal data
    marshaling) is well known to be not backwards-compatible (due to
    dependency on OCaml version):
    https://github.com/bcpierce00/unison/issues/375

    Being both backwards-compatible and extensible is not a problem with
    pretty much whatever format is used, as long as it is designed for data
    exchange or storage and is general purpose (not domain-specific, not
    one fixed schema).

    The WIP format by Stéphane, and any others, like BSON, JSON, etc., are
    inherently suitable for enabling backwards compatibility and being
    extensible. Anything capable of encoding primitive data types will do.
  - There is work in progress by Stéphane to remove dependency on OCaml
    version: https://github.com/glondu/unison/tree/umarshal [^1]

    I believe this work will also satisfy the extensibility requirement at
    this layer.

- **RPC protocol**
  - Backwards compatibility is required for the RPC wire protocol, not for
    the RPC logical (stub) interface towards the application.
  - Extensibility here means that the RPC protocol can be used for different
    functions used at the application layer, potentially in a different
    manner (for example, by introducing no-response "procedure calls").
  - The currently implemented RPC protocol is conceptually very simple --
    it provides a flow-control mechanism and request-response
    communication, plus a streaming protocol for transferring file contents.

    There is no IDL or pre-defined schema; functions are registered
    dynamically and referenced by name (thus completely transparent to the
    RPC mechanism; see the sketch after this list). Exactly one function
    argument is supported, but it is opaque to the RPC mechanism and can
    therefore be any structured data type (like a tuple, record, and so
    on), effectively enabling function arguments that are unlimited in
    number, type and complexity.

    The RPC wire protocol itself (with the tiny exception of the message
    ID) and its payload are by default encoded in the same wire format. It
    is possible to supply other marshaling functions, which the streaming
    protocol does, for example. This way, any data encoding done at the
    application layer can be used directly as the on-the-wire format for
    the payload, or it can be encoded further in the default wire format.

    There is a version handshake at remote connection opening. It is not
    a version negotiation, though, so while it allows for clean breaking
    changes, it is unable to provide backwards compatibility.
  - To my knowledge, there are no specific issues with the RPC protocol.
    It does lack easy extensibility, but that should not be a problem
    because extensibility can be delegated to the application layer.

- **Application data encoding**
  - I think currently only the streaming protocol uses its own very simple
    data encoding format that is different from the common wire format.
    I will not include this layer separately in further discussion, as it
    will be covered by the same solution as the API layer.

- **Application/API**
  - Backwards compatibility at this layer involves many aspects, like the
    API functions, types and data structures, checksum algorithms, the
    set of preferences, and so on. In summary, the "contract" between
    client and server must be backwards-compatible.
  - Extensibility of the "contract" means ability to evolve features, add
    completely new features, replace implementations and algorithms, and
    remove old features.
  - The current implementation at this layer lacks backwards compatibility
    and extensibility entirely -- any change that impacts both client and
    server is inevitably a breaking change.
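
To make the registration-by-name model above concrete, here is a minimal
OCaml sketch of the concept. The names (`register_rpc`, `dispatch`) are
illustrative only and do not match Unison's actual remote module.

```
(* Handlers are registered dynamically and looked up by name; the single
   payload is opaque bytes that each handler (un)marshals itself. *)
let handlers : (string, bytes -> bytes) Hashtbl.t = Hashtbl.create 17

let register_rpc name f = Hashtbl.replace handlers name f

let dispatch name payload =
  match Hashtbl.find_opt handlers name with
  | Some f -> f payload
  | None -> failwith ("unknown RPC function: " ^ name)
```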

[^1]: There is an interesting library which seems to be very similar to what
Stéphane is doing: https://nomadic-labs.gitlab.io/data-encoding/

### Archive and fingerprint files

Issue at: https://github.com/bcpierce00/unison/issues/377

Archive and fingerprint files can pretty much follow the discussion at data
encoding and application/API layers, so I will mostly not touch on these
files specifically in the further discussion, beyond a few thoughts here.

*Reading* archive files in a non-compatible format is not a must-have, as
these files can be safely deleted or ignored (at the expense of having to
scan the entire replica again). The same is true for fingerprint cache
files. It may be done on a best-effort basis but is not a requirement.

It follows from the above that being able to write an archive file in an
old format is not required. The only must-have is this: the checksums of
archives at synced replicas must match. Importantly, this is not a
checksum of the on-disk file but a checksum computed over the data
structure in memory.
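
To make this concrete, here is a minimal OCaml sketch of a checksum
computed over an in-memory archive structure rather than over the on-disk
bytes; the `archive` type and its traversal are purely hypothetical.

```
type archive = { root : string; entries : (string * string) list }

(* A stable checksum over the in-memory structure, independent of
   whatever on-disk encoding was used to load it. *)
let checksum_of_archive a =
  let buf = Buffer.create 256 in
  Buffer.add_string buf a.root;
  List.iter
    (fun (path, fp) ->
       Buffer.add_string buf path;
       Buffer.add_string buf fp)
    a.entries;
  Digest.string (Buffer.contents buf)
```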

In the local-replica scenario, and likewise when both client and server
have been upgraded, compatibility already exists and the issue does not
arise. In a remote-replica scenario, the archives at both ends must have
the same checksum. This is easily (and necessarily) achieved with the
client-server compatibility and extensibility features, hence there is no
need to discuss archive files separately.

Regardless of all compatibility features, when the file fingerprinting
algorithm changes, a complete rescan is unavoidable.

## Proposal

- **Wire format (data encoding)**
  - There is already work in progress. We just need to make sure it
    continues to completion. We could discuss whether using a library makes
    more sense (see the footnote above); I am aware that Stéphane has tried
    a couple of libraries but dropped them for various reasons.

- **RPC protocol**
  - I don't think any major changes are needed at this layer. To enable
    easy extensibility, I propose to remove the Unison version check at
    remote connection opening, replace it with an actual protocol version,
    and add a simple version negotiation method (sketched after this list).
    Here, I do not propose to version the API itself, just the RPC
    mechanism and wire protocol.
  - I have a working prototype and will soon create a PR for review.

- **Application/API**
  - Clearly some kind of versioning is needed here. Instead of using an
    actual version number, I propose a solution based on negotiation of
    capabilities (or, well, features).
    See the details of this proposal below.
  - Why not negotiate RPC version and application features together? I have
    concluded that separation of these layers is beneficial and does not
    increase complexity (on the contrary). The RPC version negotiation can
    be kept as simple as possible. Further, it will be possible to have a
    completely different RPC mechanism which does not have any version
    negotiation or has implicit version built into the protocol.
  - To fully benefit from the extensibility enabled by feature negotiation,
    the archive file on-disk format must be structure-aware and/or record
    the structure together with the data. Simple unmarshaling to a fixed
    in-memory structure will no longer work; more dynamic parsing is
    required.
    - The wire protocol work by Stéphane is most likely not sufficient for
      this purpose as it expects prior knowledge of the exact data type and
      in-memory structure to unmarshal encoded data. Then again, since all
      marshaling and unmarshaling functions are written from scratch, there
      is nothing stopping us from encoding extra structure or type
      information into the output.
  - I plan to do a PoC for the feature negotiation proposal soon.
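
As a taste of what the simple RPC version negotiation could look like,
here is a sketch of the idea only (not the working prototype mentioned
above): each side supports a range of protocol versions and the peers
settle on the highest common one.

```
(* Settle on the highest protocol version supported by both sides;
   None means there is no common version and the connection fails. *)
let negotiate_rpc_version ~client_min ~client_max ~server_min ~server_max =
  let lo = max client_min server_min in
  let hi = min client_max server_max in
  if lo <= hi then Some hi else None
```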



PART II - Feature negotiation


# Introduction and motivation

"Features" is a set of feature names supported by a specific version of
Unison implementation. Over time, each incompatible change -- whether
mandatory or an optional add-on -- is assigned a unique feature name.

Features allow a client to connect to and properly work with a server of
different version, older or newer. When setting up the connection, both
server and client negotiate a commonly supported set of features.

Using features instead of a version makes the implementation agnostic to
versioning schemes, forks and third-party implementations. It also allows
for more flexible code changes over time, without the code being polluted
by more and more conditionals for various version combinations, such as
"if version < X then", "if version >= Y and version < Z then", and so on.

# Negotiation

Feature negotiation takes place immediately after the RPC connection has
been fully set up.

1. The client sends its full feature set to the server.
2. The server validates the intersection of its own and the client's
   feature sets.
   - On error, the server sends NOK to the client and the client closes
     the connection.
3. If OK, the server sends the intersection of the feature sets to the
   client.
4. The client validates the intersection.
   - On error, the client closes the connection.
5. If OK, the negotiation is complete and both server and client will
   use only features fully supported by both.
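
A minimal OCaml sketch of this exchange, seen from the server side;
`receive`, `send` and `validate` are hypothetical helpers standing in for
the real transport and validation machinery.

```
(* Server side of the negotiation: intersect, validate, reply. *)
let negotiate_features ~receive ~send server_features validate =
  let client_features = receive () in
  let intersection =
    List.filter (fun f -> List.mem f client_features) server_features
  in
  match validate intersection with
  | Ok () -> send (`Ok intersection); Some intersection
  | Error msg -> send (`Nok msg); None
```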

## Feature registration

A feature is added to the set by registering it. This can be done by any
part of the code that "owns" a feature, similar to how user preferences are
registered.

Registering a feature requires a unique feature name and an optional
validation function.
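
A sketch of what feature registration might look like in OCaml, loosely
modeled on how preferences are registered; all names here are illustrative,
not an actual Unison API.

```
type feature = {
  name : string;
  (* Optional check over the negotiated intersection. *)
  validate : (string list -> (unit, string) result) option;
}

let registry : (string, feature) Hashtbl.t = Hashtbl.create 17

let register ?validate name =
  if Hashtbl.mem registry name then
    invalid_arg ("duplicate feature: " ^ name);
  Hashtbl.add registry name { name; validate }

let all_features () = Hashtbl.fold (fun n _ acc -> n :: acc) registry []
```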

## Intersection validation

Each feature can provide a separate validation function. When validating
the intersection of the client's and server's feature sets, the validation
functions of all included features are run in an arbitrary order.

A validation function is able to see the entire intersection and can
freely decide whether the intersection is acceptable or not. Examples of
possible validation scenarios:

- A mandatory feature is not in the intersection
  - This typically means that the counterparty is too old, but could also
    mean that the counterparty is too new and the feature has been removed.
- A user preference is enabled for a feature that is not in the intersection
- A feature depends on another feature that is not in the intersection

Some features in the intersection can conflict with each other. This can
happen, for example, when two different implementations of a function are
both supported but must not be used simultaneously. All such conflicts
are benign in nature and will not cause feature intersection validation
to fail. (Since the intersection is a subset of the entire feature set,
failing on a conflict would mean that the full feature set is conflicting
to begin with.)
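
For illustration, two of the scenarios above expressed as validation
functions against the hypothetical registry sketched earlier.

```
(* Reject any intersection that lacks this mandatory feature. *)
let mandatory name intersection =
  if List.mem name intersection then Ok ()
  else Error (name ^ " is mandatory but not supported by the peer")

(* Reject an intersection where a feature is present without its
   dependency. *)
let depends_on ~dep name intersection =
  if List.mem name intersection && not (List.mem dep intersection)
  then Error (name ^ " requires " ^ dep)
  else Ok ()
```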

# Development

Every incompatible code change must result in a change to the set of
features:

- New code that is mandatory to use (effectively breaks compatibility
  with older versions despite feature negotiation) ->
  - Register a new feature with a validation function that rejects any
    feature intersection that does not include this feature
- New code that is optional to use ->
  - Register a new feature
- New code that replaces existing code ->
  - Register a new feature and remove one or more features
- Remove existing code ->
  - Remove one or more features
- Code is not removed but it can be deprecated ->
  - Add or change a validation function to output a deprecation warning
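
For instance, the first two cases could look like this with the
hypothetical `register` and `mandatory` helpers sketched above; the
feature names are made up.

```
(* Mandatory new code: any peer lacking the feature is rejected
   during negotiation. *)
let () =
  register "archive-format-v2" ~validate:(mandatory "archive-format-v2")

(* Optional add-on: no validation function needed; if the peer lacks
   it, the feature is simply absent from the intersection. *)
let () = register "xattr-sync"
```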

## Code evolution, conflicting features

With features, new code does not have to replace existing code even if
they seemingly conflict. Both an existing feature and a new feature can
co-exist. The code must be guarded by checking which features are enabled
at runtime for each remote connection.

For example:

- Existing code implements feature hash-1.
- New code implements a new hashing algorithm and adds feature hash-2.
- Even though two different hashing algorithms must not be used at the
  same time, both implementations can co-exist, as in the following OCaml
  sketch (the function and feature-check names are illustrative).

```
let hash data =
  (* Prefer the newest algorithm commonly supported by both sides. *)
  if feature_enabled "hash-2" then new_algorithm data
  else if feature_enabled "hash-1" then previous_algorithm data
  else assert false (* negotiation guarantees at least one hash feature *)
```

- If both server and client support hash-2 then the new implementation
  will always be used, even if both server and client also support hash-1
  at the same time.
- If either server or client does not support hash-2 then the feature
  intersection will only contain hash-1 and the previous implementation
  will be used.

Now let's imagine that, in addition to the hashing algorithm changing with
the new feature, the result type changes as well. This is trickier to
implement but clearly not impossible.

There are multiple ways of handling parallel implementations of
conflicting types. These are not the topic of this document, but a few
possibilities are provided for inspiration:

- Abstract types and type variables
- Variant types (aka sum types)
- Extensible variant types
- First class modules
- GADTs
- Classes/objects
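
For instance, an ordinary variant type (the second option above) could
cover the changed result type; the constructors here are illustrative.

```
(* One fingerprint type covering both algorithms' result types. *)
type fingerprint =
  | Hash1 of string   (* result type used by feature hash-1 *)
  | Hash2 of bytes    (* new result type introduced with hash-2 *)

let fingerprint_equal a b =
  match a, b with
  | Hash1 x, Hash1 y -> String.equal x y
  | Hash2 x, Hash2 y -> Bytes.equal x y
  | _ -> false  (* values from different algorithms never match *)
```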

### Archive file

Most changes will ultimately result in type changes. This will directly
impact data encoded in the wire format and stored in the archive file
format.

Data on the wire is transient. As both client and server have agreed on
a common feature set, they know how to marshal and unmarshal data on the
wire without any issues.

Data in the archive file is persistent and could have been written while
a different set of features was agreed upon. There are a couple of ways
to read and write archive files in this scenario:

- Do not even attempt to read an incompatible archive file. The exact
  negotiated feature set is written into the archive file. As long as both
  client and server keep negotiating the same feature set, they can read
  existing archive files. When the negotiated feature set changes (due to
  upgrades), the previous archive files can be ignored (requiring a
  complete rescan). This may be acceptable, as such upgrades are assumed
  to be quite rare.

- A subset of the used feature set is written into the archive file: only
  features that change the data structures written in the archive file are
  stored in the file. Reading can work in two ways: either as a slightly
  more forgiving variant of the point above, or by actually reading and
  unmarshaling the archive according to the features used to write it --
  even if not all of those features are included in the currently
  negotiated feature set. The latter requires types and code to be
  tailored for this, the same as with the next point below.

- The archive file on-disk format includes information about the types and
  structure of the written data (think of it like a DB with a relatively
  dynamic but still typed schema). The data can be read back selectively,
  and even converted as necessary (for example, a stored int32 can be read
  into an in-memory int64).
  The selective reading can mean two things. First, the archive was
  written with a feature that is no longer enabled; the data that was only
  relevant to that feature is simply skipped. Second, the archive was
  written without a feature that is now enabled; for the newly-enabled
  feature there is no data in the archive, but this does not break reading
  the file, as long as the new feature can deal with default or "empty"
  values for its data structures.
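
A sketch of the first two options: the archive header records the relevant
feature subset, and reading is gated on how it relates to the negotiated
set. The types are hypothetical.

```
type archive_header = {
  format_version : int;
  (* Only features that affect on-disk data structures. *)
  features : string list;
}

(* First option: ignore the archive unless the feature sets match
   exactly. *)
let can_read_exact ~negotiated header =
  List.sort compare header.features = List.sort compare negotiated

(* Second option: accept any archive whose features are a subset of the
   negotiated set and unmarshal accordingly. *)
let can_read_subset ~negotiated header =
  List.for_all (fun f -> List.mem f negotiated) header.features
```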

# Some examples of code changes possible with features extensibility

Most of these examples currently exist as feature requests or PRs. They
could be implemented in a backwards-compatible way.

- Change the hash algorithm
- Change the fingerprint digest algorithm
- Add capability to support multiple different checksums
- Add capability to sync by fs-level checksum
- Add ACL, xattrs syncing capability

