[Unison-hackers] [unison-users] Re: Broken unicode handling in unison 2.27.57

Sat May 2 22:55:56 EDT 2009

> Benjamin Pierce wrote:
>> I'm not an expert on unicode, character set, or internationalization
>> issues, so I'm afraid I can't be much use here.
>
> I guess I could provide the required input here. I mainly need a guide
> around the Unison codebase and some OCaml features, so that I can find
> points of interest quickly without understanding the bulk of Unison.

I can try to help with this, but it's probably best if you post to
unison-hackers instead of emailing me directly, since there are other
people there who may be able to help if I'm away from mail.

> Crash course for the issues at hand, not only for you, but for future
> reference as well. Skip this for now if you want.
> ...

Very useful -- thank you for all that.

>> But, as I've heard from
>> other people and as you comment below, a clean solution seems to
>> require changes in many places, which to my mind means beginning with
>> a paper design that specifies what behavior is intended in all cases.
>
> In the long run, it might be good to have such a paper in some place
> other than this mailthread, e.g. on launchpad (as a blueprint) or in
> some wiki. For now, I'll simply start here.
>
> What we have:
>
> A) Windows
>   * case insensitive (though case preserving)
>   * UTF-16 file names => full unicode support
>   * Unicode-enabled and legacy system calls
>   * no Unicode normalization enforced, NFC customary
> B) Linux
>   * case sensitive (on most mount types)
>   * Octet-based file names => interpretation not fixed
>   * encoding derived from environment (LC_CTYPE, setlocale(3))
>   * no Unicode normalization enforced, NFC customary
> C) OS X
>   * case insensitive (case sensitive variant available as well)
>   * Octet-based file names, encoded using UTF-8 NFD
>   * Unicode normalization using NFD enforced
>
> What do we want? I'll include the case issues along with the
> normalization issues in order to draw parallels.
>
> 1. No unnecessary modifications between systems with the same
> capabilities. Names that differ only in case should be transferred
> unchanged between case-sensitive systems. Mixtures of precomposed and
> decomposed characters should be transferred without modification
> between systems not enforcing NFD.
>
> 2. Error messages for conflicting names. Synchronizing two names
> differing only in case to a case-insensitive system will cause an
> error to be printed. Synchronizing two names with the same normal form
> to a system enforcing NFD will cause an error to be printed.
>
> 3. New files created to target policy. When synchronizing a new file
> from an NFD Macintosh to some other system, the file should be created
> in its NFC form, which follows custom and makes access to the file
> through the user interface easier.
>
> 4. Existing files keep their name. When synchronizing between two
> hosts, one of which enforces NFD and the other does not, and when
> there is one file on each system such that the normal forms of the
> names are equal, then the contents of the files should be
> synchronized, and the file names left as is, even if one of them is
> not normalized at all. The same should hold for case, by the way, but
> I'm not sure if it does.
>
> 5. Unmappable characters cause an error. If the target system doesn't
> use Unicode, and a file to be synchronized contains some character
> outside the supported charset of the target system, then an error
> should be reported.
>
> 6. The Graphical user interface should correctly display unicode
> characters. This might involve some investigation of the underlying
> libraries and the corresponding OCaml bindings.
>
> 7. The Text user interface should correctly display unicode
> characters. I'm not sure how much trouble it is to turn a Windows
> command prompt to unicode mode from within an OCaml program. If it's
> too much effort, some kind of replacement character might be printed
> instead.
>
> 8. The filesystem should be accessed using either standard interfaces
> of the underlying platform, or some (portable or OCaml-friendly)
> implementation that behaves in the same way. This means Unicode
> Windows API calls, setlocale(LC_CTYPE, "") on Linux, and probably some
> Cocoa stuff for Macintosh.
>
> 9. Normalization should at least generate valid HFS+ names. The HFS+
> standard contains its own description of an NFD normalization
> algorithm, complete with full replacement tables. This is basically a
> frozen snapshot of the Unicode NFD specification, which guarantees
> that the set of valid file names won't change as Unicode develops. As
> an absolute minimum, employing those conversions will result in valid
> names and thus allow the file to be stored under that name. Using some
> evolving NFD implementation, like camomile probably does, has the
> benefit of following a platform's policy even when it's not enforced,
> and thus (according to 3. above) is preferable to a minimal
> normalization. In that case, duplicate files with the same normal form
> (as discussed in 2. above) might occur even on the Mac side of a
> synchronization.
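
To make points 2, 3 and 5 of the draft concrete, here is a minimal
sketch in Python (not OCaml, and not Unison code). It assumes the
standard library's unicodedata module as the normalizer, which tracks
current Unicode rather than the frozen HFS+ tables mentioned in point 9.
All function names are hypothetical and chosen only for illustration.

```python
import unicodedata
from collections import defaultdict


def comparison_key(name, case_insensitive=False):
    """Key under which two names would collide on the target system."""
    key = unicodedata.normalize("NFD", name)
    if case_insensitive:
        # Case folding can undo normalization, so normalize again.
        key = unicodedata.normalize("NFD", key.casefold())
    return key


def find_conflicts(names, case_insensitive=False):
    """Point 2: group names that share a normal form (and case, if asked)."""
    groups = defaultdict(list)
    for name in names:
        groups[comparison_key(name, case_insensitive)].append(name)
    return [group for group in groups.values() if len(group) > 1]


def name_for_new_file(name, target_enforces_nfd):
    """Point 3: create new files in the form customary on the target."""
    form = "NFD" if target_enforces_nfd else "NFC"
    return unicodedata.normalize(form, name)


def check_mappable(name, target_encoding):
    """Point 5: report an error if the target charset cannot hold the name."""
    try:
        name.encode(target_encoding)
    except UnicodeEncodeError as exc:
        raise ValueError(
            f"{name!r} cannot be represented in {target_encoding}"
        ) from exc


precomposed = "caf\u00e9.txt"   # e-acute as one code point (NFC)
decomposed = "cafe\u0301.txt"   # e followed by a combining acute (NFD)
print(find_conflicts([precomposed, decomposed, "other.txt"]))
print(name_for_new_file(decomposed, target_enforces_nfd=False) == precomposed)
```

Point 4 (existing files keep their name) would use the same key for
matching files across replicas, but would leave both stored names
untouched.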

This all seems reasonable, though obviously there are many issues to  
be considered and I don't feel I understand them deeply.

One thing that I would add is that there should be a switch that  
completely disables all handling of unicode, etc., and produces  
exactly the current behavior.
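
As a rough illustration of such a switch, here is a small Python sketch
(again, not Unison code). It assumes file names arrive as raw octets and
that the Unicode-aware mode interprets them as UTF-8. The flag name
unicode_mode is hypothetical and is not an existing Unison preference.

```python
import unicodedata


def normalize_name(raw: bytes, unicode_mode: bool = False) -> bytes:
    """Return the name to use for comparison and transfer.

    With unicode_mode off, the octets pass through untouched, so the
    behavior is identical to treating names as opaque byte strings.
    """
    if not unicode_mode:
        return raw
    # Assumption: names are UTF-8; invalid input raises UnicodeDecodeError.
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")


decomposed = "cafe\u0301".encode("utf-8")
print(normalize_name(decomposed))                     # unchanged octets
print(normalize_name(decomposed, unicode_mode=True))  # NFC-encoded octets
```

Keeping the disabled path as a pure pass-through means that names which
are not valid UTF-8 never reach the decoder unless the user opts in.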

>>> Things that I can think of might require improvement:
>>> 1. The position of the change. Is Case.normalize the correct place?
>>> 2. The dependency. Is using camomile acceptable, or do we require
>>>  our own implementation of unicode normalization?
>>> 3. Use of findlib. While I guess the use of findlib for camomile
>>>  makes the build more portable, it might be cleaner to switch the
>>>  whole unison build to findlib. On the other hand, if you want to
>>>  keep build time deps to a minimum, findlib shouldn't be used at all.
>>> 4. The handling of compilation alternatives. Is providing two files
>>>  "unicode.ml" in two different directories an acceptable way to
>>>  provide and link to optional code?
>>
>> I can only comment on 2, 3, and 4 at the moment:
>
> Sad, the placement of the hook is my primary concern right now. Does
> the above crash course and specification draft enable you to provide
> useful pointers as to where this behaviour should be placed? Otherwise
> I guess I'll simply have to have a closer look at how case sensitivity
> is handled, and try to duplicate parts of that.

One problem here is that the case insensitivity handling was mostly  
coded by other people, and I have never understood the details  
completely.  Hopefully some of them can chime in.

Case.normalize does seem like an appropriate place, but there also
seems to be some trickiness in the Path module, which distinguishes
"global" (case-normalized) and "local" (case-insensitive iff local
replica is case-insensitive) paths in the replica.

Another thing to look at is how filenames are transferred across the  
network when new files are being created.

Another is the way filenames are printed when they are being passed to  
external tools like merging or remote-copying programs.

>> For 4, providing two alternate unicode.ml files seems reasonable.
>> Another alternative that may be reasonable is to include a snapshot
>> of the camomile distribution in the unison distribution.
>
> Would be feasible in terms of license, as camomile is LGPL-2. The size
> of camomile, in terms of tar bundle file size, is several times that
> of unison, though, so I don't know if you want to add that to unison.

The size is not such a big deal -- I mainly worry about dependencies  
and about complicating the build process.  It seems that camomile is  
pretty much a standalone package, so adding it to Unison might not be  
bad, but perhaps it's safest from the point of view of stability to  
keep camomile outside of unison and make sure the unison build process  
works whether or not camomile is present.

>> If someone (or a group of people) steps up and volunteers to design
>> and implement a clean solution, and if partial versions need to be
>> stored someplace while the project is underway, I'll be happy to
>> discuss finding a home for them either in a branch of the unison
>> repository or in a separate repository on U. Penn's svn server.
>
> As I said, I think I can implement the normalization stuff, but I'll
> need some guide around the Unison codebase, else it will take more
> time than I have to get my bearings. Some kind of instant messaging
> contact or someone present regularly on IRC would be great.
>
> I like Russels suggestion about Launchpad. Before I start pushing my
> own branches there, I think it would be a good idea to register Unison
> as is with Launchpad. Maybe it would be better if some core developer
> would do this, so that it doesn't look like it's my project.
>
> It would also be good to have a clone of the subversion repository on
> Launchpad as a bzr branch. Theoretically, Launchpad has a feature to
> import subversion repositories on a regular basis, but they use some
> one-way mechanism, which doesn't allow the changes from other branches
> to be merged back into the subversion tree. The bzr-svn plugin does
> provide this functionality. Would it be possible to place a
> post-commit hook on the subversion server, in order to push each
> commit to Launchpad? Or have a cron job do this? If you want, I can
> figure out and write down the required commands from the bzr side of
> things.
>
> I guess it would be good to have a group called unison on subversion,
> so that the branches can be associated with that group, instead of
> individual developers. I believe this should also be the foundation of
> branches with write access for multiple people, but I haven't tried
> that yet. I'd be happy to be part of such a group.

I'm willing to help with repository issues, but I'd prefer to wait a  
little till it's clear that this project is making progress before  
sinking a lot of time into setting things up.  Would it make sense  
just to take a copy of the sources, put it somewhere convenient for  
collaboration among whoever is interested in this, let things run for  
a little while, and then synchronize the two replicas and set up a way  
of keeping them in sync?

Best,

      - Benjamin

P.S.  Since the discussion is getting pretty technical, I suggest we  
move it to the unison-hackers list.  I'll cross-post this there so you  
can just "reply all" and then edit headers.  (You'll need to sign up  
for that list, but you should do that anyway, since it's where commit  
logs get sent.)