[Unison-hackers] Character Encoding issues in filenames

Gregg Tavares unison at greggman.com
Thu Dec 20 16:45:27 EST 2007


Sorry if I'm a noob and this has been covered.

I recently tried to use unison to sync files with Japanese filenames between Linux (fc6) and XP and it didn't work. I attempted to look into it hoping I could fix it and contribute. I think I know what the problem is but unfortunately I couldn't think of an easy way to fix it.

I thought I'd post about it and maybe another developer will have some ideas.

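I'm running XP as the client and Linux as the server, and I'm in text mode. Analysing the output, it's pretty clear that unison is getting UTF-8 filenames from Linux but filenames in a legacy Japanese encoding on Windows (most likely Shift_JIS, i.e. code page 932, rather than JIS or ISO-2022-JP as I first assumed).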
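To make the mismatch concrete, here is a minimal sketch in Python (not unison's own code). The sample filename is made up, and it assumes the Windows side uses code page 932 (`cp932`), which is only the usual choice for a Japanese locale.

```python
# Hypothetical sample filename; any Japanese name behaves the same way.
name = "日本語.txt"

utf8_bytes = name.encode("utf-8")   # what a UTF-8 Linux filesystem hands back
sjis_bytes = name.encode("cp932")   # ASSUMED Windows Japanese code page

print(utf8_bytes)
print(sjis_bytes)

# A byte-for-byte comparison of the two names finds no match.
same_bytes = utf8_bytes == sjis_bytes

# Decoding each side with its own encoding recovers the same name.
same_name = utf8_bytes.decode("utf-8") == sjis_bytes.decode("cp932")

# Reading the Windows bytes as if they were UTF-8 does not even decode.
try:
    sjis_bytes.decode("utf-8")
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False
```

The two byte strings describe the same name, but only if each side knows which encoding the other one used.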

This is a problem on the Windows side. Windows has two sets of APIs for most functions: the wide-character (16-bit Unicode) versions and the multibyte (8-bit, localized) versions. OCaml is using the multibyte versions (which are the default in Windows and the only ones you can pass 8-bit strings to). Those multibyte versions always return/accept strings in the locale of the OS. (My OS is set to Japanese as its non-Unicode locale.)

I was hoping I could just find a way to set the locale to UTF-8, but from searching the net it sounds like Microsoft got rid of that ability. If I could do that, I could set the locale to UTF-8, either using a shell around unison or inside unison itself, which would make the multibyte API functions accept/return UTF-8, and everything would work. Since apparently that is not possible in Windows, the only other solutions I can think of are:

#1) somehow get OCaml, or an extension library, to convert UTF-8 to/from UTF-16 and call the wide-character Win32 API functions inside unison

This seems unlikely

#2) have unison call conversion functions, in the correct places, to convert filenames between the UTF-8 used on Linux and the local encoding used on Windows, in both directions

The problem with this method is that any filename sent from Linux containing a character that doesn't exist in the current Windows encoding will get screwed up.
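The difference between the two options can be sketched in Python (again not unison code). The helper names `utf8_to_wide`, `wide_to_utf8` and `utf8_to_local` are hypothetical, the sample filenames are made up, and `cp932` is assumed as the local Windows encoding. The sketch only does the string conversions; it does not call any Win32 function.

```python
def utf8_to_wide(raw: bytes) -> bytes:
    """Option #1 (hypothetical helper): UTF-8 bytes to UTF-16LE code units."""
    return raw.decode("utf-8").encode("utf-16-le")


def wide_to_utf8(wide: bytes) -> bytes:
    """Inverse of utf8_to_wide."""
    return wide.decode("utf-16-le").encode("utf-8")


def utf8_to_local(raw: bytes, codepage: str = "cp932") -> bytes:
    """Option #2 (hypothetical helper): UTF-8 bytes to the local code page.

    Raises UnicodeEncodeError if a character has no slot in the code page.
    """
    return raw.decode("utf-8").encode(codepage)


japanese = "日本語.txt".encode("utf-8")
korean = "한국어.txt".encode("utf-8")

# Option #1 round-trips both names, because UTF-16 can hold any character.
wide_roundtrip_ok = all(
    wide_to_utf8(utf8_to_wide(n)) == n for n in (japanese, korean)
)

# Option #2 works for the Japanese name on an (assumed) cp932 system ...
japanese_local = utf8_to_local(japanese)

# ... but fails for a name whose characters are not in that code page.
try:
    utf8_to_local(korean)
    korean_ok = True
except UnicodeEncodeError:
    korean_ok = False

# Substituting "?" instead of failing makes distinct names collide.
lossy_a = "한.txt".encode("cp932", errors="replace")
lossy_b = "국.txt".encode("cp932", errors="replace")
names_collide = lossy_a == lossy_b
```

Under these assumptions, option #1 loses nothing, while option #2 either rejects the Korean name outright or maps different names to the same bytes.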

Anyway, that appears to be the issue. I hope someone finds a solution. As it is, unison will not sync between Windows and Linux (or Windows and OS X) for many foreign characters :-(

-Gregg Tavares


