[Unison-hackers] Memory exhaustion issue (#1068)
Michael von Glasow
michael at vonglasow.com
Sat Nov 23 17:25:25 EST 2024
On 23/11/2024 23:13, Tõivo Leedjärv wrote:
> Hi Michael,
>
> Thank you for the tests so far.
>
> What does unison -version report (it will show the compiler that was
> used, as it may be important in this case)?
unison version 2.53.3 (ocaml 4.14.1)
> For each of the tests you've executed, could you please report the
> total number of files/dirs in the roots? The number of updates is 1
> (per sync), as you already wrote.
Currently counting...
The small profile has some 8,400 items, slightly above 40 GB in total;
its archive file is around 750 kB.
The big profile has some 700,000 items, around 350 GB in total, with an
archive file of around 58 MB.
>> times the size of the file – quite a lot IMO. And for a tool which can
> The size of a synced file does not impact memory usage, so you are
> most likely looking at a bug.
The "10 times the size of the file" refers to the archive file (if it
were 10 times the size of the synced file, it would have caused way
bigger problems, much earlier). It's not that bad :-)
>> get that memory-hungry, it might be worthwhile to look into ways to
>> reduce memory usage.
>> Or is there a way to tell Unison to stop being smart and just copy the
>> damn thing
> Statements like these don't really motivate those trying to help (for
> you should know that there have been tons of memory-usage improvements
> in the past few years). But yes, there is a way: set the rsync
> preference to false. I don't think it is relevant in this case but you
> can try it and let us know if that changes the memory usage pattern.
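Noted; if I understand correctly, that would be a one-line change in the
profile, something like this (illustrative snippet, not tested here yet):

# disable the rsync delta-transfer algorithm; changed files are
# then transferred whole
rsync = false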
Please don’t get me wrong – you guys are doing an awesome job and I
certainly don’t mean to discourage you from what you’re doing. Apologies
if that came across differently.
Scanning files for parts already present at the other end certainly
makes sense in scenarios where bandwidth is more limited than memory,
and some of my Unison use cases do involve connections with limited
bandwidth. In this particular use case, however, bandwidth is decent
(100 Mbit/s) but memory is limited, and the bandwidth-saving approach,
although well-meant, ends up backfiring.
In more technical terms, a possible approach would be to detect when
we're running out of physical memory (excessive swapping can quickly
cripple a system). If such a condition is detected, skip duplicate
detection for that file (freeing up the memory used for it) and just
copy it, (potentially) sacrificing bandwidth efficiency for stability.
This is roughly what I'm currently doing with -copythreshold, except
that the threshold would be determined automatically and would adapt to
the current situation.
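To make that concrete, here is a rough sketch of such a check in OCaml
(Linux-only, reading MemAvailable from /proc/meminfo; the function
names and the 256 MB floor are invented for illustration and are not
taken from the Unison code base):

(* Return available physical memory in kB, or None if it cannot be
   determined (e.g. not on Linux). *)
let mem_available_kb () =
  match open_in "/proc/meminfo" with
  | exception Sys_error _ -> None
  | ic ->
      let rec scan () =
        match input_line ic with
        | exception End_of_file -> None
        | line ->
            (* the line we want looks like "MemAvailable:  1234567 kB" *)
            (try Scanf.sscanf line "MemAvailable: %d kB" (fun kb -> Some kb)
             with Scanf.Scan_failure _ | End_of_file | Failure _ -> scan ())
      in
      let result = scan () in
      close_in_noerr ic;
      result

(* Fall back to a plain copy for the next file when available memory
   drops below the floor; keep the current behaviour if memory cannot
   be measured. *)
let should_skip_delta ?(floor_kb = 256 * 1024) () =
  match mem_available_kb () with
  | Some kb -> kb < floor_kb
  | None -> false

The policy itself (which files to fall back on, and where to put the
floor) would obviously need more thought; the point is only that the
condition is cheap to check before deciding how to transfer a file.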