[Unison-hackers] Memory exhaustion issue (#1068)

Michael von Glasow michael at vonglasow.com
Sun Jan 5 07:47:06 EST 2025


On 05/01/2025 14:06, Tõivo Leedjärv wrote:
> On Wed, 27 Nov 2024 at 20:41, Michael von Glasow <michael at vonglasow.com> wrote:
>> On average, one file occupies around 730 bytes of memory.
> I have run tests similar to what you did and the results are roughly
> the same. This is normal and amounts to Unison building up the
> in-memory database (the "archive") which obviously gets bigger with
> each synced file. I also ran tests with memtrace as suggested by
> Jacques-Henri and found no obvious leaks or bugs (which doesn't mean
> there couldn't be any, just that they're not easy to find).
>
> This increase in memory usage is due to the metadata and is not
> related to the size of synced files; it is a direct function of the
> number of synced items.
>
> What explains the issue suddenly appearing after you upgraded from the
> older Unison version is that you also upgraded from 32 bits to 64
> bits. OCaml values use word-sized blocks in memory. Going from 32 to
> 64 bits basically doubles the memory need for pretty much all the
> metadata, with the notable exception of string values (file/dir names and
> more complex props, such as xattrs and ACLs, if you're syncing those).
Thanks for getting back in touch. I have in the meantime tried to sync
an even larger set of files between my production systems (about 1.1
million files, 1.1 TB) and ran out of physical memory during the
initial scan. At roughly 730 bytes per file, 1.1 million files already
works out to around 800 MB for the archive alone, so yes, I was likely
near the limit of what a 1 GB system could handle, and going from 32 to
64 bits pushed it over the edge. Long story short, I am planning to
retire the old Pi 3 and replace it with a Pi 5 with 8 GB of RAM.
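
For anyone curious how much the word size actually matters here: a quick
(and purely illustrative) way to see it is to measure a metadata-like
record with Obj.reachable_words. This is just a sketch of my own – the
types below are made up and have nothing to do with Unison's actual
archive code:

  (* Purely illustrative, not Unison's types: a small record standing in
     for per-file metadata. *)
  type props = {
    perm  : int;     (* immediate value, stored inline in the record *)
    mtime : float;   (* boxed, because the record is not all-float *)
    size  : int64;   (* boxed custom block *)
  }

  type entry = {
    name  : string;  (* byte-counted, roughly the same on 32 and 64 bit *)
    props : props;
    hash  : string;
  }

  let () =
    let e = { name = "example.txt";
              props = { perm = 0o644; mtime = 1736064000.0; size = 1024L };
              hash = String.make 16 '\000' } in
    (* Obj.reachable_words (OCaml >= 4.04) counts the heap words reachable
       from a value; a word is 4 bytes on 32-bit and 8 bytes on 64-bit,
       which is where the rough doubling comes from. *)
    let words = Obj.reachable_words (Obj.repr e) in
    Printf.printf "entry uses %d words = %d bytes on this platform\n"
      words (words * (Sys.word_size / 8))
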
> You might be wondering if ~800 bytes for each file is reasonable or
> whether the entire archive has to be in memory. For the latter, I can
> only say that this has not been a limitation and your case is rather
> extreme.

I’d be curious to know what the “average” use case for Unison looks
like. Does anyone happen to have an idea of how people out there are
using Unison?

My use case – around a terabyte of data in a million files, accessed by
a handful of users and fairly static – is well within the means of a Pi
3 with 1 GB of RAM, as long as we’re just talking about file sharing via
CIFS or SFTP. It’s only when Unison gets involved that the system
reaches its limit.

Therefore, one of the questions that came to my mind was indeed whether
the entire archive needs to be in memory – or whether memory usage could
be arranged so that only portions of the archive are resident at any
given time. That would certainly improve scalability.

How is the archive structured in memory? For example, if all lookup
operations are done by hash, the obvious memory-saving design change
would be to keep the archive on disk and cache in memory only the
entries whose hashes start with a certain prefix. Comparisons would then
need to be processed sorted by prefix. The prefix length could be
dynamic, based on total archive size and available memory; each
additional hex digit in the prefix would cut memory usage by a factor of
roughly 16. Admittedly, I have no idea how complex that would be to
implement, and I understand that no one is particularly eager to take it
on if it only affects one single fellow out there. Replacing my hardware
with something more powerful is probably going to happen more quickly.


