[Unison-hackers] Memory exhaustion issue (#1068)

Greg Troxel gdt at lexort.com
Sun Jan 5 08:39:41 EST 2025


Michael von Glasow <michael at vonglasow.com> writes:

> Thanks for getting back in touch. I have in the meantime tried to sync
> an even larger set of files between my production systems (about 1.1
> million files, 1.1 TB), and ran out of physical memory already during
> the initial scan. So yes, I was likely near the limit of what a 1 GB
> system could handle, and going from 32 to 64 bits broke things. So, long
> story short, I am planning to upgrade the old Pi 3 and replace it with a
> Pi 5 with 8G of RAM.

What are you doing about swap space?  Does your system page?  If you
are running an RPi 3 in aarch64 mode, you could configure 32 GB of swap
space on a USB-attached SSD, and then unison should run.  Have you
tried that?

>> You might be wondering if ~800 bytes for each file is reasonable or
>> whether the entire archive has to be in memory. For the latter, I can
>> only say that this has not been a limitation and your case is rather
>> extreme.
>
> I’d be curious to know what the “average” use case for Unison looks
> like. Does anyone happen to have an idea of how people out there are
> using Unison?

There are about 1000 people on the users list, so I'd guess there are at
least 10K users.  But I'm guessing.  Unison, being honorable Free
Software, does not contain tracking code that would report usage.

> For my use case, around a terabyte of data in a million files, accessed
> by a handful of users and fairly static, would be well within the means
> of a Pi 3 with 1 GB of RAM – as long as we’re just talking about file
> sharing via CIFS or SFTP. It’s only when Unison gets involved that the
> system reaches its limit.

CIFS/SFTP is not a fair comparison, because that is access, not sync.
You would need to compare to git or rsync.  But rsync does not detect
changes on both sides and merge them, and one needs more metadata for
that; rsync doesn't keep any metadata at all.  So that's not a fair
comparison either.

What happens if you try to check your 1M files in 1T space into git?
Does that work on your 1G RPI3?

> Therefore, one of the questions that came to my mind was indeed if the
> entire archive needs to be in memory – or how memory usage can be
> arranged so that only portions of the archive are in memory. It would
> certainly improve scalability.

You could invent virtual memory and use it :-)  Or you could implement
application-specific virtual memory.  This would look like on-disk
storage for the archives, with a cache of objects in memory that are
read in on demand.

> How is the archive structured in memory? For example, if all lookup
> operations are done by hash code, the obvious memory saving design
> change would be to keep the archive on disk and cache only hashes with a
> certain prefix in memory. Comparisons would then need to be sorted by
> prefix. The prefix length could be dynamic, based on total archive size
> and available memory; memory usage would be reduced by a factor of
> roughly 16 for each hex digit in the prefix. Though, admittedly, I have
> no idea of how complex that is to implement, and I understand no one is
> particularly eager to do it if it just affects one single fellow out
> there. Replacing my hardware with something more powerful is probably
> going to happen more quickly.

Indeed, it gets complicated, and it might not perform fast enough: we'd
get vague "unison is too slow" bug reports instead of vague "unison
uses too much memory" bug reports.

Thank you for digging in and doing tests.  We've had a lot of vague
reports over the years, and I feel this iteration has been far more
useful than most.  Your measurements show roughly 730-800 bytes per
file (those two numbers are close enough), so 1.1 million files works
out to around 800-880 MB of archive data, which by itself nearly fills
a 1 GB machine.  There are no obvious big wins available, just maybe
squeezing a few bytes per entry, or an application-specific VM scheme.

You are choosing to upgrade hardware, and to me that makes sense.  A 1
GB machine just seems underpowered as a file server.  It was a
reasonable amount of RAM for a budget system in its day, but today, it's
the smallest system I have powered on, by a factor of 4, except for an
oddball 2 GB machine.

You could also choose to generate profiles for subparts of the file
tree and run them sequentially.  I organize my files by directory and
sync the directories separately anyway, because I want to control which
directories get synced to which subset of places, for various reasons
unrelated to unison.
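For example (hostnames and paths invented), two small profiles in
~/.unison, run one after the other as "unison photos" and
"unison documents":

  # ~/.unison/photos.prf
  root = /srv/data/photos
  root = ssh://backuphost//srv/data/photos
  batch = true

  # ~/.unison/documents.prf
  root = /srv/data/documents
  root = ssh://backuphost//srv/data/documents
  batch = true

Each root pair gets its own archive, so peak memory per run is bounded
by the largest subtree rather than by the whole tree.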

