Image Compression Revisited

October 24, 2023, Joachim Folz

We previously tested container startup times with different Squashfs compression settings. We now revisit this topic from the perspective of users importing from Docker Hub or creating custom images.

Why revisit compression? #

Our previous tests focused on settings for the clusters’ periodic image import service, mostly from the NGC registry. We used the existing nvcr.io_nvidia_pytorch_21.05-py3.sqsh image file to measure compressed size and container start times with different compression methods. Since this process would run roughly once a month, when new images are released, time to export was not a concern. Plus, since each image would likely be used thousands of times, saving a few seconds on container startup is well worth any extra time spent compressing the image.

However, when users create custom images, waiting for half an hour to save a second later is probably not worth it, and there is likely a different sweet spot for compression settings.

Test setup #

Since we want to focus on users creating Squashfs images, we will import pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel from Docker Hub. Notably, this image is considerably larger at over 15 GiB uncompressed, compared to about 12 GiB for the image used for the previous test. Since compression only affects the export step, all time measurements are for enroot export only. In practice, downloading and unpacking the image naturally adds some extra time, so the whole process will take a bit longer than what is reported here.
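As a rough sketch, the measurement looks like this with enroot (the container name and output file are placeholders; the .sqsh file name assumes enroot’s usual substitution of “/” and “:” with “+”):

```bash
# Pull and unpack the image from Docker Hub into enroot's container storage.
enroot import docker://pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
enroot create --name pytorch-test pytorch+pytorch+2.1.0-cuda12.1-cudnn8-devel.sqsh

# Compression settings only affect the export step, so only this command is timed.
time enroot export --output pytorch-test.sqsh pytorch-test
```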

Compression methods #

We first compare no compression against lz4, lz4hc, and zstandard (zstd) level 1.

Compression settings for enroot are controlled via the ENROOT_SQUASH_OPTIONS environment variable. We use the following values for our tests:

  • none: "-noI -noD -noF -noX"
  • lz4: "-comp lz4"
  • lz4hc: "-comp lz4 -Xhc"
  • zstd: "-comp zstd -Xcompression-level $lvl"
    (replace $lvl with the desired compression level)
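Applied to the zstd case, for example, an export could then be run like this (a minimal sketch; pytorch-test is the hypothetical container name from the test setup above):

```bash
# ENROOT_SQUASH_OPTIONS is forwarded to mksquashfs during export.
export ENROOT_SQUASH_OPTIONS="-comp zstd -Xcompression-level 1"
time enroot export --output zstd-1.sqsh pytorch-test
```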
| Compression | Threads | Utilization (avg. threads) | Export time | Size (% of original) |
| ----------- | ------- | -------------------------- | ----------- | -------------------- |
| none        | 2       | 0.9                        | 43.4s       | 99.7                 |
| none        | 4       | 1.0                        | 44.0s       | 99.7                 |
| none        | 8       | 1.0                        | 42.7s       | 99.7                 |
| lz4         | 2       | 1.9                        | 45.6s       | 68.5                 |
| lz4         | 4       | 2.8                        | 31.5s       | 68.5                 |
| lz4         | 8       | 2.8                        | 30.8s       | 68.5                 |
| lz4hc       | 2       | 2.0                        | 33m35.0s    | 60.4                 |
| lz4hc       | 4       | 3.9                        | 16m59.8s    | 60.4                 |
| lz4hc       | 8       | 7.9                        | 8m21.5s     | 60.4                 |
| zstd lvl 1  | 2       | 2.0                        | 1m20.6s     | 55.4                 |
| zstd lvl 1  | 4       | 3.5                        | 43.7s       | 55.4                 |
| zstd lvl 1  | 8       | 6.0                        | 25.7s       | 55.4                 |

The uncompressed Squashfs is 99.7% of the original size, since some duplicate files were removed. lz4 is as fast as or faster than no compression. While it was a good option for the clusters’ image import (especially since some nodes did not support zstandard at the time), lz4hc is unusably slow. Zstandard offers by far the best compression. With 2 threads it takes about twice as long to export as no compression, but it catches up at 4 threads and ends up fastest at about 26 seconds with 8 threads. An average utilization of only 6 threads at that point also suggests we are starting to hit some sort of I/O bottleneck, so we would probably not benefit from more threads.

Settings for zstandard #

Since it provides a good mix of compression and speed, we will look a bit further into zstandard. So far, we have used compression level 1. Zstandard supports up to level 19 (and 22 in ultra mode), and unlike some other compression formats such as xz, memory usage during decompression is constant, so it is safe to use very high levels. Below is a plot that shows export time vs. compressed size for zstandard levels 1 to 9, using 2, 4, or 8 threads.

Figure: export time vs. compressed size for zstd levels 1-9 with 2, 4, and 8 CPUs.
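The measurements behind this plot could be scripted roughly as follows. This is a sketch: in particular, capping the thread count via mksquashfs’s -processors option is our assumption about how one might pin threads, not necessarily how the numbers above were produced.

```bash
# Hypothetical sweep: zstd levels 1-9 with 2, 4, and 8 compression threads.
for lvl in $(seq 1 9); do
  for threads in 2 4 8; do
    # -processors limits mksquashfs's compression threads (assumption, see above).
    export ENROOT_SQUASH_OPTIONS="-comp zstd -Xcompression-level ${lvl} -processors ${threads}"
    time enroot export --output "zstd-l${lvl}-t${threads}.sqsh" pytorch-test
    du -BM "zstd-l${lvl}-t${threads}.sqsh"  # compressed size
  done
done
```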

Here we again clearly see that zstandard scales nicely with the number of threads used for compression. Unfortunately, it is also quite obvious that higher levels do not provide much of a benefit. While level 9 would save another 5% in size, export time increases 8-fold; that looks especially bad considering that level 4 already gains about 4% for just a 2-fold increase. For most use cases, however, level 2 has to be the winner: we get a bit more compression for almost no extra cost, so this will soon become the default setting for users on our cluster.
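Users who want to switch over before it becomes the default can simply set the variable themselves:

```bash
# zstd level 2: slightly better compression than level 1 at almost no extra cost.
export ENROOT_SQUASH_OPTIONS="-comp zstd -Xcompression-level 2"
```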

Duplicate removal #

Originally, the enroot developers recommended uncompressed Squashfs with duplicate removal disabled. We already discovered that compression provides a big benefit, both in terms of file size and container start times, so what about duplicate removal? Maybe we can gain some more speed here. The following plot again shows export time vs. compressed size for levels 1 to 4, with and without duplicate removal.

Figure: export time vs. compressed size with and without duplicate removal.

The results are not terribly exciting. Our main takeaway is that duplicate removal has essentially no impact on export time. We also did not observe any meaningful difference in memory usage, which is welcome. At least for the image we used, compression does not improve by a lot either, but there is some positive effect: about 4000 duplicate files are removed.
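For reference, duplicate detection is on by default in mksquashfs and is switched off with its -no-duplicates flag, so the two configurations compared here differ only in that option:

```bash
# With duplicate removal (mksquashfs default):
export ENROOT_SQUASH_OPTIONS="-comp zstd -Xcompression-level 2"

# Without duplicate removal, as in the original enroot recommendation:
export ENROOT_SQUASH_OPTIONS="-comp zstd -Xcompression-level 2 -no-duplicates"
```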

What about really high levels? #

We currently still use lz4hc for the clusters’ image import. Since we can spend more time here, which level should we be using? Let us take a look at what happens if we go all the way up to level 19.

Figure: export time vs. compressed size for zstd levels 1-19.

With 2 threads it takes over 2 hours to export at level 19. Ouch. With 8 threads we are still looking at about 31 minutes, which is quite silly. However, while there is little improvement from level 9 to 13, there is a rather large drop in size at level 14: we save about 950 MiB compared to level 1, and 1700 MiB over lz4hc, which should speed up container starts by a few seconds. We will probably use level 15 for a bit of extra safety, in case this behavior is unique to the image we used for our testing. Incidentally, level 15 was one of the settings we used in our previous tests as well.
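The corresponding setting for the clusters’ import service would then look something like this (a sketch of the planned change, not yet a committed default):

```bash
# High-effort compression for images that are exported once and started thousands of times.
export ENROOT_SQUASH_OPTIONS="-comp zstd -Xcompression-level 15"
```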