Image Compression, Part 3
October 27, 2023, Joachim Folz
Thought we were done with compression? So did we, but as it turns out, we overlooked a Squashfs parameter that is maybe even more important than compression settings: block size.
The final piece of the puzzle #
So far, we have run two sets of benchmarks to find the most suitable system-level and user-level compression settings. However, we overlooked an important setting: block size.
Squashfs is a read-only file system image format for Linux with optional compression. To let us access individual files without decompressing the whole image, data is organized in blocks. Each block is compressed individually, meaning we only need to decompress blocks that contain the file we want to access. The default block size is 128k, but Squashfs supports any power of two between 4k and 1M.
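Outside of enroot, this is the `-b` option of mksquashfs. A minimal sketch (the directory and image names are placeholders, and the level shown is just an example):

```bash
# Pack a directory into a Squashfs image with 1M blocks,
# zstd compression, and an example compression level of 12.
mksquashfs rootfs/ image.sqsh -b 1M -comp zstd -Xcompression-level 12
```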
Test setup #
We use the same setup as for our last post: import `pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel` from Docker Hub and measure various metrics like export time and compressed size.
Since compression gets more challenging for shorter sequences, blocks smaller than the 128k default are not worth considering, so we will only look at block sizes of 128k, 256k, 512k, and 1M. We already determined that zstandard outperforms lz4, so we will only look at zstandard this time. We will also limit testing to compression levels 1 to 15, since higher levels were too slow to be useful.
To apply these settings, we set the environment variable `ENROOT_SQUASH_OPTIONS="-b $bs -comp zstd -Xcompression-level $lvl"` (replace `$bs` with the block size in bytes and `$lvl` with the desired compression level). All tests use 8 threads to speed up testing. Since compression time scales almost linearly with the number of threads, simply multiply the export times by 2 or 4 to get a close estimate for 4 or 2 threads, respectively.
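Put together, a single export run looks roughly like the sketch below; the thread count is shown here via mksquashfs's `-processors` option, which is an assumption about how it would be passed:

```bash
# One benchmark run: 1M blocks (in bytes), zstd level 12,
# 8 compression threads (assumed to be set via mksquashfs's -processors).
export ENROOT_SQUASH_OPTIONS="-b 1048576 -comp zstd -Xcompression-level 12 -processors 8"
enroot import docker://pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
```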
Block size comparison #
Let us first look at a plot that shows export time vs. compressed size for different compression levels. The four series show the results with different block sizes.
It is immediately clear that larger block sizes greatly improve compression. The smallest size achieved with the default block size of 128k is 49.4% at level 15. If we use 1M instead, level 2 already compresses to 48.4%. Our new record is 44.2% from level 12 onward.
While we observe that block size has some effect on export time, there is no discernible trend. Some combinations of block size and level are faster than others, but you cannot predict what will happen if you change either. The overall conclusion is still clear, though: The largest block size of 1M is Pareto optimal for levels 2 to 15.
Below is a selection of interesting values in tabular form.
| Block size | Level | Utilization (threads) | Export time | Size (%) |
|---|---|---|---|---|
| 128k | 2 | 5.2 | 21.0s | 54.0 |
| 128k | 3 | 5.3 | 28.9s | 53.3 |
| 128k | 12 | 7.9 | 2m3.2s | 50.9 |
| 128k | 15 | 7.9 | 4m34.9s | 49.4 |
| 256k | 2 | 5.3 | 23.6s | 51.8 |
| 256k | 3 | 6.7 | 28.6s | 50.0 |
| 256k | 12 | 7.9 | 3m0.1s | 48.5 |
| 256k | 15 | 8.0 | 5m46.2s | 47.1 |
| 512k | 2 | 5.2 | 19.6s | 50.4 |
| 512k | 3 | 6.6 | 24.5s | 48.6 |
| 512k | 12 | 7.9 | 2m12.2s | 46.6 |
| 512k | 15 | 7.9 | 3m29.1s | 46.6 |
| 1M | 2 | 5.5 | 18.4s | 48.4 |
| 1M | 3 | 6.9 | 23.9s | 46.4 |
| 1M | 12 | 7.9 | 2m40.6s | 44.2 |
| 1M | 15 | 7.9 | 4m11.2s | 44.2 |
Container creation times #
Finally, let us look at what these improvements to compression mean from a user perspective. We do not really care how big our container images are. Smaller files are nice to have, but what we really want is faster container creation times. How long it takes to create a container determines how responsive the cluster feels. We still need to verify that the compression settings we found actually help in that regard. According to our statistics, the most common configuration for jobs is currently 4 CPU threads, so let us look at that first. Below is a plot of compression level vs. creation time for different block sizes. Lines represent the median of 5 runs and the filled sections show the observed value range.
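For reference, such a run can be timed with a simple loop like the one below (a sketch; `bench` and `pytorch.sqsh` are placeholder names):

```bash
# Time 5 container creations from the same image, removing the
# container between runs; the names used here are placeholders.
for i in 1 2 3 4 5; do
    time enroot create --name bench pytorch.sqsh
    enroot remove -f bench
done
```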
These results are somewhat unexpected. Much more than compression level, block size greatly influences container creation times, and it is immediately obvious which value to use: bigger is better. With block size 1M, container creation takes just 16.5 seconds, almost 10 seconds faster than with 128k. It is even 5 seconds faster than our previous time for 4 threads, despite using a 25% larger image. For our use case, a block size of 1M is optimal, so we will adopt it as our default setting. Higher compression levels still improve creation times overall, but only marginally; smaller images may become more relevant under high file system and network congestion.
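A minimal sketch of what that default could look like in enroot's configuration file (assuming the usual `/etc/enroot/enroot.conf` location and its space-separated key value syntax; the level shown is only an example):

```bash
# /etc/enroot/enroot.conf (path may vary by installation)
# 1M blocks (in bytes) with zstd; the level is an example value.
ENROOT_SQUASH_OPTIONS "-b 1048576 -comp zstd -Xcompression-level 3"
```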
Below is another selection of interesting values in tabular form. You will also find plots for 2 and 8 threads at the end of the post.
| Threads | Block size | Level | Creation time | Utilization (threads) |
|---|---|---|---|---|
| 2 | 128k | 2 | 36.5s | 1.2 |
| 2 | 128k | 15 | 38.4s | 1.3 |
| 2 | 256k | 2 | 33.5s | 1.3 |
| 2 | 256k | 15 | 35.3s | 1.3 |
| 2 | 512k | 2 | 29.6s | 1.3 |
| 2 | 512k | 15 | 30.0s | 1.3 |
| 2 | 1M | 2 | 26.4s | 1.5 |
| 2 | 1M | 15 | 26.0s | 1.4 |
| 4 | 128k | 2 | 25.9s | 1.7 |
| 4 | 128k | 15 | 25.3s | 1.9 |
| 4 | 256k | 2 | 23.1s | 1.8 |
| 4 | 256k | 15 | 22.9s | 2.0 |
| 4 | 512k | 2 | 21.8s | 1.7 |
| 4 | 512k | 15 | 20.4s | 1.8 |
| 4 | 1M | 2 | 16.7s | 2.2 |
| 4 | 1M | 15 | 16.5s | 2.2 |
| 8 | 128k | 2 | 22.5s | 2.0 |
| 8 | 128k | 15 | 20.8s | 2.4 |
| 8 | 256k | 2 | 21.4s | 1.9 |
| 8 | 256k | 15 | 19.8s | 2.3 |
| 8 | 512k | 2 | 20.0s | 1.8 |
| 8 | 512k | 15 | 18.9s | 1.9 |
| 8 | 1M | 2 | 15.3s | 2.3 |
| 8 | 1M | 15 | 15.1s | 2.3 |