Setting up dm-cache on Arch Linux
In the Linux kernel, dm-cache is the device-mapper target for implementing tiered storage. Unfortunately, there isn't much documentation on how to set it up directly, as most resources will guide you towards lvmcache (which is LVM's management layer on top of dm-cache).
Why dm-cache?
It's extremely flexible if you're willing to put in the effort. You can layer any number of block devices on top of any number of block devices, regardless of their size.
It works well on already existing filesystems, unlike bcache (requires adding a superblock) and lvmcache (requires using lvm). If you're starting from scratch this is still relevant, as it means you always have the option to mount your uncached filesystems as-is without any modifications should you decide to stop using dm-cache, or if the caching device fails (assuming writethrough).
Unlike bcache [1], it has no tunables [2] and seems to adapt well to changing workloads on its own, making it close to a set-it-and-forget-it setup.
Setting it up
I am assuming you have a filesystem on /dev/disk/by-id/ata-HDD (a slow device) and that your cache disk is /dev/disk/by-id/ata-SSD (a fast device).
First, you need to settle on a block size [3]. I will use 256 sectors (128 KiB; device-mapper sectors are always 512 bytes, regardless of the devices' physical sector sizes).
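As a quick sanity check on the chosen block size, here is a minimal sketch. The limits (64 to 2097152 sectors, in multiples of 64 sectors) come from the kernel's dm-cache documentation referenced in [3]; the helper names valid_block_sectors and sectors_to_kib are made up for this example.

```shell
# Check a candidate dm-cache block size given in 512-byte sectors.
# Limits per the kernel dm-cache documentation:
# 64..2097152 sectors (32 KiB..1 GiB), in multiples of 64 sectors.
# valid_block_sectors is a hypothetical helper name.
valid_block_sectors() {
    local sectors=$1
    [ "$sectors" -ge 64 ] && [ "$sectors" -le 2097152 ] && [ $(( sectors % 64 )) -eq 0 ]
}

# Convert a size in 512-byte sectors to KiB for display.
sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

if valid_block_sectors 256; then
    echo "256 sectors = $(sectors_to_kib 256) KiB: OK"
fi
```

Smaller blocks track hot data more precisely but need more metadata; larger blocks do the opposite.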
Then you need to figure out the size of the metadata. There is no documentation on this that I can find, but this mailing list message [4] indicates 4 MiB (8192 sectors) plus 16 bytes per cache block.
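BLOCK_SIZE=$(( 128*1024 ))  # cache block size, in bytes
# SSD size in 512-byte sectors
SSD_SECTORS=$(cat /sys/block/$(basename $(realpath /dev/disk/by-id/ata-SSD))/size)
# 8192 sectors (4 MiB) + 16 bytes per cache block; the 512-byte factors cancel out
METADATA_SECTORS=$(( 8192 + 16 * $SSD_SECTORS / $BLOCK_SIZE ))
echo $METADATA_SECTORS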
In my case I get 38718 sectors for metadata. To be extra cautious and to avoid potential alignment issues, I round it up to 20 MiB (40960 sectors).
METADATA_SECTORS=40960
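The sizing rule above can also be written as a small reusable function. This is only a sketch of the same rule of thumb (4 MiB plus 16 bytes per cache block), with an extra step that rounds up to the next MiB boundary. The function names and the example SSD size of 249979568 sectors are assumptions made for illustration.

```shell
# Estimate the dm-cache metadata size, in 512-byte sectors.
# Rule of thumb from the mailing list message cited above:
# 4 MiB plus 16 bytes per cache block. Function names are hypothetical.
metadata_sectors() {
    local ssd_sectors=$1 block_sectors=$2
    local cache_blocks=$(( ssd_sectors / block_sectors ))
    local bytes=$(( 4 * 1024 * 1024 + 16 * cache_blocks ))
    echo $(( (bytes + 511) / 512 ))   # round up to whole sectors
}

# Round a sector count up to the next MiB boundary (2048 sectors).
round_up_mib() {
    echo $(( ($1 + 2047) / 2048 * 2048 ))
}

# Example with an assumed SSD of 249979568 sectors and 256-sector blocks.
need=$(metadata_sectors 249979568 256)
echo "minimum: $need sectors, aligned: $(round_up_mib "$need") sectors"
```

Counting cache blocks against the whole SSD slightly overestimates the requirement, which errs on the safe side.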
Make the two logical block devices for metadata and cache blocks. See dmsetup(8) and the dm-linear documentation for help with linear tables.
CACHE_SECTORS=$(( $SSD_SECTORS - $METADATA_SECTORS ))
dmsetup create SSD-metadata --table "0 $METADATA_SECTORS linear /dev/disk/by-id/ata-SSD 0"
dmsetup create SSD-blocks --table "0 $CACHE_SECTORS linear /dev/disk/by-id/ata-SSD $METADATA_SECTORS"
Erase the metadata zone. The next step may fail with obscure messages (like "requires a block device") if this area isn't blank.
cat /dev/zero > /dev/mapper/SSD-metadata
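If you want to confirm that the metadata area really is blank before moving on, a check along these lines might help. It is a sketch only: is_zeroed is a hypothetical helper, and the -n option assumes GNU cmp (diffutils).

```shell
# Return success if the first N bytes of a file or block device are all zero.
# is_zeroed is a hypothetical helper; "cmp -n" limits the comparison length
# (GNU diffutils), which is needed because /dev/zero never ends.
is_zeroed() {
    local path=$1 nbytes=$2
    cmp -s -n "$nbytes" "$path" /dev/zero
}

# Intended use (as root), with the 40960-sector metadata device from above:
#   is_zeroed /dev/mapper/SSD-metadata $(( 40960 * 512 )) && echo "metadata is blank"
```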
Finally, create the cached logical device. Change writethrough to writeback if you understand the added risks.
HDD_SECTORS=$(cat /sys/block/$(basename $(realpath /dev/disk/by-id/ata-HDD))/size)
BLOCK_SECTORS=$(( $BLOCK_SIZE / 512 ))
dmsetup create HDD-cached --table "0 $HDD_SECTORS cache /dev/mapper/SSD-metadata /dev/mapper/SSD-blocks /dev/disk/by-id/ata-HDD $BLOCK_SECTORS 1 writethrough default 0"
We have now created /dev/mapper/HDD-cached, which behaves just like /dev/disk/by-id/ata-HDD but benefits from the cache.
mount /dev/mapper/HDD-cached /mnt/hdd
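The cache table used above packs several positional fields into one string, which is easy to get wrong. The sketch below only prints such a table line so it can be inspected before being handed to dmsetup. The field order follows the kernel's dm-cache documentation; cache_table is a hypothetical helper name, and the sector count in the example is an assumed value.

```shell
# Print a dm-cache table line; nothing is created here.
# Field order per the kernel dm-cache documentation:
#   start length cache <metadata dev> <cache dev> <origin dev>
#   <block size> <#feature args> <feature args> <policy> <#policy args>
# cache_table is a hypothetical helper name.
cache_table() {
    local origin_sectors=$1 meta_dev=$2 cache_dev=$3 origin_dev=$4
    local block_sectors=$5 mode=${6:-writethrough}
    echo "0 $origin_sectors cache $meta_dev $cache_dev $origin_dev $block_sectors 1 $mode default 0"
}

# Example with an assumed 3907029168-sector HDD and 256-sector cache blocks.
cache_table 3907029168 /dev/mapper/SSD-metadata /dev/mapper/SSD-blocks /dev/disk/by-id/ata-HDD 256
```

The printed line can then be passed to dmsetup create HDD-cached --table "..." as root.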
Automating it with systemd
The device-mapper devices created above do not persist across a reboot (the cached data and metadata on the SSD do), so some automation is needed to set them up again at boot.
In my configuration, I use dm-cache below dm-crypt (to avoid having to deal with extra encryption/key material for the SSD; dm-cache directly caches the encrypted data instead). I want the cached device to play nicely with systemd's handling of crypttab and fstab devices/mountpoints.
I found little documentation on how to do this, but the solution below works (/etc/systemd/system/setup-cached-HDD.service). Change values as needed. Use systemd-escape -p /dev/disk/by-id/* to get the dev-disk-*.device unit names.
[Unit]
Description=setup dm-cached device (HDD-cached)
DefaultDependencies=no
IgnoreOnIsolate=true
Before=cryptsetup-pre.target
BindsTo=dev-disk-by\x2did-ata\x2dHDD.device dev-disk-by\x2did-ata\x2dSSD.device
After=dev-disk-by\x2did-ata\x2dHDD.device dev-disk-by\x2did-ata\x2dSSD.device
RequiresMountsFor=/usr/bin/dmsetup
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStartPre=/usr/bin/dmsetup create SSD-metadata --table '0 40960 linear /dev/disk/by-id/ata-SSD 0'
ExecStartPre=/usr/bin/dmsetup create SSD-blocks --table '0 249938608 linear /dev/disk/by-id/ata-SSD 40960'
ExecStart=/usr/bin/dmsetup create HDD-cached --table '0 3907029168 cache /dev/mapper/SSD-metadata /dev/mapper/SSD-blocks /dev/disk/by-id/ata-HDD 256 1 writethrough default 0'
ExecStop=/usr/bin/dmsetup remove HDD-cached
ExecStopPost=/usr/bin/dmsetup remove SSD-metadata
ExecStopPost=/usr/bin/dmsetup remove SSD-blocks
[Install]
WantedBy=systemd-cryptsetup@HDD.service
Enable the systemd unit:
systemctl daemon-reload
systemctl enable setup-cached-HDD.service
Here are the relevant entries in crypttab and fstab, respectively:
# /etc/crypttab: mappings for encrypted partitions
HDD /dev/mapper/HDD-cached /root/luks/HDD.key no-read-workqueue,no-write-workqueue,submit-from-crypt-cpus,noauto,nofail,header=/root/luks/HDD.hdr
# /etc/fstab: static file system information
/dev/mapper/HDD /mnt/HDD btrfs defaults,space_cache=v2,noatime,commit=300,flushoncommit,compress-force=zstd:7,nofail 0 0
Note the nofail options, which allow the system to continue booting if a problem happens with the device. If you are caching your root (/) filesystem, you should remove nofail, use the sd-encrypt initramfs hook (with /etc/crypttab.initramfs instead) and add the dmsetup binary to your initramfs.
Monitoring the cache
Various statistics about the cache (such as usage, read/write hits and misses, dirty blocks, etc.) can be accessed using:
dmsetup status /dev/mapper/HDD-cached
See the dm-cache documentation for explanations.
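The status line is a long row of positional numbers, so a small parser can make it easier to read. The sketch below assumes the field order given in the kernel's dm-cache documentation, preceded by the start, length and target name that dmsetup status prints for a single device. parse_cache_status is a hypothetical helper, and the sample line is made up for illustration.

```shell
# Summarise one "dmsetup status <device>" line for a cache target.
# Assumed field positions (from the kernel dm-cache documentation):
#   $7 = used/total cache blocks, $8 = read hits, $9 = read misses,
#   $14 = dirty blocks.
# parse_cache_status is a hypothetical helper name.
parse_cache_status() {
    awk '$3 == "cache" {
        split($7, blocks, "/")
        reads = $8 + $9
        printf "used_blocks=%d total_blocks=%d dirty=%d read_hit_pct=%d\n",
            blocks[1], blocks[2], $14, (reads > 0 ? 100 * $8 / reads : 0)
    }'
}

# Made-up sample line; on a real system use:
#   dmsetup status HDD-cached | parse_cache_status
echo "0 3907029168 cache 8 1000/5120 256 500000/976400 300 100 50 150 0 10 0 1 writethrough 2 migration_threshold 2048 smq 0 rw -" | parse_cache_status
```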
Bonus: using multiple devices
It's possible to use dm-cache with multiple caching and/or backing devices. Imagine you want two SSDs to cache three HDDs:
/dev/disk/by-id/ata-SSD1, 1000000 sectors
/dev/disk/by-id/ata-SSD2, 2000000 sectors
/dev/disk/by-id/ata-HDD1, 40000000 sectors
/dev/disk/by-id/ata-HDD2, 60000000 sectors
/dev/disk/by-id/ata-HDD3, 80000000 sectors
Create one logical "cache device" and one logical "backing device", using dm-linear (if the SSDs have similar sizes, you can also use RAID0 aka dm-stripe with a stripe size equal to cache block size to increase performance and spread the writes more evenly);
Create the logical "cached device" using dm-cache;
Split the logical "cached device" again with dm-linear, back to the three hard disks equivalents.
This method is very versatile, as any block on any caching device can be used for any block on any backed device. However, if you use writeback caching, expect to lose all your data on all your backed devices if any single device fails. This can be mitigated by mirroring (RAID1). Writethrough is much safer, as the backed devices are always in a consistent state; a failing SSD won't cause data loss and a failing HDD won't lose data on other HDDs.
# A "\n" inside double quotes is not a newline in the shell, so feed multi-line tables to dmsetup on stdin instead
printf '%s\n' "0 1000000 linear /dev/disk/by-id/ata-SSD1 0" "1000000 2000000 linear /dev/disk/by-id/ata-SSD2 0" | dmsetup create logical-SSD
printf '%s\n' "0 40000000 linear /dev/disk/by-id/ata-HDD1 0" "40000000 60000000 linear /dev/disk/by-id/ata-HDD2 0" "100000000 80000000 linear /dev/disk/by-id/ata-HDD3 0" | dmsetup create logical-HDD
dmsetup create logical-SSD-metadata --table "0 20480 linear /dev/mapper/logical-SSD 0"
dmsetup create logical-SSD-blocks --table "0 2979520 linear /dev/mapper/logical-SSD 20480"
cat /dev/zero > /dev/mapper/logical-SSD-metadata
dmsetup create logical-HDD-cached --table "0 180000000 cache /dev/mapper/logical-SSD-metadata /dev/mapper/logical-SSD-blocks /dev/mapper/logical-HDD 256 1 writethrough default 0"
dmsetup create HDD1-cached --table "0 40000000 linear /dev/mapper/logical-HDD-cached 0"
dmsetup create HDD2-cached --table "0 60000000 linear /dev/mapper/logical-HDD-cached 40000000"
dmsetup create HDD3-cached --table "0 80000000 linear /dev/mapper/logical-HDD-cached 100000000"
mount /dev/mapper/HDD1-cached /mnt/hdd1
mount /dev/mapper/HDD2-cached /mnt/hdd2
mount /dev/mapper/HDD3-cached /mnt/hdd3
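Working out the start offsets of a concatenated linear table by hand is error-prone once there are more than two devices. The sketch below generates the table lines from "device:sectors" pairs. linear_concat_table is a hypothetical helper name, and it only prints the table rather than creating anything.

```shell
# Print a multi-line dm-linear table that concatenates devices back to back.
# Each argument is a "device:sectors" pair; the sector count is taken from
# after the last ':' in the argument.
# linear_concat_table is a hypothetical helper name.
linear_concat_table() {
    local offset=0 pair dev sectors
    for pair in "$@"; do
        dev=${pair%:*}
        sectors=${pair##*:}
        echo "$offset $sectors linear $dev 0"
        offset=$(( offset + sectors ))
    done
}

# Example with the two SSDs from the list above.
linear_concat_table /dev/disk/by-id/ata-SSD1:1000000 /dev/disk/by-id/ata-SSD2:2000000
```

Since dmsetup create reads a table from standard input when none is given, the output can be piped straight into it (as root), for example into dmsetup create logical-SSD.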
[1] https://www.kernel.org/doc/html/latest/admin-guide/bcache.html#troubleshooting-performance
[2] https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/cache-policies.html
[3] https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/cache.html#fixed-block-size
[4] https://www.redhat.com/archives/dm-devel/2012-December/msg00046.html, from https://blog.kylemanna.com/linux/ssd-caching-using-dmcache-tutorial/