Piotr Pawłow
2017-05-06 22:55:58 UTC
Hello,
TL;DR: should LVM cache work properly with suspend to disk?
I've been using LVM cache for my /home volume since 2015 and had no
trouble with it until I started suspending my computer to disk every day.
At least once every 2-3 weeks the machine would lock up a few seconds
after resuming. After a reset it would usually work again, but
sometimes the cache would end up corrupted beyond repair and I
had to remove it. I had to switch to writethrough mode, because
removing a writeback cache would leave the file system too corrupted to
repair.
I suspected the initrd, as it activates volumes and scans devices for
filesystem superblocks before resuming. My theory was that those reads
may cause blocks to be cached and so modify the cache, and then the
resumed system doesn't know that something has changed and gets confused.
I attempted to fix it by adding "resume=/dev/sda1" to the kernel command
line, where sda1 is my swap partition, which makes the kernel resume
without starting the initrd.
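
For anyone wanting to confirm that the running kernel really received
the parameter, here is a minimal sketch, assuming Python 3 on Linux. The
function name resume_device is made up for illustration; /proc/cmdline
is the standard file holding the booted kernel's command line. If
resume= appears more than once, this sketch simply returns the last
occurrence.

```python
from typing import Optional


def resume_device(cmdline: str) -> Optional[str]:
    """Return the value of the last resume= parameter, or None if absent."""
    device = None
    # Kernel command line parameters are separated by whitespace.
    for token in cmdline.split():
        if token.startswith("resume="):
            device = token[len("resume="):]
    return device


# Typical use on a running system (not executed here):
#   with open("/proc/cmdline") as f:
#       print(resume_device(f.read()))
```

This only shows what was passed at boot; it says nothing about whether
the resume actually bypassed the initrd.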
I switched to writeback again, and it worked fine for 6 months,
maybe more. Until today.
Today I resumed it, and was greeted with this in the kernel log:
block manager: recursive lock detected in metadata
cache: 253:8: promotion failed; couldn't update on disk metadata
cache: 253:8: metadata operation 'dm_cache_insert_mapping' failed: error = -22
cache: 253:8: aborting current metadata transaction
(http://paste.ubuntu.com/24526200 line 2445)
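
The error value in the log above is a negated errno; 22 is EINVAL on
Linux. A minimal sketch, assuming Python 3, for pulling such failures
out of a saved kernel log follows. The function name and the regular
expression are illustrative assumptions, written only against the
message quoted above (as a single unwrapped line), not against every
message dm-cache can emit.

```python
import errno
import re

# Matches the "metadata operation ... failed" message as a single line.
PATTERN = re.compile(
    r"cache: (?P<dev>\d+:\d+): metadata operation '(?P<op>\w+)' failed: "
    r"error = (?P<code>-?\d+)"
)


def failed_metadata_ops(lines):
    """Yield (device, operation, errno name) for each matching log line."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            # The kernel reports errors as negative errno values.
            code = abs(int(m.group("code")))
            name = errno.errorcode.get(code, str(code))
            yield m.group("dev"), m.group("op"), name


# Typical use (not executed here): save the log with `dmesg > kern.log`,
# then iterate over failed_metadata_ops(open("kern.log")).
```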
Fortunately cache_repair was able to repair it, and a filesystem
scrub then found only 1 checksum error, so it's a lot better than the
previous corruptions.
Maybe it's a different problem than before. Maybe it's unrelated to
suspending. I don't know. That's why I'm writing to the list.
Is my current approach to resuming theoretically safe with caching? Are
there any other gotchas I'm not aware of?