Discussion:
[linux-lvm] Caching policy in machine learning context
Jonas Degrave
2017-02-13 10:58:04 UTC
Hi,

We are a group of scientists who work on reasonably sized datasets
(10-100GB). Because we had trouble managing our SSDs (everyone likes to
have their data on the SSD), I set up a caching system in which the 500GB SSD
caches the 4TB HD. This way, everybody would have their data virtually on
the SSD, and only the first pass through the dataset would be slow.
Afterwards, it would be cached anyway, and the reads would be faster.

I used lvm-cache for this. Yet it seems that the (only) smq policy is very
reluctant to promote data to the cache, whereas what we need is for data to
be promoted basically upon the first read, because if someone is using the
machine on certain data, they will most likely go over the dataset a couple
of hundred times in the following hours.
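For reference, the cache was set up roughly like this (a sketch; the SSD is
/dev/sda and is already part of the volume group VG that holds the big LV):

lvcreate -L 445G -n lv_cache VG /dev/sda                 # cache data LV on the SSD
lvcreate -L 1G -n lv_cache_meta VG /dev/sda              # cache metadata LV
lvconvert --type cache-pool --poolmetadata VG/lv_cache_meta VG/lv_cache
lvconvert --type cache --cachepool VG/lv_cache VG/lv     # attach the cache to the data LV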

Right now, after a week of testing lvm-cache with the smq-policy, it looks
like this:
start 0
end 7516192768
segment_type cache
md_block_size 8
md_utilization 14353/1179648
cache_block_size 128
cache_utilization 7208960/7208960
read_hits 19954892
read_misses 84623959
read_hit_ratio 19.08%
write_hits 672621
write_misses 7336700
write_hit_ratio 8.40%
demotions 151757
promotions 151757
dirty 0
features 1
-------------------------------------------------------------------------
LVM [2.02.133(2)] cache report of found device /dev/VG/lv
-------------------------------------------------------------------------
- Cache Usage: 100.0% - Metadata Usage: 1.2%
- Read Hit Rate: 19.0% - Write Hit Rate: 8.3%
- Demotions/Promotions/Dirty: 151757/151757/0
- Feature arguments in use: writeback
- Core arguments in use : migration_threshold 2048 smq 0
- Cache Policy: stochastic multiqueue (smq)
- Cache Metadata Mode: rw
- MetaData Operation Health: ok
The number of promotions has been very low, even though the read hit rate
is low as well. This is with a cache of 450GB, and currently only 614GB of
data on the cached device. A read hit rate of under 20%, when just randomly
caching blocks would have achieved about 73% (450GB / 614GB ≈ 0.73), is not
what I would have hoped to get.

Is there a way to make the caching way more aggressive? Some settings I can
tweak?

Yours sincerely,

Jonas
Zdenek Kabelac
2017-02-13 12:55:41 UTC
Post by Jonas Degrave
Hi,
We are a group of scientists, who work on reasonably sized datasets
(10-100GB). Because we had troubles managing our SSD's (everyone likes to have
their data on the SSD), I set up a caching system where the 500GB SSD caches
the 4TB HD. This way, everybody would have their data virtually on the SSD,
and only the first pass through the dataset would be slow. Afterwards, it
would be cached anyway, and the reads would be faster.
I used lvm-cache for this. Yet, it seems that the (only) smq-policy is very
reluctant in promoting data to the cache, whereas what we would need, is that
data is promoted basically upon the first read. Because if someone is using
the machine on certain data, they will most likely go over the dataset a
couple of hundred times in the following hours.
Right now, after a week of testing lvm-cache with the smq-policy, it looks
start 0
end 7516192768
segment_type cache
md_block_size 8
md_utilization 14353/1179648
cache_block_size 128
cache_utilization 7208960/7208960
read_hits 19954892
read_misses 84623959
read_hit_ratio 19.08%
write_hits 672621
write_misses 7336700
write_hit_ratio 8.40%
demotions 151757
promotions 151757
dirty 0
features 1
-------------------------------------------------------------------------
LVM [2.02.133(2)] cache report of found device /dev/VG/lv
-------------------------------------------------------------------------
- Cache Usage: 100.0% - Metadata Usage: 1.2%
- Read Hit Rate: 19.0% - Write Hit Rate: 8.3%
- Demotions/Promotions/Dirty: 151757/151757/0
- Feature arguments in use: writeback
- Core arguments in use : migration_threshold 2048 smq 0
- Cache Policy: stochastic multiqueue (smq)
- Cache Metadata Mode: rw
- MetaData Operation Health: ok
The number of promotions has been very low, even though the read hit rate is
low as well. This is with a cache of 450GB, and currently only 614GB of data
on the cached device. A read hit rate of lower than 20%, when just randomly
caching would have achieved 73% is not what I would have hoped to get.
Is there a way to make the caching way more aggressive? Some settings I can tweak?
Hi

You've not reported the kernel version in use.
Please provide results with kernel 4.9.

Also note - the cache will NOT cache blocks which are already well covered by
the page-cache, and it is also a 'slow-moving' cache - it needs a couple of
repeated uses of a block (not served from the page-cache) before the block is
promoted to the cache.
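One way to see this in action (a sketch - the dataset path is an example, and
the device-mapper name for VG/lv is typically VG-lv): drop the page-cache
between reads so the reads actually reach the block layer, then watch the
dm-cache counters:

sync
echo 3 > /proc/sys/vm/drop_caches      # as root: drop the page-cache so re-reads hit the device
dd if=/path/to/dataset of=/dev/null bs=1M
dmsetup status VG-lv                   # read hits/misses, promotions, demotions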

Regards

Zdenek
Zdenek Kabelac
2017-02-13 14:33:53 UTC
Post by Jonas Degrave
I am on kernel version 4.4.0-62-generic. I cannot upgrade to kernel 4.9, as it
did not play nice with the CUDA drivers:
https://devtalk.nvidia.com/default/topic/974733/nvidia-linux-driver-367-57-and-up-do-not-install-on-kernel-4-9-0-rc2-and-higher/
Yes, I understand the cache needs repeated use of blocks, but my question is
basically: how many? And can I lower that number?
In our use case, a user basically reads a certain group of 100GB of data
completely about 100 times. Then another user logs in and reads a different
group of data about 100 times. But after a couple of such users, I observe
that only 20GB in total has been promoted to the cache, even though the cache
is 450GB and could easily fit all the data one user needs.
So I come to the conclusion that I need a more aggressive policy.
I now have a reported hit rate of 19.0%, when there is so little data on the
volume that 73% of it would fit in the cache. I could probably solve this
issue by making the caching policy more aggressive. I am looking for a way
to do that.
There are 2 'knobs'. One is 'sequential_threshold', where the cache tries
to avoid promoting 'long' sequential reads into the cache - so if
you do 100G reads, these likely meet the criteria and are excluded from
promotion (and I think this one is not configurable for smq).

The other is 'migration_threshold', which limits the bandwidth load on the
cache device.

You can try to change its value:

lvchange --cachesettings migration_threshold=10000000 vg/cachedlv

(check with dmsetup status)
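For example (a sketch - the device-mapper name for VG/lv is typically VG-lv,
and a recent enough lvm2 can report the values itself):

dmsetup status VG-lv                          # core args, incl. migration_threshold, appear in the status line
lvs -o+cache_policy,cache_settings VG/lv      # lvm2's view of the current policy and settings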

Not sure, though, how these things are configurable with the smq cache policy.

Regards

Zdenek
Jonas Degrave
2017-02-15 13:30:29 UTC
Thanks, I tried your suggestions, and also tried going back to the mq policy
to play with its parameters. In the end, I tried:

lvchange --cachesettings 'migration_threshold=20000000
sequential_threshold=10000000 read_promote_adjustment=1
write_promote_adjustment=4' VG
With little success. This is probably due to the mq-policy looking only at
the hit-count, rather than the hit-rate. Or at least, that is what I gather
from line 595 of the code:
http://lxr.free-electrons.com/source/drivers/md/dm-cache-policy-mq.c?v=3.19#L595

I wrote a small script, so my users could empty the cache manually, if they
want to:

#!/bin/bash
# Rebuild the cache pool on the SSD (/dev/sda), effectively emptying the cache of VG/lv.
if [ "$(id -u)" != "0" ]; then
    echo "This script must be run as root" 1>&2
    exit 1
fi
lvremove -y VG/lv_cache                                  # detach and remove the existing cache pool
lvcreate -L 445G -n lv_cache VG /dev/sda                 # recreate the cache data LV on the SSD
lvcreate -L 1G -n lv_cache_meta VG /dev/sda              # recreate the cache metadata LV
lvconvert -y --type cache-pool --poolmetadata VG/lv_cache_meta VG/lv_cache
lvchange --cachepolicy smq VG                            # select the smq policy
lvconvert --type cache --cachepool VG/lv_cache VG/lv     # re-attach the pool to VG/lv
So the only remaining option for me would be to write my own policy. This
should be quite simple, as you basically need to act as if the cache is not
full yet.

Can someone point me in the right direction as to how to do this? I have
tried to find the latest version of the code, but the best I could find was a
Red Hat CVS server which times out when connecting:
cvs [login aborted]: connect to sources.redhat.com(209.132.183.64):2401
failed: Connection timed out
Can someone direct me to the latest source of the smq-policy?

Yours sincerely,

Jonas
Zdenek Kabelac
2017-02-16 10:29:53 UTC
Post by Jonas Degrave
Thanks, I tried your suggestions, and tried going back to the mq policy and
lvchange --cachesettings 'migration_threshold=20000000
sequential_threshold=10000000 read_promote_adjustment=1
write_promote_adjustment=4' VG
With little success. This is probably due to the mq-policy looking only at the
hit-count, rather than the hit-rate. Or at least, that is what I make up from
http://lxr.free-electrons.com/source/drivers/md/dm-cache-policy-mq.c?v=3.19#L595
I wrote a small script, so my users could empty the cache manually, if they
#!/bin/bash
if [ "$(id -u)" != "0" ]; then
echo "This script must be run as root" 1>&2
exit 1
fi
lvremove -y VG/lv_cache
lvcreate -L 445G -n lv_cache VG /dev/sda
lvcreate -L 1G -n lv_cache_meta VG /dev/sda
lvconvert -y --type cache-pool --poolmetadata VG/lv_cache_meta VG/lv_cache
lvchange --cachepolicy smq VG
lvconvert --type cache --cachepool VG/lv_cache VG/lv
So, the only remaining option for me, would to write my own policy. This
should be quite simple, as you basically need to act as if the cache is not
full yet.
Can someone point me in the right direction as to how to do this? I have tried
to find the last version of the code, but the best I could find was a redhat
CVS-server which times out when connecting.
cvs [login aborted]: connect to sources.redhat.com(209.132.183.64):2401 failed: Connection timed out
Can someone direct me to the latest source of the smq-policy?
Hi

Yep - it does look like you have a special use-case where you know 'ahead
of time' what the usage pattern is going to be.

The 'smq' policy is designed to fill rather 'slowly' over time with
longer-lived data which is known to be used over and over - so that e.g.
after a reboot there is a large chance you will need it again.

But in your case it seems you need a policy which fills very quickly with the
current set of data - i.e. some sort of page-cache extension.

So to get to the source:

https://github.com/torvalds/linux/blob/master/drivers/md/dm-cache-policy-smq.c

It is a relatively 'small' piece of code - but it may take a while to get
into it, as you need to fit within the policy rules - there is only a limited
amount of data you may keep per cached data block, among other constraints...

Once you get a new dm caching policy loaded, lvm2 should be able to use it,
as cache_policy & cache_settings are 'free-form' strings.
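For instance, with a hypothetical policy module named 'mlq' loaded (the
policy name and the setting below are made up, just to show the strings are
passed through to the kernel):

lvchange --cachepolicy mlq --cachesettings 'promote_on_first_read=1' VG/lv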

For the 4.12 kernel there is (likely) going to be a new 'cache2-like' target
which should be much faster at startup... but it may or may not solve your
special 100GB workload.

Regards

Zdenek
