Discussion:
[linux-lvm] Testing the new LVM cache feature
Richard W.M. Jones
2014-05-22 10:18:37 UTC
Permalink
I've set up a computer in order to test the new LVM cache feature. It
has a pair of 2 TB HDDs in RAID 1 configuration, and a 256 GB SSD.
The setup will be used to store large VM disk images in an ext4
filesystem, to be served both locally and over NFS.

Before I start I have some questions about this feature:

(1) Is there a minimum recommended version of LVM or kernel to use? I
currently have lvm2-2.02.106-1.fc20.x86_64, which mentions LVM cache
in the lvm(8) man page. I have kernel 3.14.3-200.fc20.x86_64.

(2) There is no lvmcache(7) man page in any released version of LVM2.
Was this man page ever created or is lvm(8) the definitive
documentation?

(3) It looks as if cached LVs cannot be resized:
https://www.redhat.com/archives/lvm-devel/2014-February/msg00119.html
Will this be fixed in future? Is there any workaround -- perhaps
removing the caching layer, resizing the original LV, then recreating
the cache? I really need to be able to resize LVs :-)

(4) To calculate the size of the cache metadata LV, do I really just
divide by 1000, min 8 MB? It's that simple? Doesn't it depend on
dm-cache block size? Or dm-cache algorithm? How can I choose block
size and algorithm?
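(For example, if my cache data LV ends up being about 230 GB, dividing by
1000 would suggest roughly 230 MB for the metadata LV -- assuming I'm
reading that rule correctly.)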

(5) Is there an explicit command for flushing the cache layer back to
the origin LV?

(6) Is the on-disk format stable for future kernel/LVM upgrades?

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines. Supports shell scripting,
bindings from many languages. http://libguestfs.org
Zdenek Kabelac
2014-05-22 14:43:49 UTC
Permalink
Post by Richard W.M. Jones
I've set up a computer in order to test the new LVM cache feature. It
has a pair of 2 TB HDDs in RAID 1 configuration, and a 256 GB SSD.
The setup will be used to store large VM disk images in an ext4
filesystem, to be served both locally and over NFS.
(1) Is there a minimum recommended version of LVM or kernel to use? I
currently have lvm2-2.02.106-1.fc20.x86_64, which mentions LVM cache
in the lvm(8) man page. I have kernel 3.14.3-200.fc20.x86_64.
With these new targets the usual rule applies - the newer the kernel and tools
are, the better for you.
Post by Richard W.M. Jones
(2) There is no lvmcache(7) man page in any released version of LVM2.
Was this man page ever created or is lvm(8) the definitive
documentation?
It's now in upstream git as a separate man page (moved from lvm(8))
Post by Richard W.M. Jones
https://www.redhat.com/archives/lvm-devel/2014-February/msg00119.html
Will this be fixed in future? Is there any workaround -- perhaps
Yes - cache is still missing a lot of features - it needs further
integration with tools like cache_check, cache_repair, ...

So far it's really only a preview - I wouldn't consider using it
for anything serious yet.
Post by Richard W.M. Jones
removing the caching layer, resizing the original LV, then recreating
the cache? I really need to be able to resize LVs :-)
Surely this feature will be implemented.
Meanwhile you have to drop the cache, resize the LV, then reattach the cache
(dropping the cache means removing it).
Post by Richard W.M. Jones
(4) To calculate the size of the cache metadata LV, do I really just
divide by 1000, min 8 MB? It's that simple? Doesn't it depend on
dm-cache block size? Or dm-cache algorithm? How can I choose block
size and algorithm?
Well, this is where your experimenting may begin.
However, for now lvm2 doesn't let you play with the algorithms - the lvchange
interface is not yet upstream...
Post by Richard W.M. Jones
(5) Is there an explicit command for flushing the cache layer back to
the origin LV?
To be developed...
Post by Richard W.M. Jones
(6) Is the on-disk format stable for future kernel/LVM upgrades?
Well, it's still experimental - so if some major problem is found that
requires the on-disk format to change, that may happen.

Zdenek
Richard W.M. Jones
2014-05-22 15:22:32 UTC
Permalink
Well I'm happy to experiment for you.

At the moment I'm stuck here:

# vgcreate vg_cache /dev/sdc1
Volume group "vg_cache" successfully created
# lvcreate -L 1G -n lv_cache_meta vg_cache
Logical volume "lv_cache_meta" created
# lvcreate -L 229G -n lv_cache vg_cache
Logical volume "lv_cache" created
# lvs
LV            VG        Attr       LSize   [...]
lv_cache      vg_cache  Cwi---C--- 229.00g
lv_cache_meta vg_cache  -wi-a-----   1.00g
testoriginlv  vg_guests -wi-a----- 100.00g

# lvconvert --type cache-pool --poolmetadata /dev/vg_cache/lv_cache_meta /dev/vg_cache/lv_cache
Logical volume "lvol0" created
Converted vg_cache/lv_cache to cache pool.

# lvs
LV            VG        Attr       LSize   [...]
lv_cache      vg_cache  Cwi---C--- 229.00g
testoriginlv  vg_guests -wi-a----- 100.00g

# lvconvert --type cache --cachepool vg_cache/lv_cache vg_guests/testoriginlv
Unable to find cache pool LV, vg_cache/lv_cache
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It seems as if vg_cache/lv_cache is a "cache pool" but for some reason
lvconvert is unable to use it.

The error seems to come from this code:

if (!(cachepool = find_lv(origin->vg, lp->cachepool))) {
log_error("Unable to find cache pool LV, %s", lp->cachepool);
return 0;
}

Is it looking in the wrong VG?

Or do I have to have a single VG for this to work? (That's not made
clear in the documentation, and it seems like a strange restriction).

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines. Supports shell scripting,
bindings from many languages. http://libguestfs.org
Richard W.M. Jones
2014-05-22 15:49:47 UTC
Permalink
It works once I use a single VG.

However the performance is exactly the same as the backing hard disk,
not the SSD. It seems I'm getting no benefit ...

# lvs
[...]
testoriginlv vg_guests Cwi-a-C--- 100.00g lv_cache [testoriginlv_corig]

# mount /dev/vg_guests/testoriginlv /tmp/mnt
# cd /tmp/mnt

# dd if=/dev/zero of=test.file bs=64K count=100000 oflag=direct
100000+0 records in
100000+0 records out
6553600000 bytes (6.6 GB) copied, 57.6301 s, 114 MB/s

# dd if=test.file of=/dev/zero bs=64K iflag=direct
100000+0 records in
100000+0 records out
6553600000 bytes (6.6 GB) copied, 47.6587 s, 138 MB/s

(Exactly the same numbers as when I tested the underlying HDD, and
about half the performance of the SSD.)

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Mike Snitzer
2014-05-22 18:04:05 UTC
Permalink
On Thu, May 22 2014 at 11:49am -0400,
Post by Richard W.M. Jones
It works once I use a single VG.
However the performance is exactly the same as the backing hard disk,
not the SSD. It seems I'm getting no benefit ...
# lvs
[...]
testoriginlv vg_guests Cwi-a-C--- 100.00g lv_cache [testoriginlv_corig]
# mount /dev/vg_guests/testoriginlv /tmp/mnt
# cd /tmp/mnt
# dd if=/dev/zero of=test.file bs=64K count=100000 oflag=direct
100000+0 records in
100000+0 records out
6553600000 bytes (6.6 GB) copied, 57.6301 s, 114 MB/s
# dd if=test.file of=/dev/zero bs=64K iflag=direct
100000+0 records in
100000+0 records out
6553600000 bytes (6.6 GB) copied, 47.6587 s, 138 MB/s
(Exactly the same numbers as when I tested the underlying HDD, and
about half the performance of the SSD.)
By default dm-cache (as is currently upstream) is _not_ going to cache
sequential IO, and it also isn't going to cache IO that is first
written. It waits for hit counts to elevate to the promote threshold.
So dm-cache effectively acts as a hot-spot cache by default.

If you want dm-cache to be more aggressive for initial writes, you can:
1) discard the entire dm-cache device before use (either with mkfs,
blkdiscard, or fstrim)
2) set the dm-cache 'write_promote_adjustment' tunable to 0 with the DM
message interface, e.g.:
dmsetup message <mapped device> 0 write_promote_adjustment 0
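For example, assuming lvm named your cached device vg_guests-testoriginlv
('dmsetup ls --target cache' should list it), that would be:

dmsetup message vg_guests-testoriginlv 0 write_promote_adjustment 0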

Additional documentation is available in the kernel tree:
Documentation/device-mapper/cache.txt
Documentation/device-mapper/cache-policies.txt

Joe Thornber is also working on significant bursty write performance
improvements for dm-cache. Hopefully they'll be ready to go upstream
for the Linux 3.16 merge window.

Mike
Richard W.M. Jones
2014-05-22 18:13:34 UTC
Permalink
Post by Mike Snitzer
By default dm-cache (as is currently upstream) is _not_ going to cache
sequential IO, and it also isn't going to cache IO that is first
written. It waits for hit counts to elevate to the promote threshold.
So dm-cache effectively acts as a hot-spot cache by default.
OK, that makes sense, thanks.

I wrote about using the LVM cache feature here:

https://rwmj.wordpress.com/2014/05/22/using-lvms-new-cache-feature/#content

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
Richard W.M. Jones
2014-05-29 13:52:46 UTC
Permalink
I've done some more testing, comparing RAID 1 HDD with RAID 1 HDD + an
SSD overlay (using lvm-cache).

I'm now using 'fio', with the following job file:

[virt]
ioengine=libaio
iodepth=4
rw=randrw
bs=64k
direct=1
size=1g
numjobs=4

I'm still seeing almost no benefit from LVM cache. It's about 4%
faster than the underlying, slow HDDs. See attached runs.

The SSD LV is 200 GB and the underlying LV is 800 GB, so I would
expect there is plenty of space to cache things in the SSD during the
test.

For comparison, the fio test runs about 11 times faster on the SSD.

Any ideas?

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
Mike Snitzer
2014-05-29 20:34:10 UTC
Permalink
On Thu, May 29 2014 at 9:52am -0400,
Post by Richard W.M. Jones
I've done some more testing, comparing RAID 1 HDD with RAID 1 HDD + an
SSD overlay (using lvm-cache).
[virt]
ioengine=libaio
iodepth=4
rw=randrw
bs=64k
direct=1
size=1g
numjobs=4
randrw isn't giving you increased hits to the same blocks. fio does
have random_distribution controls (zipf and pareto) that are more
favorable for testing cache replacement policies (Jens said that testing
caching algorithms is what motivated him to develop these in fio).
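For example (untested), adding something like

random_distribution=zipf:1.2

to your [virt] job should skew the accesses so that some blocks are hit
repeatedly and can actually be promoted.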
Post by Richard W.M. Jones
I'm still seeing almost no benefit from LVM cache. It's about 4%
faster than the underlying, slow HDDs. See attached runs.
The SSD LV is 200 GB and the underlying LV is 800 GB, so I would
expect there is plenty of space to cache things in the SSD during the
test.
For comparison, the fio test runs about 11 times faster on the SSD.
Any ideas?
Try using :
dmsetup message <cache device> 0 write_promote_adjustment 0

Also, if you discard the entire cache device (e.g. using blkdiscard)
before use you could get a big win, especially if you use:
dmsetup message <cache device> 0 discard_promote_adjustment 0

Documentation/device-mapper/cache-policies.txt says:

Internally the mq policy maintains a promotion threshold variable. If
the hit count of a block not in the cache goes above this threshold it
gets promoted to the cache. The read, write and discard promote adjustment
tunables allow you to tweak the promotion threshold by adding a small
value based on the io type. They default to 4, 8 and 1 respectively.
If you're trying to quickly warm a new cache device you may wish to
reduce these to encourage promotion. Remember to switch them back to
their defaults after the cache fills though.
Richard W.M. Jones
2014-05-29 20:47:20 UTC
Permalink
Post by Mike Snitzer
dmsetup message <cache device> 0 write_promote_adjustment 0
Internally the mq policy maintains a promotion threshold variable. If
the hit count of a block not in the cache goes above this threshold it
gets promoted to the cache. The read, write and discard promote adjustment
tunables allow you to tweak the promotion threshold by adding a small
value based on the io type. They default to 4, 8 and 1 respectively.
If you're trying to quickly warm a new cache device you may wish to
reduce these to encourage promotion. Remember to switch them back to
their defaults after the cache fills though.
What would be bad about leaving write_promote_adjustment set at 0 or 1?

Wouldn't that mean that I get a simple LRU policy? (That's probably
what I want.)
Post by Mike Snitzer
Also, if you discard the entire cache device (e.g. using blkdiscard)
dmsetup message <cache device> 0 discard_promote_adjustment 0
To be clear, that means I should do:

lvcreate -L 1G -n lv_cache_meta vg_guests /dev/fast
lvcreate -L 229G -n lv_cache vg_guests /dev/fast
lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
blkdiscard /dev/vg_guests/lv_cache
lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv

Or should I do the blkdiscard earlier?

[On the separate subject of volume groups ...]

Is there a reason why fast and slow devices need to be in the same VG?

I've talked to two other people who found this very confusing. No one
knew that you could manually place LVs into different PVs, and it's
something of a pain to have to remember to place LVs every time you
create or resize one. It seems it would be a lot simpler if you could
have the slow PVs in one VG and the fast PVs in another VG.

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
Mike Snitzer
2014-05-29 21:06:48 UTC
Permalink
On Thu, May 29 2014 at 4:47pm -0400,
Post by Richard W.M. Jones
Post by Mike Snitzer
dmsetup message <cache device> 0 write_promote_adjustment 0
Internally the mq policy maintains a promotion threshold variable. If
the hit count of a block not in the cache goes above this threshold it
gets promoted to the cache. The read, write and discard promote adjustment
tunables allow you to tweak the promotion threshold by adding a small
value based on the io type. They default to 4, 8 and 1 respectively.
If you're trying to quickly warm a new cache device you may wish to
reduce these to encourage promotion. Remember to switch them back to
their defaults after the cache fills though.
What would be bad about leaving write_promote_adjustment set at 0 or 1?
Wouldn't that mean that I get a simple LRU policy? (That's probably
what I want.)
Leaving them at 0 could result in cache thrashing. But given how large
your SSD is in relation to the origin you'd likely be OK for a while (at
least until your cache gets quite full).
Post by Richard W.M. Jones
Post by Mike Snitzer
Also, if you discard the entire cache device (e.g. using blkdiscard)
dmsetup message <cache device> 0 discard_promote_adjustment 0
lvcreate -L 1G -n lv_cache_meta vg_guests /dev/fast
lvcreate -L 229G -n lv_cache vg_guests /dev/fast
lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
blkdiscard /dev/vg_guests/lv_cache
lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv
Or should I do the blkdiscard earlier?
You want to discard the cached device before you run fio against it.
I'm not completely sure what cache-pool vs cache is. But it looks like
you'd want to run the discard against the /dev/vg_guests/testoriginlv
(assuming it was converted to use the 'cache' DM target, 'dmsetup table
vg_guests-testoriginlv' should confirm as much).
Post by Richard W.M. Jones
[On the separate subject of volume groups ...]
Is there a reason why fast and slow devices need to be in the same VG?
I've talked to two other people who found this very confusing. No one
knew that you could manually place LVs into different PVs, and it's
something of a pain to have to remember to place LVs every time you
create or resize one. It seems it would be a lot simpler if you could
have the slow PVs in one VG and the fast PVs in another VG.
I cannot answer the lvm details. Best to ask Jon Brassow or Zdenek
(hopefully they'll respond)
Richard W.M. Jones
2014-05-29 21:19:55 UTC
Permalink
Post by Mike Snitzer
On Thu, May 29 2014 at 4:47pm -0400,
Post by Richard W.M. Jones
lvcreate -L 1G -n lv_cache_meta vg_guests /dev/fast
lvcreate -L 229G -n lv_cache vg_guests /dev/fast
lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
blkdiscard /dev/vg_guests/lv_cache
lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv
Or should I do the blkdiscard earlier?
You want to discard the cached device before you run fio against it.
I'm not completely sure what cache-pool vs cache is. But it looks like
you'd want to run the discard against the /dev/vg_guests/testoriginlv
(assuming it was converted to use the 'cache' DM target, 'dmsetup table
vg_guests-testoriginlv' should confirm as much).
I'm concerned that would delete all the data on the origin LV ...

My origin LV now has a slightly different name. Here are the
device-mapper tables:

$ sudo dmsetup table
vg_guests-lv_cache_cdata: 0 419430400 linear 8:33 2099200
vg_guests-lv_cache_cmeta: 0 2097152 linear 8:33 2048
vg_guests-home: 0 209715200 linear 9:127 2048
vg_guests-libvirt--images: 0 1677721600 cache 253:1 253:0 253:2 128 0 default 0
vg_guests-libvirt--images_corig: 0 1677721600 linear 9:127 2055211008

So it does look as if my origin LV (vg_guests/libvirt-images) does use
the 'cache' target.

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine. Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
Mike Snitzer
2014-05-29 21:58:15 UTC
Permalink
On Thu, May 29 2014 at 5:19pm -0400,
Post by Richard W.M. Jones
Post by Mike Snitzer
On Thu, May 29 2014 at 4:47pm -0400,
Post by Richard W.M. Jones
lvcreate -L 1G -n lv_cache_meta vg_guests /dev/fast
lvcreate -L 229G -n lv_cache vg_guests /dev/fast
lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
blkdiscard /dev/vg_guests/lv_cache
lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv
Or should I do the blkdiscard earlier?
You want to discard the cached device before you run fio against it.
I'm not completely sure what cache-pool vs cache is. But it looks like
you'd want to run the discard against the /dev/vg_guests/testoriginlv
(assuming it was converted to use the 'cache' DM target, 'dmsetup table
vg_guests-testoriginlv' should confirm as much).
I'm concerned that would delete all the data on the origin LV ...
OK, but how are you testing with fio at this point? Doesn't that
destroy data too?

The cache target doesn't have passdown support. So none of your data
would be discarded directly, but it could eat data as a side-effect of
the cache bypassing promotion from the origin (because it thinks the
origin's blocks were discarded). But on writeback you'd lose data.

So you raise a valid point: if you're adding a cache in front of a
volume with existing data you'll want to avoid discarding the logical
address space that contains data you want to keep.

Do you have a filesystem on the libvirt-images volume? If so, it would be
enough to run fstrim against the filesystem mounted from /dev/vg_guests/libvirt-images.
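(fstrim takes the mountpoint rather than the device node, so something like

fstrim -v /mnt/libvirt-images

assuming that's where it's mounted.)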

BTW, this is all with an eye toward realizing the optimization that
dm-cache provides for origin blocks that were discarded (like I said
before, dm-cache doesn't promote from the origin if the corresponding
block was marked for discard). So you don't _need_ to do any of
this... it's purely about trying to optimize a bit more.
Post by Richard W.M. Jones
My origin LV now has a slightly different name. Here are the
$ sudo dmsetup table
vg_guests-lv_cache_cdata: 0 419430400 linear 8:33 2099200
vg_guests-lv_cache_cmeta: 0 2097152 linear 8:33 2048
vg_guests-home: 0 209715200 linear 9:127 2048
vg_guests-libvirt--images: 0 1677721600 cache 253:1 253:0 253:2 128 0 default 0
vg_guests-libvirt--images_corig: 0 1677721600 linear 9:127 2055211008
So it does look as if my origin LV (vg_guests/libvirt-images) does use
the 'cache' target.
Yeap.
Richard W.M. Jones
2014-05-30 09:04:22 UTC
Permalink
Post by Mike Snitzer
Post by Richard W.M. Jones
I'm concerned that would delete all the data on the origin LV ...
OK, but how are you testing with fio at this point? Doesn't that
destroy data too?
I'm testing with files. This matches my final configuration which is
to use qcow2 files on an ext4 filesystem to store the VM disk images.

I set read_promote_adjustment == write_promote_adjustment == 1 and ran
fio 6 times, reusing the same test files.

It is faster than HDD (slower layer), but still much slower than the
SSD (fast layer). Across the fio runs it's about 5 times slower than
the SSD, and the times don't improve at all over the runs. (It is
more than twice as fast as the HDD though).

Somehow something is not working as I expected.
Post by Mike Snitzer
Post by Richard W.M. Jones
What would be bad about leaving write_promote_adjustment set at 0 or 1?
Wouldn't that mean that I get a simple LRU policy? (That's probably
what I want.)
Leaving them at 0 could result in cache thrashing. But given how
large your SSD is in relation to the origin you'd likely be OK for a
while (at least until your cache gets quite full).
My SSD is ~200 GB and the backing origin LV is ~800 GB. It is
unlikely the working set will ever grow > 200 GB, not least because I
cannot run that many VMs at the same time on the cluster.

So should I be concerned about cache thrashing? Specifically: If the
cache layer gets full, then it will send the least recently used
blocks back to the slow layer, right? (It seems obvious, but I'd like
to check that)

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Richard W.M. Jones
2014-05-30 10:30:31 UTC
Permalink
Post by Richard W.M. Jones
Post by Mike Snitzer
Post by Richard W.M. Jones
I'm concerned that would delete all the data on the origin LV ...
OK, but how are you testing with fio at this point? Doesn't that
destroy data too?
I'm testing with files. This matches my final configuration which is
to use qcow2 files on an ext4 filesystem to store the VM disk images.
I set read_promote_adjustment == write_promote_adjustment == 1 and ran
fio 6 times, reusing the same test files.
It is faster than HDD (slower layer), but still much slower than the
SSD (fast layer). Across the fio runs it's about 5 times slower than
the SSD, and the times don't improve at all over the runs. (It is
more than twice as fast as the HDD though).
Somehow something is not working as I expected.
Additionally, I ran this command 5 times:

md5sum virt.* # the test files

and then reran the fio test. Since I have read_promote_adjustment == 1,
I would expect that these files should be promoted to the fast layer
by reading them several times.

However the results are still the same. It's about twice as fast as
the HDDs, but 5 times slower than with the SSD.

Are there additional diagnostic commands I can use?

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
Mike Snitzer
2014-05-30 13:38:14 UTC
Permalink
On Fri, May 30 2014 at 5:04am -0400,
Post by Richard W.M. Jones
Post by Mike Snitzer
Post by Richard W.M. Jones
I'm concerned that would delete all the data on the origin LV ...
OK, but how are you testing with fio at this point? Doesn't that
destroy data too?
I'm testing with files. This matches my final configuration which is
to use qcow2 files on an ext4 filesystem to store the VM disk images.
I set read_promote_adjustment == write_promote_adjustment == 1 and ran
fio 6 times, reusing the same test files.
It is faster than HDD (slower layer), but still much slower than the
SSD (fast layer). Across the fio runs it's about 5 times slower than
the SSD, and the times don't improve at all over the runs. (It is
more than twice as fast as the HDD though).
Somehow something is not working as I expected.
Why are you setting {read,write}_promote_adjustment to 1? I asked you
to set write_promote_adjustment to 0.

Your random fio job won't hit the same blocks, and md5sum likely uses
buffered IO so unless you set 0 for both the cache won't aggressively
cache like you're expecting.

I explained earlier in this thread that the dm-cache is currently a
"hotspot cache". Not a pure writeback cache like you're hoping. We're
working to make it fit your expectations (you aren't alone in expecting
more performance!)
Post by Richard W.M. Jones
Post by Mike Snitzer
Post by Richard W.M. Jones
What would be bad about leaving write_promote_adjustment set at 0 or 1?
Wouldn't that mean that I get a simple LRU policy? (That's probably
what I want.)
Leaving them at 0 could result in cache thrashing. But given how
large your SSD is in relation to the origin you'd likely be OK for a
while (at least until your cache gets quite full).
My SSD is ~200 GB and the backing origin LV is ~800 GB. It is
unlikely the working set will ever grow > 200 GB, not least because I
cannot run that many VMs at the same time on the cluster.
So should I be concerned about cache thrashing? Specifically: If the
cache layer gets full, then it will send the least recently used
blocks back to the slow layer, right? (It seems obvious, but I'd like
to check that)
Right, you should be fine. But I'll defer to Heinz on more particulars
about the cache replacement strategy that is provided in this case for
the "mq" (aka multi-queue policy).
Richard W.M. Jones
2014-05-30 13:40:20 UTC
Permalink
Post by Mike Snitzer
Why are you setting {read,write}_promote_adjustment to 1? I asked you
to set write_promote_adjustment to 0.
I didn't realize there would be (much) difference. However I
will certainly try it with write_promote_adjustment == 0.
Post by Mike Snitzer
Your random fio job won't hit the same blocks, and md5sum likely uses
buffered IO so unless you set 0 for both the cache won't aggressively
cache like you're expecting.
Right, that was definitely a mistake! I will drop_caches between each
md5sum operation.
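I.e. something along these lines:

for i in 1 2 3 4 5; do md5sum virt.*; echo 3 > /proc/sys/vm/drop_caches; done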

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine. Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
Heinz Mauelshagen
2014-05-30 13:42:23 UTC
Permalink
Post by Mike Snitzer
On Fri, May 30 2014 at 5:04am -0400,
Post by Richard W.M. Jones
Post by Mike Snitzer
Post by Richard W.M. Jones
I'm concerned that would delete all the data on the origin LV ...
OK, but how are you testing with fio at this point? Doesn't that
destroy data too?
I'm testing with files. This matches my final configuration which is
to use qcow2 files on an ext4 filesystem to store the VM disk images.
I set read_promote_adjustment == write_promote_adjustment == 1 and ran
fio 6 times, reusing the same test files.
It is faster than HDD (slower layer), but still much slower than the
SSD (fast layer). Across the fio runs it's about 5 times slower than
the SSD, and the times don't improve at all over the runs. (It is
more than twice as fast as the HDD though).
Somehow something is not working as I expected.
Why are you setting {read,write}_promote_adjustment to 1? I asked you
to set write_promote_adjustment to 0.
Your random fio job won't hit the same blocks, and md5sum likely uses
buffered IO so unless you set 0 for both the cache won't aggressively
cache like you're expecting.
I explained earlier in this thread that the dm-cache is currently a
"hotspot cache". Not a pure writeback cache like you're hoping. We're
working to make it fit your expectations (you aren't alone in expecting
more performance!)
Post by Richard W.M. Jones
Post by Mike Snitzer
Post by Richard W.M. Jones
What would be bad about leaving write_promote_adjustment set at 0 or 1?
Wouldn't that mean that I get a simple LRU policy? (That's probably
what I want.)
Leaving them at 0 could result in cache thrashing. But given how
large your SSD is in relation to the origin you'd likely be OK for a
while (at least until your cache gets quite full).
My SSD is ~200 GB and the backing origin LV is ~800 GB. It is
unlikely the working set will ever grow > 200 GB, not least because I
cannot run that many VMs at the same time on the cluster.
So should I be concerned about cache thrashing? Specifically: If the
cache layer gets full, then it will send the least recently used
blocks back to the slow layer, right? (It seems obvious, but I'd like
to check that)
Right, you should be fine. But I'll defer to Heinz on more particulars
about the cache replacement strategy that is provided in this case for
the "mq" (aka multi-queue policy).
If you ask for immediate promotion, you get immediate promotion - even if
the cache gets overcommitted.
Of course you can tweak the promotion adjustments after warming the cache in
order to reduce any thrashing.

Heinz
Richard W.M. Jones
2014-05-30 13:54:07 UTC
Permalink
I'm attaching 3 tests that I have run so (hopefully) you can see
what I'm observing, or point out if I'm making a mistake.

- virt-ham0-raid1.txt

Test with an ext4 filesystem located in an LV on the RAID 1 (md)
array of 2 x WD NAS hard disks.

- virt-ham0-ssd.txt

Test with an ext4 filesystem located in an LV on the Samsung EVO SSD.

- virt-ham0-lvmcache.txt

Test with LVM-cache.

For all tests, the same virt.job file is used:

[virt]
ioengine=libaio
iodepth=4
rw=randrw
bs=64k
direct=1
size=1g
numjobs=4

All tests are run on the same hardware.

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines. Supports shell scripting,
bindings from many languages. http://libguestfs.org
Zdenek Kabelac
2014-05-30 13:58:00 UTC
Permalink
Post by Richard W.M. Jones
I'm attaching 3 tests that I have run so (hopefully) you can see
what I'm observing, or point out if I'm making a mistake.
I'd ask - is there any difference in the test performance if you use a
ramdisk device for your cache metadata device?
(So _cdata stays on the SSD, and only _cmeta is located on e.g. loop0 backed
by a file on a tmpfs ramdisk.)
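Roughly something like this, before the cache pool is created (device and
path names assumed):

truncate -s 1G /dev/shm/cmeta.img       # file on tmpfs (assumed mounted at /dev/shm)
losetup /dev/loop0 /dev/shm/cmeta.img   # assumes loop0 is free
pvcreate /dev/loop0
vgextend vg_guests /dev/loop0
lvcreate -L 1G -n lv_cache_meta vg_guests /dev/loop0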

Zdenek
Richard W.M. Jones
2014-05-30 13:46:42 UTC
Permalink
I have now set both read_promote_adjustment ==
write_promote_adjustment == 0 and used drop_caches between runs.

I also read Documentation/device-mapper/cache-policies.txt at Heinz's
suggestion.

I'm afraid the performance of the fio test is still not the same as
the SSD (4.8 times slower than the SSD-only test now).

Would repeated runs of (md5sum virt.* ; echo 3 > /proc/sys/vm/drop_caches)
not eventually cause the whole file to be placed on the SSD?
It does seem very counter-intuitive if not.

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine. Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
Heinz Mauelshagen
2014-05-30 13:54:49 UTC
Permalink
Post by Richard W.M. Jones
I have now set both read_promote_adjustment ==
write_promote_adjustment == 0 and used drop_caches between runs.
Did you adjust "sequential_threshold 0" as well?

dm-cache tries to avoid promoting large sequential files to the cache,
because spindles have good bandwidth.

This is again because of the hot spot caching nature of dm-cache.
Post by Richard W.M. Jones
I also read Documentation/device-mapper/cache-policies.txt at Heinz's
suggestion.
I'm afraid the performance of the fio test is still not the same as
the SSD (4.8 times slower than the SSD-only test now).
Would repeated runs of (md5sum virt.* ; echo 3 > /proc/sys/vm/drop_caches)
not eventually cause the whole file to be placed on the SSD?
It does seem very counter-intuitive if not.
Please retry with "sequential_threshold 0"
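(With the message interface that would be something like:

dmsetup message <cache device> 0 sequential_threshold 0)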

Heinz
Post by Richard W.M. Jones
Rich.
Richard W.M. Jones
2014-05-30 14:26:02 UTC
Permalink
Post by Heinz Mauelshagen
Post by Richard W.M. Jones
I have now set both read_promote_adjustment ==
write_promote_adjustment == 0 and used drop_caches between runs.
Did you adjust "sequential_threshold 0" as well?
dm-cache tries to avoid promoting large sequential files to the cache,
because spindles have good bandwidth.
This is again because of the hot spot caching nature of dm-cache.
Setting this had no effect.

I'm starting to wonder if my settings are having any effect at all.

Here are the device-mapper tables:

$ sudo dmsetup table
vg_guests-lv_cache_cdata: 0 419430400 linear 8:33 2099200
vg_guests-lv_cache_cmeta: 0 2097152 linear 8:33 2048
vg_guests-home: 0 209715200 linear 9:127 2048
vg_guests-libvirt--images: 0 1677721600 cache 253:1 253:0 253:2 128 0 default 0
vg_guests-libvirt--images_corig: 0 1677721600 linear 9:127 2055211008

And here is the command I used to set sequential_threshold to 0
(there was no error and no other output):

$ sudo dmsetup message vg_guests-libvirt--images 0 sequential_threshold 0

Is there a way to print the current settings?

Could writethrough be enabled? (I'm supposed to be using writeback).
How do I find out?

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
Mike Snitzer
2014-05-30 14:29:26 UTC
Permalink
On Fri, May 30 2014 at 10:26am -0400,
Post by Richard W.M. Jones
Post by Heinz Mauelshagen
Post by Richard W.M. Jones
I have now set both read_promote_adjustment ==
write_promote_adjustment == 0 and used drop_caches between runs.
Did you adjust "sequential_threshold 0" as well?
dm-cache tries to avoid promoting large sequential files to the cache,
because spindles have good bandwidth.
This is again because of the hot spot caching nature of dm-cache.
Setting this had no effect.
I'm starting to wonder if my settings are having any effect at all.
$ sudo dmsetup table
vg_guests-lv_cache_cdata: 0 419430400 linear 8:33 2099200
vg_guests-lv_cache_cmeta: 0 2097152 linear 8:33 2048
vg_guests-home: 0 209715200 linear 9:127 2048
vg_guests-libvirt--images: 0 1677721600 cache 253:1 253:0 253:2 128 0 default 0
vg_guests-libvirt--images_corig: 0 1677721600 linear 9:127 2055211008
And here is the command I used to set sequential_threshold to 0
$ sudo dmsetup message vg_guests-libvirt--images 0 sequential_threshold 0
sequential_threshold is only going to help the md5sum's IO get promoted
(assuming you're having it read a large file).
Post by Richard W.M. Jones
Is there a way to print the current settings?
Could writethrough be enabled? (I'm supposed to be using writeback).
How do I find out?
dmsetup status vg_guests-libvirt--images

But I'm really wondering if your IO is misaligned (like my earlier email
brought up). It _could_ be promoting 2 64K blocks from the origin for
every 64K IO.
Richard W.M. Jones
2014-05-30 14:36:59 UTC
Permalink
Post by Mike Snitzer
sequential_threshold is only going to help the md5sum's IO get promoted
(assuming you're having it read a large file).
Note the fio test runs on the virt.* files. I'm using md5sum in an
attempt to pull those same files into the SSD.
Post by Mike Snitzer
Post by Richard W.M. Jones
Is there a way to print the current settings?
Could writethrough be enabled? (I'm supposed to be using writeback).
How do I find out?
dmsetup status vg_guests-libvirt--images
Here's dmsetup status on various objects:

$ sudo dmsetup table
vg_guests-lv_cache_cdata: 0 419430400 linear 8:33 2099200
vg_guests-lv_cache_cmeta: 0 2097152 linear 8:33 2048
vg_guests-home: 0 209715200 linear 9:127 2048
vg_guests-libvirt--images: 0 1677721600 cache 253:1 253:0 253:2 128 0 default 0
vg_guests-libvirt--images_corig: 0 1677721600 linear 9:127 2055211008
$ sudo dmsetup status vg_guests-libvirt--images
0 1677721600 cache 8 10162/262144 128 39839/3276800 1087840 821795 116320 2057235 0 39835 0 1 writeback 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 0 discard_promote_adjustment 1 read_promote_adjustment 0 write_promote_adjustment 0
$ sudo dmsetup status vg_guests-lv_cache_cdata
0 419430400 linear
$ sudo dmsetup status vg_guests-lv_cache_cmeta
0 2097152 linear
$ sudo dmsetup status vg_guests-libvirt--images_corig
0 1677721600 linear
Post by Mike Snitzer
But I'm really wondering if your IO is misaligned (like my earlier email
brought up). It _could_ be promoting 2 64K blocks from the origin for
every 64K IO.
There's nothing obviously wrong ...

** For the SSD **

$ sudo fdisk -l /dev/sdc

Disk /dev/sdc: 232.9 GiB, 250059350016 bytes, 488397168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x3e302f2a

Device Boot Start End Blocks Id System
/dev/sdc1 2048 488397167 244197560 8e Linux LVM

The PV is placed directly on /dev/sdc1.

** For the HDD array **

$ sudo fdisk -l /dev/sd{a,b}

Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: B9545B67-681D-4729-A8A0-C75CB2EFFCB1

Device Start End Size Type
/dev/sda1 2048 3907029134 1.8T Linux filesystem


Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: EFA66BD1-E813-4826-88A2-F2BB3C2E093E

Device Start End Size Type
/dev/sdb1 2048 3907029134 1.8T Linux filesystem

$ cat /proc/mdstat
Personalities : [raid1]
md127 : active raid1 sdb1[2] sda1[1]
1953382272 blocks super 1.2 [2/2] [UU]

unused devices: <none>


The PV is placed on /dev/md127.

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
Mike Snitzer
2014-05-30 14:44:54 UTC
Permalink
On Fri, May 30 2014 at 10:36am -0400,
Post by Richard W.M. Jones
Post by Mike Snitzer
sequential_threshold is only going to help the md5sum's IO get promoted
(assuming you're having it read a large file).
Note the fio test runs on the virt.* files. I'm using md5sum in an
attempt to pull those same files into the SSD.
Post by Mike Snitzer
Post by Richard W.M. Jones
Is there a way to print the current settings?
Could writethrough be enabled? (I'm supposed to be using writeback).
How do I find out?
dmsetup status vg_guests-libvirt--images
$ sudo dmsetup table
vg_guests-lv_cache_cdata: 0 419430400 linear 8:33 2099200
vg_guests-lv_cache_cmeta: 0 2097152 linear 8:33 2048
vg_guests-home: 0 209715200 linear 9:127 2048
vg_guests-libvirt--images: 0 1677721600 cache 253:1 253:0 253:2 128 0 default 0
vg_guests-libvirt--images_corig: 0 1677721600 linear 9:127 2055211008
$ sudo dmsetup status vg_guests-libvirt--images
0 1677721600 cache 8 10162/262144 128 39839/3276800 1087840 821795 116320 2057235 0 39835 0 1 writeback 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 0 discard_promote_adjustment 1 read_promote_adjustment 0 write_promote_adjustment 0
$ sudo dmsetup status vg_guests-lv_cache_cdata
0 419430400 linear
$ sudo dmsetup status vg_guests-lv_cache_cmeta
0 2097152 linear
$ sudo dmsetup status vg_guests-libvirt--images_corig
0 1677721600 linear
Post by Mike Snitzer
But I'm really wondering if your IO is misaligned (like my earlier email
brought up). It _could_ be promoting 2 64K blocks from the origin for
every 64K IO.
There's nothing obviously wrong ...
I'm not talking about alignment relative to the physical device's
limits. I'm talking about alignment of ext4's data areas relative to
the 64K block boundaries.

Also a point of concern would be: how fragmented is the ext4 space? It
could be that it cannot get contiguous 64K regions from the namespace.
If that is the case then a lot more IO would get pulled in.

Can you try reducing the cache blocksize to 32K (lowest we support at
the moment, it'll require you to remove the cache and recreate) to see
if performance for this 64K random IO workload improves? If so it does
start to add weight to my alignment concerns.

Mike
Richard W.M. Jones
2014-05-30 14:51:52 UTC
Permalink
Post by Mike Snitzer
I'm not talking about alignment relative to the physical device's
limits. I'm talking about alignment of ext4's data areas relative to
the 64K block boundaries.
Also a point of concern would be: how fragmented is the ext4 space? It
could be that it cannot get contiguous 64K regions from the namespace.
If that is the case then a lot more IO would get pulled in.
I would be surprised if it was fragmented, since it's a recently
created filesystem which has only been used to store a few huge disk
images ...
Post by Mike Snitzer
Can you try reducing the cache blocksize to 32K (lowest we support at
the moment, it'll require you to remove the cache and recreate) to see
if performance for this 64K random IO workload improves? If so it does
start to add weight to my alignment concerns.
... nevertheless what I will do is recreate the origin LV, ext4
filesystem, and change the block size.

What is the command to set the cache blocksize? It doesn't seem to be
covered in the documentation anywhere.

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Mike Snitzer
2014-05-30 14:58:03 UTC
Permalink
On Fri, May 30 2014 at 10:51am -0400,
Post by Richard W.M. Jones
Post by Mike Snitzer
I'm not talking about alignment relative to the physical device's
limits. I'm talking about alignment of ext4's data areas relative to
the 64K block boundaries.
Also a point of conern would be: how fragmented is the ext4 space? It
could be that it cannot get contiguous 64K regions from the namespace.
If that is the case than a lot more IO would get pulled in.
I would be surprised if it was fragmented, since it's a recently
created filesystem which has only been used to store a few huge disk
images ...
Post by Mike Snitzer
Can you try reducing the cache blocksize to 32K (lowest we support at
the moment, it'll require you to remove the cache and recreate) to see
if performance for this 64K random IO workload improves? If so it does
start to add weight to my alignment concerns.
... nevertheless what I will do is recreate the origin LV, ext4
filesystem, and change the block size.
You don't need to recreate the origin LV or FS.
If anything that'd reduce our ability to answer what may be currently
wrong with the setup. I was just suggesting removing the cache and
recreating the cache layer. Not sure how easy it is to do that with
the lvm2 interface. Jon and/or Kabi?
Post by Richard W.M. Jones
What is the command to set the cache blocksize? It doesn't seem to be
covered in the documentation anywhere.
I would think it is lvconvert's --chunksize...
Richard W.M. Jones
2014-05-30 15:28:58 UTC
Permalink
I did in fact recreate the ext4 filesystem, because I didn't read your
email in time.

Here are the commands I used to create the whole lot:

----------------------------------------------------------------------
lvcreate -L 800G -n testorigin vg_guests @slow
mkfs -t ext4 /dev/vg_guests/testorigin
# at this point, I tested the speed of the uncached LV, see below
lvcreate -L 1G -n lv_cache_meta vg_guests @ssd
lvcreate -L 200G -n lv_cache vg_guests @ssd
lvconvert --type cache-pool --chunksize 32k --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testorigin
dmsetup message vg_guests-testorigin 0 sequential_threshold 0
dmsetup message vg_guests-testorigin 0 read_promote_adjustment 0
dmsetup message vg_guests-testorigin 0 write_promote_adjustment 0
# at this point, I tested the speed of the cached LV, see below
----------------------------------------------------------------------

To test the uncached LV, I ran the same fio test twice on the mounted
ext4 filesystem. The results of the second run are in the first
attachment.

To test the cached LV, I ran these commands 3 times in a row:

md5sum virt.*
echo 3 > /proc/sys/vm/drop_caches

then I ran the fio test twice. The results of the second run are
attached.

This time the LVM cache test is about 10% slower than the HDD test.
I'm not sure what to make of that at all.

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
Mike Snitzer
2014-05-30 18:16:48 UTC
Permalink
On Fri, May 30 2014 at 11:28am -0400,
Post by Richard W.M. Jones
I did in fact recreate the ext4 filesystem, because I didn't read your
email in time.
----------------------------------------------------------------------
mkfs -t ext4 /dev/vg_guests/testorigin
# at this point, I tested the speed of the uncached LV, see below
lvconvert --type cache-pool --chunksize 32k --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testorigin
dmsetup message vg_guests-testorigin 0 sequential_threshold 0
dmsetup message vg_guests-testorigin 0 read_promote_adjustment 0
dmsetup message vg_guests-testorigin 0 write_promote_adjustment 0
# at this point, I tested the speed of the cached LV, see below
----------------------------------------------------------------------
To test the uncached LV, I ran the same fio test twice on the mounted
ext4 filesystem. The results of the second run are in the first
attachment.
md5sum virt.*
echo 3 > /proc/sys/vm/drop_caches
then I ran the fio test twice. The results of the second run are
attached.
This time the LVM cache test is about 10% slower than the HDD test.
I'm not sure what to make of that at all.
It could be that the 32k cache blocksize increased the metadata overhead
enough to reduce the performance to that degree.

And even though you recreated the filesystem it still could be the case
that the IO issued from ext4 is slightly misaligned. I'd welcome you
going back to a blocksize of 64K (you don't _need_ to go to 64K but it
seems you're giving up quite a bit of performance now). And then
collecting blktraces of the origin volume for the fio run -- to see if
64K * 2 IOs are being issued for each 64K fio IO. I would think it
would be fairly clear from the blktrace but maybe not.
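Something like this should be enough to capture the origin IO during the fio
run (assuming md127 is the origin PV and the run fits in ~120 seconds):

blktrace -d /dev/md127 -w 120 -o md127_fio   # md127 assumed to be the origin PV
blkparse -i md127_fio | less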

It could be that a targeted debug line in dm-cache would serve as a
better canary for whether misalignment is a concern. I'll see if I can
come up with a patch that helps us assess misalignment.

Joe Thornber will be back from holiday on Monday so we may get some
additional insight from him soon enough.

Sorry for your troubles but this is good feedback.
Mike Snitzer
2014-05-30 20:53:59 UTC
Permalink
On Fri, May 30 2014 at 2:16pm -0400,
Post by Mike Snitzer
On Fri, May 30 2014 at 11:28am -0400,
Post by Richard W.M. Jones
This time the LVM cache test is about 10% slower than the HDD test.
I'm not sure what to make of that at all.
It could be that the 32k cache blocksize increased the metadata overhead
enough to reduce the performance to that degree.
And even though you recreated the filesystem it still could be the case
that the IO issued from ext4 is slightly misaligned. I'd welcome you
going to back to a blocksize of 64K (you don't _need_ to go to 64K but it
seems you're giving up quite a bit of performance now). And then
collecting blktraces of the origin volume for the fio run -- to see if
64K * 2 IOs are being issued for each 64K fio IO. I would think it
would be fairly clear from the blktrace but maybe not.
Thinking about this a little more: if the IO that ext4 is issuing to the
cache is aligned on a blocksize boundary (e.g. 64K) we really shouldn't
see _any_ IO from the origin device when you are running fio. The
reason is we avoid promoting (aka copying) from the origin if an entire
cache block is being overwritten.

Looking at the fio output from the cache run you did using the 32K
blocksize it is very clear that the MD array (on sda and sdb) is
involved quite a lot.

And your even older fio run output when using the original 64K blocksize
shows a bunch of IO to md127...

So it seems fairly clear that dm-cache isn't utilizing the cache block
overwrite optimization it has to avoid promotions from the origin. This
would _seem_ to validate my concern about alignment.. or something else
needs to explain why we're not able to avoid promotions.

If you have time to reconfigure with 64K blocksize and rerun the fio
test, please look at the amount of write IO performed by md127 (and sda
and sdb).. and also look at the number of promotions, via 'dmsetup
status' for the cache device, before and after the fio run.

We can try to reproduce using a pristine ext4 filesystem on top of
MD with the fio job you provided... and I'm now wondering if we're
getting bitten by DM stacked on MD (due to bvec merge being limited to 1
page, see linux.git commit 8cbeb67a for some additional context). So it
may be worth trying _without_ MD raid1 just as a test. Use either sda
or sdb directly as the origin volume.
Mike Snitzer
2014-05-30 13:55:29 UTC
Permalink
On Fri, May 30 2014 at 9:46am -0400,
Post by Richard W.M. Jones
I have now set both read_promote_adjustment ==
write_promote_adjustment == 0 and used drop_caches between runs.
I also read Documentation/device-mapper/cache-policies.txt at Heinz's
suggestion.
I'm afraid the performance of the fio test is still not the same as
the SSD (4.8 times slower than the SSD-only test now).
Obviously not what we want. But you're not doing any repeated IO to
those blocks.. it is purely random right?

So really, the cache is waiting for blocks to get promoted from the
origin if the IOs from fio don't completely cover the cache block size
you've specified.

Can you go back over those settings?
Richard W.M. Jones
2014-05-30 14:29:48 UTC
Permalink
So unless you have misaligned IO you _should_ be able to avoid reading
from the origin. But XFS is in play here.. I'm wondering if it is
The filesystem is ext4.
If you set read_promote_adjustment to 0 it should pull the associated
blocks into the cache. What makes you think it isn't?
The fio test is about twice as fast as when I ran the fio test
directly on the hard disk array. This test runs about 5 times slower
than when I ran it directly on the SSD.

I'm not measuring the speed of the md5sum operation.

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
Mike Snitzer
2014-05-30 14:36:16 UTC
Permalink
On Fri, May 30 2014 at 10:29am -0400,
Post by Richard W.M. Jones
So unless you have misaligned IO you _should_ be able to avoid reading
from the origin. But XFS is in play here.. I'm wondering if it is
The filesystem is ext4.
OK, so I have even more concern about misalignment then. At least XFS
goes to great lengths to build large IOs if Direct IO is used (via
bio_add_page, the optimal io size is used to build the IO up).

I'm not aware of ext4 taking similar steps but it could be it does now
(I vaguely remember ext4 borrowing heavily from XFS at one point,
could've been for direct IO).

We need better tools for assessing whether the IO is misaligned. But
for now we'd have to start by looking at blktrace data for the
underlying origin device. If we keep seeing >64K sequential IOs to the
origin that would speak to dm-cache pulling in 2 64K blocks from the
origin.
Mike Snitzer
2014-05-30 11:53:45 UTC
Permalink
On Thu, May 29 2014 at 5:58pm -0400,
Post by Mike Snitzer
BTW, this is all with an eye toward realizing the optimization that
dm-cache provides for origin blocks that were discarded (like I said
before dm-cache doesn't promote from the origin if the corresponding
block was marked for discard). So you don't _need_ to do any of
this... it's purely about trying to optimize a bit more.
And if you do make use of discards, you should have this stable fix
applied to your kernel:

https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-linus&id=f1daa838e861ae1a0fb7cd9721a21258430fcc8c
Alasdair G Kergon
2014-05-30 11:38:59 UTC
Permalink
Post by Richard W.M. Jones
Is there a reason why fast and slow devices need to be in the same VG?
I've talked to two other people who found this very confusing. No one
knew that you could manually place LVs into different PVs, and it's
something of a pain to have to remember to place LVs every time you
create or resize one. It seems it would be a lot simpler if you could
have the slow PVs in one VG and the fast PVs in another VG.
We recommend you use tags - a much more flexible/dynamic solution than
forcing the use of separate VGs.

pvchange --addtag ssd
pvs -o+tags

lvcreate ... $vg @ssd

to restrict the allocation the command performs to the PVs with the 'ssd' tag.
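For example, in your configuration you could tag both classes of PV and then
place each LV explicitly (device names assumed):

pvchange --addtag slow /dev/md127   # the RAID 1 HDD PV
pvchange --addtag ssd /dev/sdc1     # the SSD PV
lvcreate -L 800G -n testorigin vg_guests @slow
lvcreate -L 200G -n lv_cache vg_guests @ssd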

Alasdair
Alasdair G Kergon
2014-05-30 11:45:18 UTC
Permalink
And for lvextend, you should add any tags you are using in this way to lvm.conf:

# When searching for free space to extend an LV, the "cling"
# allocation policy will choose space on the same PVs as the last
# segment of the existing LV. If there is insufficient space and a
# list of tags is defined here, it will check whether any of them are
# attached to the PVs concerned and then seek to match those PV tags
# between existing extents and new extents.
# Use the special tag "@*" as a wildcard to match any PV tag.

# Example: LVs are mirrored between two sites within a single VG.
# PVs are tagged with either @site1 or @site2 to indicate where
# they are situated.

# cling_tag_list = [ "@site1", "@site2" ]
# cling_tag_list = [ "@*" ]

(The "cling" allocation policy is enabled by default.)

Alasdair
Werner Gold
2014-05-30 12:45:15 UTC
Permalink
Many thanks to Alasdair and Heinz for the hint with the tagging feature.
More convenient than dealing with UUIDs.

I also stumbled across the "same VG" issue when I tried to set up the
test environment. Thanks to Richard for that hint. :-)

I ran bonnie++ on my X230 (RHEL7) here, using an external USB3 SSD.
The results are attached.

With cache, there is a significant difference in random create. That's
what I would expect from an SSD cache.

Werner
--
Werner Gold ***@redhat.com
Partner Enablement / EMEA phone: 49.9331.803 855
Steinbachweg 23 fax: +49.9331.4407
97252 Frickenhausen/Main, Germany cell: +49.172.764 4633
Key fingerprint = FF91B07C 6F3D340E A71791AC 5E3A6CB4 D44CBC37

Reg. Adresse: Red Hat GmbH, Werner-von-Siemens-Ring 14, D-85630 Grasbrunn
Handelsregister: Amtsgericht Muenchen HRB 153243
Geschaeftsfuehrer: Mark Hegarty, Charlie Peters, Michael Cunningham,
Charles Cachera