Discussion:
[linux-lvm] Thin Pool Performance
shankha
2016-04-19 01:05:18 UTC
Hi,
Please allow me to describe our setup.

1) 8 SSDs with RAID5 on top of them. Let us call the RAID device dev_raid5.
2) We create a volume group on dev_raid5.
3) We create a thin pool occupying 100% of the volume group; a rough command sketch is below.
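
Something along these lines (device and LV names are placeholders; the thin
volume size is only illustrative):

  pvcreate /dev/md0                                        # dev_raid5
  vgcreate vg_thin /dev/md0
  lvcreate --type thin-pool -l 100%FREE -n pool0 vg_thin
  lvcreate --type thin -V 1T --thinpool pool0 -n thin0 vg_thin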

We performed some experiments.

Our random write performance dropped by half, and there was a significant
reduction for the other operations (sequential read, sequential write,
random reads) as well, compared to native RAID5.

If you wish I can share the data with you.

We then changed our configuration from one pool to 4 pools and were able to
get back to 80% of the performance (compared to native RAID5).

To us it seems that the LVM metadata operations are the bottleneck.

Do you have any suggestions on how to get back the performance with LVM?

LVM version: 2.02.130(2)-RHEL7 (2015-12-01)
Library version: 1.02.107-RHEL7 (2015-12-01)

Thanks
Zdenek Kabelac
2016-04-19 08:11:25 UTC
Hi


Thanks for playing with thin-pool; however, your report is largely incomplete.

We do not see your actual VG setup.

Please attach 'vgs/lvs' output, i.e.: thin-pool zeroing (if you don't need it,
keep it disabled), chunk size (use bigger chunks if you do not need snapshots),
the number of simultaneously active thin volumes in a single thin-pool (running
hundreds of loaded thin LVs is going to lose the battle on locking), the size
of the thin-pool metadata LV - is this LV located on a separate device (you
should not use RAID5 for metadata)? - and what kind of workload you are running.
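
For example, something like this shows most of those settings (the VG name is
just an example):

  vgs vg_thin
  lvs -a -o +chunksize,zero,metadata_percent,devices vg_thin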

Regards

Zdenek
shankha
2016-04-20 13:34:45 UTC
Hi,
I had just one thin logical volume and was running fio benchmarks. I tried
having the metadata on a RAID0; there was minimal increase in performance.
I had thin-pool zeroing switched on. If I switch off thin-pool zeroing,
initial allocations are faster, but the final numbers are almost the same.
The size of the thin pool metadata LV was 16 GB.
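
For reference, zeroing was switched off and on with commands along these lines
(the VG/pool names are placeholders):

  lvchange --zero n vg_thin/pool0   # stop zeroing newly provisioned chunks
  lvchange --zero y vg_thin/pool0   # re-enable zeroing
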
Thanks
Shankha Banerjee
shankha
2016-04-20 15:55:59 UTC
I am sorry. I forgot to post the workload.

The fio benchmark configuration.

[zipf write]
direct=1
rw=randrw
ioengine=libaio
group_reporting
rwmixread=0
bs=4k
iodepth=32
numjobs=8
runtime=3600
random_distribution=zipf:1.8
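
The job file above does not name a target; presumably it was pointed at the
thin LV with a filename= line such as the following (the path is a
placeholder), and run as 'fio jobfile.fio':

  filename=/dev/vg_thin/thin0
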
Thanks
Shankha Banerjee
shankha
2016-04-20 19:50:31 UTC
Chunk size for lvm was 64K.
Thanks
Shankha Banerjee
Marian Csontos
2016-04-28 10:20:37 UTC
Post by shankha
Chunk size for lvm was 64K.
What's the stripe size?
Does '8 disks in RAID5' mean 7x data + 1x parity?

If so, a 64k chunk cannot be aligned with the RAID5 stripe size, and each
write potentially rewrites 2 stripes - rather painful for random writes, as
writing 4k of data allocates a 64k chunk and that requires 2 stripes -
almost twice the amount of data written compared to pure RAID.
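
A rough feel for the numbers (using the strip size given later in the thread):
with a 512k strip and 7 data disks, a full stripe is 7 x 512k = 3584k. A 4k
random write that lands in an unprovisioned region forces the pool to provision
(and, with zeroing enabled, zero) a whole 64k chunk, i.e. 16x the user data;
and since 64k is far below the full stripe, the RAID5 layer services it as a
read-modify-write (read old data and old parity, write new data and new
parity) rather than a cheap full-stripe write.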

-- Martian
shankha
2016-04-29 15:37:27 UTC
Hi Marian,
I did not specify the strip size for the RAID; by default I assume it is 512K.
8 disks means 7x data + 1x parity.
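
If the array is a Linux md device (say /dev/md0 -- the name here is a guess),
the actual strip size can be read back rather than assumed:

  mdadm --detail /dev/md0            # "Chunk Size" is the per-disk strip
  cat /sys/block/md0/md/chunk_size   # the same value, in bytes
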
Thanks
Shankha Banerjee
Linda A. Walsh
2016-04-26 17:38:18 UTC
Post by shankha
Our random write performance dropped by half, and there was a significant
reduction for the other operations (sequential read, sequential write,
random reads) as well, compared to native RAID5.
----

What is 'native raid 5'? Do you mean using the kernel software
driver for RAID5, or do you mean using a hardware RAID solution like an
LSI card that does the RAID checksumming and writes in the background
(presuming you have 'Write-Back' enabled and the RAID card's RAM is
battery-backed)? To write the data stripe on one data disk, RAID has to
read all of the other data disks in the array in order to compute a
"checksum" (often/usually XOR). The only possible speed benefits of
RAID5 and RAID6 are in reading; writes will be slower than RAID1.
Also, I presume the partitioning, disk brand, and lvm layout on disk
are exactly the same for each disk(?), and I assume these are
enterprise-grade drives (no 'Deskstars', for example, only 'Ultrastars'
if you go w/Hitachi).

The reason for the latter is that desktop drives vary their
spin rate by up to 15-20% (one might be spinning at 7800RPM, another at
6800RPM). With enterprise-grade drives, I've never seen a measurable
difference in spin speed. Also, desktop drives are not guaranteed to be
free of remapped sectors upon initial purchase. In other words, today's
disks reserve some capacity for remapping tracks and sectors. If a read
detects a failure but can still recover the data using the ECC recovery
data, then the drive can virtually move that sector (or track) to
a spare. However, what *that* means is that the disk with the remapped
sector or track has to seek to an "extra space section" on the hard disk
to fetch the data, then seek back to the original location "+1" to read
the next sector.

That means that one drive will take noticeably longer to do the same
read (or write) as the rest.

Most software-based RAID solutions will accept a lot of sloppiness
in disk-speed variation. But as an example -- I once accidentally
received a dozen Hitachi Deskstar (consumer-line) drives instead of the
enterprise-line "Ultrastars". My hardware RAID card (LSI) pretests
basic parameters of each disk inserted. Only 2 out of the 12 disks were
considered to "pass" the self-check -- meaning 10/12, or over 80%, would
show sub-optimal performance compared to enterprise-grade drives. So in
my case I can't even use disks that are too far out of spec, in contrast
to most software drivers, which simply 'wait' for all the data to
arrive -- and that can kill performance even on reads. I've been told
that many of the HW RAID cards know where each disk's head is --
not just by track, but also where in the track it is spinning.

The optimal solution is, of course, the most costly -- using a RAID10
setup where, out of 12 disks, you create 6 RAID1 mirrors and then stripe
those 6 mirrors as a RAID0 (sketched below). However, I *feel* less safe
that way, since with RAID6 I can lose 2 disks and still read and recover
my data, but if I lose 2 disks on RAID10 and they are the same RAID1
pair, then I'm screwed.
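
As a sketch (device names are only placeholders), that "6 mirrors striped
together" layout could be built with md like so, though mdadm can also do
it in one step with --level=10:

  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sde /dev/sdf
  mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdg /dev/sdh
  mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sdi /dev/sdj
  mdadm --create /dev/md6 --level=1 --raid-devices=2 /dev/sdk /dev/sdl
  mdadm --create /dev/md10 --level=0 --raid-devices=6 \
        /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6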

Lvm was designed as a *volume manager* -- it wasn't _designed_ to be
a RAID solution, **though it is increasingly being used as such**.
Downsides -- with RAID5 or 6, even though you can stripe RAID5 sets as
RAID50 and RAID6 sets as RAID60, it is still the case that all of those
I/Os need to be done to compute the correct checksum. At the kernel
SW-driver level, I am pretty sure it is standard to compute the multiple
segments of a RAID50 at the same time using multiple cores (i.e. one
might have 4 drives set up as RAID5, then with 12 disks one can stripe
three such sets, giving fairly fast READ performance). So if you have a
4-core machine, 3 of those cores can be used to compute the XOR of the
3 segments of your RAID50. I have no idea if lvm is capable of using
parallel kernel threads for this, since more of lvm's code is (I
believe) in "user-space". Another consideration: as you go to higher
models of HW RAID cards, they often contain more processors on the
card. My last RAID card had 1 I/O processor, vs. my newer one, which
has 2 I/O CPUs on the card -- and that can really help with write speeds.

Also of significance is whether or not the HW RAID card has its own
cache memory and whether or not it is battery-backed. If it is, then
it can be safe to do 'write-back' processing, where the data first goes
into the card's memory and is written back to disk later on (a much
faster option). If there is no battery backup or UPS, many LSI cards
will automatically switch over to "write-through" -- where the card
writes the data to disk and doesn't return to the user until the write
to disk is complete (slower but safer).

So the fact that RAID5 would be slower for writes under any
circumstance is *normal*. To optimize speed, one needs to make sure the
disks are the same make and model and are "enterprise grade" (I use
7200RPM SATA drives -- you don't need SAS for RAIDs). You also need to
make sure all partitions, lvm parameters and FS parameters are the same
for each disk -- and don't even think of trying to put multiple data
disks of the same meta-partition (combined at the lvm level) on the
same physical disk. That would give horrible performance -- yuck.

Sorry for the long post, but I think I'm buzzing w/too much
caffeine. :-)
-linda