Discussion:
[linux-lvm] cache on SSD makes system unresponsive
Oleg Cherkasov
2017-10-19 17:54:34 UTC
Hi,

I recently decided to try out the LVM cache feature on one of our Dell
NX3100 servers running CentOS 7.4.1708 with a 110 TB disk array (hardware
RAID5 on Dell H710 and H830 adapters). Two 256 GB SSDs are in hardware
RAID1 on the H710 adapter, with primary and extended partitions, so I
decided to create a ~240 GB LVM cache to see whether system I/O would
improve. The server runs the Bareos storage daemon and, besides sshd and
Dell OpenManage monitoring, has no other services. Unfortunately the
testing did not go as I expected; nonetheless, in the end the system is
up and running with no data corrupted.

Initially I tried the default writethrough mode. After running a dd read
test with a 250 GB file, the system became unresponsive for roughly
15 minutes, with cache allocation around 50%. Writes to disk did seem to
speed up, but only marginally, around 10% in my tests. I did manage to
pull more than 32 TB via backups from different hosts, and once the
system became unresponsive to ssh and ICMP requests, though only for a
very short time.
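
For anyone who wants to track figures like the "cache allocation around 50%" above, here is a minimal sketch that reads the occupancy and dirty-block count out of a dm-cache status line, such as the one `dmsetup status` prints for a cached LV. The field order is assumed to follow the kernel's device-mapper cache documentation; the function name and the sample line are made up for illustration and are not taken from this system.

```python
from typing import Dict


def parse_cache_status(line: str) -> Dict[str, float]:
    """Parse the leading fields of a dm-cache status line.

    Assumed field order after the target name "cache":
    metadata block size, used/total metadata blocks, cache block size,
    used/total cache blocks, read hits, read misses, write hits,
    write misses, demotions, promotions, dirty blocks, ...
    """
    tokens = line.split()
    # Skip the optional "name:" prefix and "<start> <length>" up to the target type.
    fields = tokens[tokens.index("cache") + 1:]
    used_meta, total_meta = (int(x) for x in fields[1].split("/"))
    used_blocks, total_blocks = (int(x) for x in fields[3].split("/"))
    return {
        "metadata_pct": 100.0 * used_meta / total_meta,
        "cache_pct": 100.0 * used_blocks / total_blocks,
        "cache_block_sectors": int(fields[2]),
        "dirty_blocks": int(fields[10]),
    }


# Hypothetical status line, invented for this example.
SAMPLE = ("0 1000000 cache 8 500/4096 128 2000/4000 10 20 30 40 0 2000 15 "
          "1 writeback 2 migration_threshold 2048 smq 0 rw -")
status = parse_cache_status(SAMPLE)
```

Sampling this every few seconds during a dd run would show whether the dirty-block count keeps climbing in writeback mode.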

I thought it might be something to do with the cache mode, so I switched
to writeback via lvconvert and ran the dd read test again with the 250 GB
file. This time things went completely wrong. The system became slow to
respond to simple user interactions such as listing files or running
top, and then became completely unresponsive for about half an hour.
Switching to the main console via iLO, I saw a lot of OOM messages; the
kernel was trying to survive by killing almost all processes at random.
Eventually I managed to reboot and immediately uncached the array.

My question is about this very strange behavior of LVM cache. I could
accept no performance boost or even I/O degradation, but I do not
expect the system to run out of memory and the OOM killer to kick in.
The server has only 12 GB of RAM, but it runs only sshd, the Bareos SD
daemon and the Java-based OpenManage monitoring system, and no RAM
problems were noticed over the last few years running without LVM cache.

Any ideas what may be wrong? I have a second NX3200 server with a similar
hardware setup. It is due to be switched to FreeBSD 11.1 with ZFS very
soon, but I may install CentOS 7.4 on it first and see whether the
problem can be reproduced.

LVM2 installed is version lvm2-2.02.171-8.el7.x86_64.


Thank you!
Oleg
Xen
2017-10-19 18:13:15 UTC
Oleg Cherkasov schreef op 19-10-2017 19:54:

> Any ideas what may be wrong?

All I know is that I once tried to cache an embedded encrypted LVM on a
regular home system.

The problem was probably caused by the SSD not clearing write caches
fast enough but I too got some 2 minute "hanging process" outputs on the
console.

So it was probably a queueing issue within the kernel and might not have
been related to the cache,

but I'm still not sure if there wasn't an interplay at work.

The main cause was a way too slow SSD but at the same time... that sorta
thing still shouldn't happen, locking up the entire system.

I haven't had a chance to try again with a faster SSD.

Regards...
Oleg Cherkasov
2017-10-20 10:21:44 UTC
On 19. okt. 2017 20:13, Xen wrote:
>
> The main cause was a way too slow SSD but at the same time... that sorta
> thing still shouldn't happen, locking up the entire system.
>
> I haven't had a chance to try again with a faster SSD.

I have double-checked with the MegaRAID CLI, and all disks on that rig
(including the SSDs, of course) are SAS 6Gb/s, both devices and links. My
first thought was that those SSDs are slower than the RAID5 array,
but that does not seem to be the case.

Could it be a TRIM issue, given that those SSDs are from 2012?
Xen
2017-10-20 10:38:50 UTC
Oleg Cherkasov schreef op 20-10-2017 10:21:
> On 19. okt. 2017 20:13, Xen wrote:
>>
>> The main cause was a way too slow SSD but at the same time... that
>> sorta thing still shouldn't happen, locking up the entire system.
>>
>> I haven't had a chance to try again with a faster SSD.
>
> I have double checked with MegaRAID/CLI and all disks on that rig
> (including SSD ones of course) are SAS 6Gb/s both devices and links.
> My first thought about those SSDs was that those are slower than RAID5
> however it seems not the case.
>
> Could it be TRIMing issue because those are from 2012?

You mean that the SATA version is too low to interleave TRIMs with data
access?

Because I think that was the case with my mSata SSD.

I don't currently remember the sata version that allowed interleaving
but that SSD didn't reach or have it.

After trimming performance would go up greatly.

So I don't know about SAS but it might be similar right.
Oleg Cherkasov
2017-10-20 11:41:41 UTC
On 20. okt. 2017 12:38, Xen wrote:
> Oleg Cherkasov schreef op 20-10-2017 10:21:
>> On 19. okt. 2017 20:13, Xen wrote:
>>
>> Could it be TRIMing issue because those are from 2012?
>
> You mean that the SATA version is too low to interleave TRIMs with data
> access?

I think SSDs from different vendors handle trimming differently,
regardless of SAS or SATA. That trimming is the cause of the problem is
just my hypothesis, of course.
Mike Snitzer
2017-10-19 18:49:16 UTC
On Thu, Oct 19 2017 at 1:54pm -0400,
Oleg Cherkasov <***@member.fsf.org> wrote:

> Any ideas what may be wrong? I have second NX3200 server with
> similar hardware setup and it would be switch to FreeBSD 11.1 with
> ZFS very time soon however I may try to install CentOS 7.4 first and
> see if the problem may be reproduced.
>
> LVM2 installed is version lvm2-2.02.171-8.el7.x86_64.

Your experience is _not_ unique. It is unfortunate, but there would seem
to be some systemic issues with dm-cache being too resource heavy. I'm
not aware of any particular issue(s) yet.

I'm focusing on this now since we've had some internal reports that
writeback is quite slow (and that tests don't complete). That IO
latencies are high. Etc.

I'll work through it and likely enlist Joe Thornber's help next week.

I'll keep you posted as progress is made though.

Thanks,
Mike
Joe Thornber
2017-10-20 11:07:50 UTC
I can't look at this until Sunday. But if it's something that only
exhibits in writeback mode rather than writethrough, then I'd guess it's to
do with the list of writeback work that the policy builds. So check
whether the list is growing endlessly, and check the work object is being
freed once the copy has completed.
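
One way to check whether such work objects are piling up is to watch kernel slab usage while the test runs. The sketch below sums the memory held by slab caches whose name contains a given substring, reading text in the `/proc/slabinfo` 2.x layout. The cache name `dm_cache_migration` in the sample is an assumption for illustration only; the actual slab names on a given kernel should be confirmed by looking at `/proc/slabinfo` (root is usually required).

```python
from typing import Dict


def slab_usage_kib(slabinfo_text: str, needle: str) -> Dict[str, float]:
    """Return {cache name: KiB held} for slab caches whose name contains needle.

    Each data line is expected to look like:
    name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : ...
    """
    usage: Dict[str, float] = {}
    for line in slabinfo_text.splitlines():
        # Skip the version line and the column header.
        if line.startswith("slabinfo") or line.startswith("#"):
            continue
        head = line.split(":", 1)[0].split()
        if len(head) < 4 or needle not in head[0]:
            continue
        name, num_objs, objsize = head[0], int(head[2]), int(head[3])
        usage[name] = num_objs * objsize / 1024.0
    return usage


# Invented sample in the slabinfo 2.1 layout; the dm cache name is hypothetical.
SAMPLE = """slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dm_cache_migration 2048 4096 256 16 1 : tunables 0 0 0 : slabdata 256 256 0
kmalloc-64 100 128 64 64 1 : tunables 0 0 0 : slabdata 2 2 0
"""
dm_usage = slab_usage_kib(SAMPLE, "dm_cache")
```

On a live system the input would be `open("/proc/slabinfo").read()`; a figure that only ever grows during the dd run would support the "list growing endlessly" guess.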

John Stoffel
2017-10-19 19:09:24 UTC
Oleg> Recently I have decided to try out LVM cache feature on one of
Oleg> our Dell NX3100 servers running CentOS 7.4.1708 with 110Tb disk
Oleg> array (hardware RAID5 with H710 and H830 Dell adapters). Two
Oleg> SSD disks each 256Gb are in hardware RAID1 using H710 adapter
Oleg> with primary and extended partitions so I decided to make ~240Gb
Oleg> LVM cache to see if system I/O may be improved. The server is
Oleg> running Bareos storage daemon and beside sshd and Dell
Oleg> OpenManage monitoring does not have any other services.
Oleg> Unfortunately testing went not as I expected nonetheless at the
Oleg> end system is up and running with no data corrupted.

Can you give more details about the system. Is this providing storage
services (NFS) or is it just a backup server?

How did you setup your LVM config and your cache config? Did you
mirror the two SSDs using MD, then add the device into your VG and use
that to setup the lvcache?

I ask because I'm running lvcache at home on my main file/kvm server
and I've never seen this problem. But! I suspect you're running a
much older kernel, lvm config, etc. Please post the full details of
your system if you can.

Oleg> Initially I have tried the default writethrough mode and after
Oleg> running dd reading test with 250Gb file got system unresponsive
Oleg> for roughly 15min with cache allocation around 50%. Writing to
Oleg> disks it seems speed up the system however marginally, so around
Oleg> 10% on my tests and I did manage to pull more than 32Tb via
Oleg> backup from different hosts and once system became unresponsive
Oleg> to ssh and icmp requests however for a very short time.

Can you run 'top' or 'vmstat -admt 10' on the console while you're
running your tests to see what the system does? How does memory look
on this system when you're NOT running lvcache?
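
If the interactive tools stop responding, a small logger that snapshots `/proc/meminfo` to disk may still leave a trail. The following is a minimal sketch of the parsing half only; the function name and the choice of fields are illustrative assumptions, and the sample text is invented.

```python
from typing import Dict, Iterable

DEFAULT_KEYS = ("MemFree", "Buffers", "Cached", "Slab",
                "SUnreclaim", "Dirty", "Writeback")


def parse_meminfo(text: str, keys: Iterable[str] = DEFAULT_KEYS) -> Dict[str, int]:
    """Extract selected fields (in kB) from /proc/meminfo-style text."""
    wanted = set(keys)
    out: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in wanted:
            # The value is the first token after the colon; the unit follows it.
            out[name] = int(rest.split()[0])
    return out


# Invented sample; the numbers are not from the server in this thread.
SAMPLE = """MemTotal:       12122884 kB
MemFree:          185160 kB
Buffers:           56712 kB
Cached:            39104 kB
Dirty:                16 kB
Writeback:             0 kB
Slab:            9000000 kB
SUnreclaim:      8900000 kB
"""
snapshot = parse_meminfo(SAMPLE)
```

Calling `parse_meminfo(open("/proc/meminfo").read())` in a loop with a short sleep, and appending each result to a file, would show whether unreclaimable slab grows while the page cache shrinks.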

Do you have any swap space configured on the system? It might make
sense to allocate 10-20gb of swap space.

Xen
2017-10-19 19:46:20 UTC
John Stoffel schreef op 19-10-2017 21:09:

> How did you setup your LVM config and your cache config? Did you
> mirror the two SSDs using MD

He said he used hardware RAID to mirror the devices.

> I ask because I'm running lvcache at home on my main file/kvm server
> and I've never seen this problem. But! I suspect you're running a
> much older kernel, lvm config, etc.

lvm2-2.02.171-8.el7.x86_64

CentOS 7.4 was released a month ago.
John Stoffel
2017-10-19 21:14:27 UTC
>>>>> "Xen" == Xen <***@xenhideout.nl> writes:

Xen> John Stoffel schreef op 19-10-2017 21:09:
>> How did you setup your LVM config and your cache config? Did you
>> mirror the two SSDs using MD

Xen> He said he used hardware RAID to mirror the devices.

Ok, missed that. But still we need the LVM config info and details on
the system config to address these issues. I suspect he's not running
with any swap configured as well, and something is pushing the system
over the line. But it's hard to know.

Any output from 'dmesg' you can share? The more detailed the better.


>> I ask because I'm running lvcache at home on my main file/kvm server
>> and I've never seen this problem. But! I suspect you're running a
>> much older kernel, lvm config, etc.

Xen> lvm2-2.02.171-8.el7.x86_64

Xen> CentOS 7.4 was released a month ago.

And RHEL7.4/CentOS 7 is all based on kernel 3.14 (I think) with lots
of RedHat specific backports. So knowing the full details will only
help us provide help to him.
Xen
2017-10-20 06:42:41 UTC
John Stoffel schreef op 19-10-2017 23:14:

> And RHEL7.4/CentOS 7 is all based on kernel 3.14 (I think) with lots
> of RedHat specific backports. So knowing the full details will only
> help us provide help to him.

Alright I missed that, sorry.

Still given that a Red Hat developer has stated awareness about the
problem that means that other than the kernel it isn't likely that
individual config is going to play a big role.

Also it is likely that anyone in the position to really help would
already recognise the problems.

I just mean to say that it is going to need a developer and is not very
likely that individual config is at fault.

Although a different kernel would see different behaviour, you're right
about that, my apologies.
Oleg Cherkasov
2017-10-19 21:59:15 UTC
On 19. okt. 2017 21:09, John Stoffel wrote:
>
> Oleg> Recently I have decided to try out LVM cache feature on one of
> Oleg> our Dell NX3100 servers running CentOS 7.4.1708 with 110Tb disk
> Oleg> array (hardware RAID5 with H710 and H830 Dell adapters). Two
> Oleg> SSD disks each 256Gb are in hardware RAID1 using H710 adapter
> Oleg> with primary and extended partitions so I decided to make ~240Gb
> Oleg> LVM cache to see if system I/O may be improved. The server is
> Oleg> running Bareos storage daemon and beside sshd and Dell
> Oleg> OpenManage monitoring does not have any other services.
> Oleg> Unfortunately testing went not as I expected nonetheless at the
> Oleg> end system is up and running with no data corrupted.
>
> Can you give more details about the system. Is this providing storage
> services (NFS) or is it just a backup server?

It is just a backup server: Bareos Storage Daemon + Dell OpenManage for
the LSI RAID cards (Dell's H7XX and H8XX are LSI based). That host
deliberately does not share any files or resources, for security reasons,
so no NFS or SMB.

The server has 2x 256 GB SSDs and 10x 3 TB drives. In addition there are
two MD1200 disk arrays attached, with 12x 4 TB disks each. All disks are
exposed to CentOS as virtual disks, so there are 4 disks in total:

NAME                                  MAJ:MIN RM    SIZE RO TYPE MOUNTPOINT
sda                                   8:0      0  278.9G  0 disk
├─sda1                                8:1      0    500M  0 part /boot
├─sda2                                8:2      0   36.1G  0 part
│ ├─centos-swap                       253:0    0   11.7G  0 lvm  [SWAP]
│ └─centos-root                       253:1    0   24.4G  0 lvm
├─sda3                                8:3      0      1K  0 part
└─sda5                                8:5      0  242.3G  0 part
sdb                                   8:16     0     30T  0 disk
└─primary_backup_vg-primary_backup_lv 253:5    0  110.1T  0 lvm
sdc                                   8:32     0     40T  0 disk
└─primary_backup_vg-primary_backup_lv 253:5    0  110.1T  0 lvm
sdd                                   8:48     0     40T  0 disk
└─primary_backup_vg-primary_backup_lv 253:5    0  110.1T  0 lvm

RAM 12Gb, swap around 12Gb as well. /dev/sda is a hardware RAID1, the
rest are RAID5.

I created the cache and cache_meta LVs on /dev/sda5. It had long been the
partition for the Bareos spool; after upgrading to a 10GBASE network I
no longer need the spool, so I decided to try LVM cache.

> How did you setup your LVM config and your cache config? Did you
> mirror the two SSDs using MD, then add the device into your VG and use
> that to setup the lvcache?

All configs are stock CentOS 7.4 at the moment (incrementally upgraded
from 7.0, of course); I have not customized or tried to optimize the
config.

> I ask because I'm running lvcache at home on my main file/kvm server
> and I've never seen this problem. But! I suspect you're running a
> much older kernel, lvm config, etc. Please post the full details of
> your system if you can.

3.10.0-693.2.2.el7.x86_64

CentOS 7.4, as Xen pointed out, was released about a month ago. I
updated about a week ago during planned network maintenance, so I had a
good excuse to reboot.

> Oleg> Initially I have tried the default writethrough mode and after
> Oleg> running dd reading test with 250Gb file got system unresponsive
> Oleg> for roughly 15min with cache allocation around 50%. Writing to
> Oleg> disks it seems speed up the system however marginally, so around
> Oleg> 10% on my tests and I did manage to pull more than 32Tb via
> Oleg> backup from different hosts and once system became unresponsive
> Oleg> to ssh and icmp requests however for a very short time.
>
> Can you run 'top' or 'vmstat -admt 10' on the console while you're
> running your tests to see what the system does? How does memory look
> on this system when you're NOT runnig lvcache?

Well, it is a production system and I am not planning to cache it again
for testing. However, if any patches become available I will try to run a
similar test on the spare box before converting it to FreeBSD with ZFS.

Nonetheless, I did run top during the dd read test, and within the
first few minutes I did not notice any issues with RAM. The system was
using less than 2 GB of the 12 GB, and the rest was held as cache/buffers.
After a few minutes the system became unresponsive, even dropping ICMP
ping requests; the ssh session froze and then dropped after a timeout,
so there was no way to check the top measurements.

I have recovered some of the SAR records. SAR did not manage to log
anything for the last 20 minutes, from 2:40pm to 3:00pm, before the
system was rebooted and came back online at 3:10pm:

User stat:
02:00:01 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
02:10:01 PM  all   0.22   0.00     0.08     0.05    0.00  99.64
02:20:35 PM  all   0.21   0.00     5.23    20.58    0.00  73.98
02:30:51 PM  all   0.23   0.00     0.43    31.06    0.00  68.27
02:40:02 PM  all   0.06   0.00     0.15    18.55    0.00  81.24
Average:     all   0.19   0.00     1.54    17.67    0.00  80.61

I/O stat:
02:00:01 PM      tps     rtps    wtps    bread/s    bwrtn/s
02:10:01 PM     5.27     3.19    2.08     109.29     195.38
02:20:35 PM  4404.80  3841.22  563.58  971542.00  140195.66
02:30:51 PM  1110.49   586.67  523.83  148206.31  131721.52
02:40:02 PM   510.72   211.29  299.43   51321.12   76246.81
Average:     1566.86  1214.43  352.43  306453.67   88356.03

DMs:
02:00:01 PM  DEV           tps   rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz   await  svctm  %util
Average:     dev8-0     370.04     853.43  88355.91    241.08     85.32  230.56   1.61  59.54
Average:     dev8-16      0.02       0.14      0.02      8.18      0.00    3.71   3.71   0.01
Average:     dev8-32   1196.77  305599.78      0.04    255.35      4.26    3.56   0.09  11.28
Average:     dev8-48      0.02       0.35      0.06     18.72      0.00   17.77  17.77   0.04
Average:     dev253-0   151.59     118.15   1094.56      8.00     13.60   89.71   2.07  31.36
Average:     dev253-1    15.01     722.81     53.73     51.73      3.08  204.85  28.35  42.56
Average:     dev253-2  1259.48  218411.68      0.07    173.41      0.21    0.16   0.08   9.98
Average:     dev253-3   681.29       1.27  87189.52    127.98    163.02  239.29   0.84  57.12
Average:     dev253-4     3.83      11.09     18.09      7.61      0.09   22.59  10.72   4.11
Average:     dev253-5  1940.54  305599.86      0.07    157.48      8.47    4.36   0.06  11.24

dev253:2 is the cache or actually was ...
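
To make the per-device figures easier to read, the sketch below converts the average rd_sec/s values to MiB/s, assuming sar's usual 512-byte sectors. Only numbers already shown in the DMs table are used; the helper name is made up for illustration. Going by the MAJ:MIN columns of the disk listing earlier in this message, dev8-32 is sdc and dev253-5 is primary_backup_lv.

```python
SECTOR_BYTES = 512  # assumption: sar reports rd_sec/s in 512-byte sectors


def sectors_to_mib_per_s(sectors_per_s: float) -> float:
    """Convert a sectors-per-second rate to MiB per second."""
    return sectors_per_s * SECTOR_BYTES / (1024 * 1024)


# Average rd_sec/s values copied from the DMs table.
avg_rd_sectors = {
    "dev8-32": 305599.78,   # sdc
    "dev253-2": 218411.68,  # the cache device
    "dev253-5": 305599.86,  # primary_backup_lv
}
avg_rd_mib = {dev: round(sectors_to_mib_per_s(v), 1)
              for dev, v in avg_rd_sectors.items()}
```

Under that assumption the averages come to roughly 149 MiB/s read from sdc and from the LV, and roughly 107 MiB/s read from the cache device.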

Queue stat:
02:00:01 PM  runq-sz  plist-sz  ldavg-1  ldavg-5  ldavg-15  blocked
02:10:01 PM        1       302     0.09     0.05      0.05        0
02:20:35 PM        0       568     6.87     9.72      5.28        3
02:30:51 PM        1       569     5.46     6.83      5.83        2
02:40:02 PM        0       568     0.18     2.41      4.26        1
Average:           0       502     3.15     4.75      3.85        2

RAM stat:
02:00:01 PM  kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit  kbactive  kbinact  kbdirty
02:10:01 PM     256304   11866580     97.89      66860   9181100   2709288    11.10   5603576  5066808       32
02:20:35 PM     185160   11937724     98.47      56712     39104   2725476    11.17    299256   292604       16
02:30:51 PM     175220   11947664     98.55      56712     29640   2730732    11.19    113912   113552       24
02:40:02 PM   11195028     927856      7.65      57504     62416   2696248    11.05    119488   164076       16
Average:       2952928    9169956     75.64      59447   2328065   2715436    11.12   1534058  1409260       22

SWAP stat:
02:00:01 PM  kbswpfree  kbswpused  %swpused  kbswpcad  %swpcad
02:10:01 PM   12010984     277012      2.25     71828    25.93
02:20:35 PM   11048040    1239956     10.09     88696     7.15
02:30:51 PM   10723456    1564540     12.73     38272     2.45
02:40:02 PM   10716884    1571112     12.79     77928     4.96
Average:      11124841    1163155      9.47     69181     5.95
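
A rough reading of the RAM stat table, as a worked calculation: subtracting buffers and page cache from kbmemused gives the memory held by everything else. This assumes kbmemused here is simply total minus free, which the table supports (kbmemfree + kbmemused is 12122884 kB in every row). The helper name is made up; the numbers are copied from the table.

```python
def non_cache_kb(kbmemused: int, kbbuffers: int, kbcached: int) -> int:
    """Memory in use that is neither buffers nor page cache, in kB."""
    return kbmemused - kbbuffers - kbcached


# (kbmemused, kbbuffers, kbcached) per sample, copied from the RAM stat table.
samples = {
    "02:10:01 PM": (11866580, 66860, 9181100),
    "02:20:35 PM": (11937724, 56712, 39104),
    "02:30:51 PM": (11947664, 56712, 29640),
    "02:40:02 PM": (927856, 57504, 62416),
}
non_cache = {t: non_cache_kb(*v) for t, v in samples.items()}
non_cache_gib = {t: round(kb / (1024 * 1024), 1) for t, kb in non_cache.items()}
```

By this measure non-cache usage goes from about 2.5 GiB at 02:10 to about 11.3 GiB at 02:20 and 02:30, while kbcached collapses from about 9 GB to under 40 MB. SAR does not say what held that memory; kernel slab is one candidate, but that is a guess the table alone cannot confirm.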



Cheers,
Oleg
John Stoffel
2017-10-20 19:35:00 UTC
>>>>> "Oleg" == Oleg Cherkasov <***@member.fsf.org> writes:

Oleg> On 19. okt. 2017 21:09, John Stoffel wrote:
>>
Oleg> Recently I have decided to try out LVM cache feature on one of
Oleg> our Dell NX3100 servers running CentOS 7.4.1708 with 110Tb disk
Oleg> array (hardware RAID5 with H710 and H830 Dell adapters). Two
Oleg> SSD disks each 256Gb are in hardware RAID1 using H710 adapter
Oleg> with primary and extended partitions so I decided to make ~240Gb
Oleg> LVM cache to see if system I/O may be improved. The server is
Oleg> running Bareos storage daemon and beside sshd and Dell
Oleg> OpenManage monitoring does not have any other services.
Oleg> Unfortunately testing went not as I expected nonetheless at the
Oleg> end system is up and running with no data corrupted.
>>
>> Can you give more details about the system. Is this providing storage
>> services (NFS) or is it just a backup server?

Oleg> It is just a backup server, Bareos Storage Daemon + Dell
Oleg> OpenManage for LSI RAID cards (Dell's H7XX and H8XX are LSI
Oleg> based). That host deliberately do no share any files or
Oleg> resources for security reasons, so no NFS or SMB.

Well... if it's a backup server, then I suspect that using caching
won't help much because you're mostly doing streaming writes, with
very few reads. The Cache is designed to help the *read* case more.
And for a backup server, you're writing one or just a couple of
streams at once, which is a fairly ideal state for RAID5.

Oleg> Server has 2x SSD drives by 256Gb each and 10x 3Tb drives. In
Oleg> addition there are two MD1200 disk arrays attached with 12x 4Tb
Oleg> disks each. All disks exposed to CentOS as Virtual so there are
Oleg> 4 disks in total:

Oleg> RAM 12Gb, swap around 12Gb as well. /dev/sda is a hardware RAID1, the
Oleg> rest are RAID5.

Interesting, it's all hardware RAID devices from what I can see.

Oleg> I did make a cache and cache_meta on /dev/sda5. It used to be a
Oleg> partition for Bareos spool for quite some time and because after
Oleg> upgrading to 10GbBASE network I do not need that spooler any
Oleg> more so I decided to try LVM cache.

Can you show the *exact* commands you used to make the cache? Are
you using lvcache, or bcache? They're two totally different beasts.
I looked into bcache in the past, but since you can't remove it from
an LV, I decided not to use it. I use lvcache like this:

> sudo lvs data
  LV         VG   Attr       LSize   Pool           Origin        Data%  Meta%  Move Log Cpy%Sync Convert
  home       data Cwi-aoC--- 650.00g home_cache     [home_corig]
  home_cache data Cwi---C--- 130.00g
  local      data Cwi-aoC--- 335.00g [localcacheLV] [local_corig]

so I'm wondering exactly which caching setup you're using.

>> How did you setup your LVM config and your cache config? Did you
>> mirror the two SSDs using MD, then add the device into your VG and use
>> that to setup the lvcache?

Oleg> All configs are stock CentOS 7.4 at the moment (incrementally upgraded
Oleg> from 7.0 of course), so I did not customize or tried to make any
Oleg> optimization on config.

Ok, good to know.

>> I ask because I'm running lvcache at home on my main file/kvm server
>> and I've never seen this problem. But! I suspect you're running a
>> much older kernel, lvm config, etc. Please post the full details of
>> your system if you can.
Oleg> 3.10.0-693.2.2.el7.x86_64

Oleg> CentOS 7.4, as been pointed by Xen, released about a month ago
Oleg> and I had updated about a week ago while doing planned
Oleg> maintenance on network so had a good excuse to reboot it.

Oleg> Initially I have tried the default writethrough mode and after
Oleg> running dd reading test with 250Gb file got system unresponsive
Oleg> for roughly 15min with cache allocation around 50%. Writing to
Oleg> disks it seems speed up the system however marginally, so around
Oleg> 10% on my tests and I did manage to pull more than 32Tb via
Oleg> backup from different hosts and once system became unresponsive
Oleg> to ssh and icmp requests however for a very short time.

This isn't good. Can you post more details about your LV setup please?

>> Can you run 'top' or 'vmstat -admt 10' on the console while you're
>> running your tests to see what the system does? How does memory look
>> on this system when you're NOT runnig lvcache?

Oleg> Well, it is a production system and I am not planning to cache
Oleg> it again for test however if any patches would be available then
Oleg> try to run a similar system test on spare box before converting
Oleg> it to FreeBSD with ZFS.

How was the performance before your caching tests? Are you looking
for better compression of your backups? I've used bacula (which
Bareos is based on) for years, but recently gave up because the
restores sucked to do. Sorry for the side note. :-)

Oleg> Nonetheless I tried to run top during the dd reading test
Oleg> however with in first few minutes I did not notice any issues
Oleg> with RAM. System was using less then 2Gb of 12GB and the rest
Oleg> are wired (cache/buffers). After few minutes system became
Oleg> unresponsive even dropping ICMP ping requests and ssh session
Oleg> frozen and then dropped after time out, so no way to check top
Oleg> measurements.

Any messages from the console?

Oleg> I have recovered some of SAR records and I may see the last 20 minutes
Oleg> SAR did not manage to log anything from 2:40pm to 3:00pm before system
Oleg> got rebooted and back online at 3:10pm:

Oleg> User stat:

That looks ok to me... nothing obvious there at all.

Oleg> I/O stat:


Are you writing to a spool disk, before you then write the data into
bacula's backup system?


Oleg> DMs:


That's really bursty traffic...


Oleg> dev253:2 is the cache or actually was ...

Oleg> Queue stat:
Oleg> 02:00:01 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
Oleg> 02:10:01 PM 1 302 0.09 0.05 0.05 0
Oleg> 02:20:35 PM 0 568 6.87 9.72 5.28 3
Oleg> 02:30:51 PM 1 569 5.46 6.83 5.83 2
Oleg> 02:40:02 PM 0 568 0.18 2.41 4.26 1
Oleg> Average: 0 502 3.15 4.75 3.85 2

Oleg> RAM stat:
Oleg> 02:00:01 PM  kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit  kbactive  kbinact  kbdirty
Oleg> 02:10:01 PM     256304   11866580     97.89      66860   9181100   2709288    11.10   5603576  5066808       32
Oleg> 02:20:35 PM     185160   11937724     98.47      56712     39104   2725476    11.17    299256   292604       16
Oleg> 02:30:51 PM     175220   11947664     98.55      56712     29640   2730732    11.19    113912   113552       24
Oleg> 02:40:02 PM   11195028     927856      7.65      57504     62416   2696248    11.05    119488   164076       16
Oleg> Average:       2952928    9169956     75.64      59447   2328065   2715436    11.12   1534058  1409260       22

Oleg> SWAP stat:
Oleg> 02:00:01 PM kbswpfree kbswpused %swpused kbswpcad %swpcad
Oleg> 02:10:01 PM 12010984 277012 2.25 71828 25.93
Oleg> 02:20:35 PM 11048040 1239956 10.09 88696 7.15
Oleg> 02:30:51 PM 10723456 1564540 12.73 38272 2.45
Oleg> 02:40:02 PM 10716884 1571112 12.79 77928 4.96
Oleg> Average: 11124841 1163155 9.47 69181 5.95

I think you're running into a RedHat bug at this point. I'd probably
move to Debian and run my own kernel with the latest patches for MD, etc.

You might even be running into problems with your HW RAID controllers
and how Linux talks to them.

Any chance you could post more details?

Good luck!
John
Mike Snitzer
2017-10-21 03:05:57 UTC
On Fri, Oct 20 2017 at 3:35pm -0400,
John Stoffel <***@stoffel.org> wrote:

> >>>>> "Oleg" == Oleg Cherkasov <***@member.fsf.org> writes:
>
> I think you're running into a RedHat bug at this point. I'd probably
> move to Debian and run my own kernel with the latest patches for MD, etc.

There is no reason to think this is a "RedHat bug"; the verdict is very
much still out. Yes, the kernel core is very different in RHEL7 than in
upstream Linux, but we have no details to suggest _where_ the issue
lies. If it is a pathological dm-cache code issue, then RHEL7.4 and
upstream Linux should both see the problem.

Moving distros is a waste of time given that RHEL7.4 and CentOS 7.4 have
the latest dm-cache code. The issue is likely dm-cache specific (not
RHEL7.4 specific).

In general: RHEL7 or CentOS 7 will provide the best support for dm-cache.
All developers invested in dm-cache work for Red Hat.

Mike
Oleg Cherkasov
2017-10-21 14:33:07 UTC
On 20 Oct. 2017 21:35, John Stoffel wrote:
>>>>>> "Oleg" == Oleg Cherkasov <***@member.fsf.org> writes:
>
> Oleg> On 19. okt. 2017 21:09, John Stoffel wrote:
>>>
>
> Oleg> RAM 12Gb, swap around 12Gb as well. /dev/sda is a hardware RAID1, the
> Oleg> rest are RAID5.
>
> Interesting, it's all hardware RAID devices from what I can see.

It is exactly what I wrote initially in my first message!

>
> Can you should the *exact* commands you used to make the cache? Are
> you using lvcache, or bcache? they're two totally different beasts.
> I looked into bcache in the past, but since you can't remove it from
> an LV, I decided not to use it. I use lvcache like this:

I used lvcache, of course. Here are the commands from bash history:

lvcreate -L 1G -n primary_backup_lv_cache_meta primary_backup_vg /dev/sda5

### Allocate ~247G on /dev/sda5, which is what is left of the VG
lvcreate -l 100%FREE -n primary_backup_lv_cache primary_backup_vg /dev/sda5

lvconvert --type cache-pool --cachemode writethrough \
    --poolmetadata primary_backup_vg/primary_backup_lv_cache_meta \
    primary_backup_vg/primary_backup_lv_cache

lvconvert --type cache \
    --cachepool primary_backup_vg/primary_backup_lv_cache \
    primary_backup_vg/primary_backup_lv

### lvconvert failed because it needed some extra extents in the VG, so I had
### to reduce the cache LV and try again:

lvreduce -L 200M primary_backup_vg/primary_backup_lv_cache

### so this time it worked ok:

lvconvert --type cache-pool --cachemode writethrough \
    --poolmetadata primary_backup_vg/primary_backup_lv_cache_meta \
    primary_backup_vg/primary_backup_lv_cache
lvconvert --type cache \
    --cachepool primary_backup_vg/primary_backup_lv_cache \
    primary_backup_vg/primary_backup_lv

### The exact output of `lvs -a -o +devices` is gone, of course, because I have
### since uncached the LV. However, it looked as in the docs, so it did not
### raise any suspicions.
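
For reference, the sequence above can be condensed into a dry-run sketch. It only prints the commands rather than running them; the `run` helper and the shell variable names are made up for this example and are not part of LVM. One assumption is worth flagging: as written above, `lvreduce -L 200M` asks for an absolute size of 200M, whereas a leading minus sign (`-L -200M`) shrinks the LV by that amount, which is presumably what was intended.

```shell
#!/bin/sh
# Dry-run sketch of the cache setup described above: commands are printed, not executed.
# VG, ORIGIN, SSD and the run() helper are illustrative names for this example.
VG=primary_backup_vg
ORIGIN=primary_backup_lv
SSD=/dev/sda5

run() { echo "+ $*"; }   # replace the body with "$@" to really execute (needs root)

cache_setup() {
    # Metadata and data LVs for the cache pool, both placed on the SSD partition.
    run lvcreate -L 1G -n "${ORIGIN}_cache_meta" "$VG" "$SSD"
    run lvcreate -l 100%FREE -n "${ORIGIN}_cache" "$VG" "$SSD"
    # Leave a few free extents for lvconvert; the minus sign makes the size relative.
    run lvreduce -L -200M "$VG/${ORIGIN}_cache"
    # Combine data + metadata into a cache pool, then attach it to the origin LV.
    run lvconvert --type cache-pool --cachemode writethrough \
        --poolmetadata "$VG/${ORIGIN}_cache_meta" "$VG/${ORIGIN}_cache"
    run lvconvert --type cache --cachepool "$VG/${ORIGIN}_cache" "$VG/$ORIGIN"
}

cache_setup
```

Detaching the cache again is a single step, `lvconvert --uncache primary_backup_vg/primary_backup_lv`.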

> How was the performance before your caching tests? Are you looking
> for better compression of your backups? I've used bacula (which
> Bareos is based on) for years, but recently gave up because the
> restores sucked to do. Sorry for the side note. :-)

Performance was good, with no complaints given the aging hardware. However,
having a spare SSD, I wanted to test whether it would improve anything, and I
did not expect a trivial dd to bring the whole system to its knees.

> Any messages from the console?

Unfortunately there is nothing in the logs. As I wrote before, I saw a lot of
OOM messages on the console, with the kernel on a killing spree.

> Oleg> User stat:
> Oleg> 02:00:01 PM CPU %user %nice %system %iowait %steal
> Oleg> %idle
> Oleg> 02:10:01 PM all 0.22 0.00 0.08 0.05 0.00
> Oleg> 99.64
> Oleg> 02:20:35 PM all 0.21 0.00 5.23 20.58 0.00
> Oleg> 73.98
> Oleg> 02:30:51 PM all 0.23 0.00 0.43 31.06 0.00
> Oleg> 68.27
> Oleg> 02:40:02 PM all 0.06 0.00 0.15 18.55 0.00
> Oleg> 81.24
> Oleg> Average: all 0.19 0.00 1.54 17.67 0.00
> Oleg> 80.61
>
> That looks ok to me... nothing obvious there at all.

Same here ...

> Are you writing to a spool disk, before you then write the data into
> bacula's backup system?

Well, the Bareos SD was down at that time for testing, so it was just:

dd if=some_250G_file of=/dev/null status=progress

Basically, that was the first command after allocating the LV cache.

>
> I think you're running into a RedHat bug at this point. I'd probably
> move to Debian and run my own kernel with the latest patches for MD, etc.

I have to stay with CentOS, and moving to Debian would not necessarily
solve the problem.

>
> You might even be running into problems with your HW RAID controllers
> and how Linux talks to them.
>
> Any chance you could post more details?

The HW RAID controllers are PERC H710 and H810. Posting the extremely verbose
MegaCli output would not help, I guess. Firmware is up to date according
to the BIOS Maintenance monitor.
Zdenek Kabelac
2017-10-23 10:58:09 UTC
On 21.10.2017 at 16:33, Oleg Cherkasov wrote:
> On 20. okt. 2017 21:35, John Stoffel wrote:
>>>>>>> "Oleg" == Oleg Cherkasov <***@member.fsf.org> writes:
>>
>> Oleg> On 19. okt. 2017 21:09, John Stoffel wrote:
>>>>
>>
>> Oleg> RAM 12Gb, swap around 12Gb as well.  /dev/sda is a hardware RAID1, the
>> Oleg> rest are RAID5.
>>
>> Interesting, it's all hardware RAID devices from what I can see.
>
> It is exactly what I wrote initially in my first message!
>
>>
>> Can you should the *exact* commands you used to make the cache?  Are
>> you using lvcache, or bcache?  they're two totally different beasts.
>> I looked into bcache in the past, but since you can't remove it from
>> an LV, I decided not to use it.  I use lvcache like this:
>
> I have used lvcache of course and here are commands from bash history:
>
> lvcreate -L 1G -n primary_backup_lv_cache_meta primary_backup_vg /dev/sda5
>
> ### Allocate ~247G ib /dev/sda5 what has left of VG
> lvcreate -l 100%FREE -n primary_backup_lv_cache primary_backup_vg /dev/sda5
>
> lvconvert --type cache-pool --cachemode writethrough --poolmetadata
> primary_backup_vg/primary_backup_lv_cache_meta
> primary_backup_vg/primary_backup_lv_cache
>
> lvconvert --type cache --cachepool primary_backup_vg/primary_backup_lv_cache
> primary_backup_vg/primary_backup_lv
>
> ### lvconvert failed because required some extra extends in VG so I had to
> reduce cache LV and try again:
>
> lvreduce -L 200M primary_backup_vg/primary_backup_lv_cache
>


Hi

Without meaning to interrupt the discussion here - the explanation is
very simple.

A cache pool is made from a 'data' LV and a 'metadata' LV - so both need some
space. For a cache pool it is a pretty good plan to have both of them on fast
storage (SSD).

So can you please provide output of:

lvs -a -o+devices

so it can be easily validated that both the _cdata & _cmeta LVs are hosted by
an SSD device (this is not shown anywhere in the thread - so just to be sure
we have them on the right disks).
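
That check can also be scripted. The sketch below reads `lv_name devices` pairs and reports any `_cdata`/`_cmeta` sub-LV that does not sit on the expected partition. The function name `check_cache_devices` is made up for this example, and the two sample lines are hypothetical, not output from the system in this thread.

```shell
#!/bin/sh
# Sketch: flag cache-pool sub-LVs (_cdata/_cmeta) that are not on the expected SSD.
# Input: "lv_name devices" pairs, one per line, as printed by
#   lvs --noheadings -a -o lv_name,devices primary_backup_vg
check_cache_devices() {
    awk -v ssd="$1" '
        # Hidden sub-LVs are shown in brackets, e.g. [name_cdata].
        $1 ~ /_c(data|meta)\]?$/ && index($2, ssd) != 1 {
            print $1 " is on " $2 ", not " ssd
            bad = 1
        }
        END { exit bad }
    '
}

# Hypothetical sample input, for illustration only (not output from this system):
check_cache_devices /dev/sda5 <<'EOF' || echo "cache pool is not fully on the SSD"
  [primary_backup_lv_cache_cdata] /dev/sda5(256)
  [primary_backup_lv_cache_cmeta] /dev/sdb(0)
EOF
```

The match is a simple prefix comparison on the device column, so it is only a first sanity check; the full `lvs -a -o+devices` output is still what should be posted.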

Regards

Zdenek
Mike Snitzer
2017-10-21 02:55:00 UTC
On Thu, Oct 19 2017 at 5:59pm -0400,
Oleg Cherkasov <***@member.fsf.org> wrote:

> On 19. okt. 2017 21:09, John Stoffel wrote:
> >
> > Oleg> Recently I have decided to try out LVM cache feature on one of
> > Oleg> our Dell NX3100 servers running CentOS 7.4.1708 with 110Tb disk
> > Oleg> array (hardware RAID5 with H710 and H830 Dell adapters). Two
> > Oleg> SSD disks each 256Gb are in hardware RAID1 using H710 adapter
> > Oleg> with primary and extended partitions so I decided to make ~240Gb
> > Oleg> LVM cache to see if system I/O may be improved. The server is
> > Oleg> running Bareos storage daemon and beside sshd and Dell
> > Oleg> OpenManage monitoring does not have any other services.
> > Oleg> Unfortunately testing went not as I expected nonetheless at the
> > Oleg> end system is up and running with no data corrupted.
> >
> > Can you give more details about the system. Is this providing storage
> > services (NFS) or is it just a backup server?
>
> It is just a backup server, Bareos Storage Daemon + Dell OpenManage
> for LSI RAID cards (Dell's H7XX and H8XX are LSI based). That host
> deliberately do no share any files or resources for security
> reasons, so no NFS or SMB.
>
> Server has 2x SSD drives by 256Gb each and 10x 3Tb drives. In
> addition there are two MD1200 disk arrays attached with 12x 4Tb
> disks each. All disks exposed to CentOS as Virtual so there are 4
> disks in total:
>
> NAME MAJ:MIN RM SIZE RO TYPE
> sda 8:0 0 278.9G 0 disk
> ├─sda1 8:1 0 500M 0 part /boot
> ├─sda2 8:2 0 36.1G 0 part
> │ ├─centos-swap 253:0 0 11.7G 0 lvm [SWAP]
> │ └─centos-root 253:1 0 24.4G 0 lvm
> ├─sda3 8:3 0 1K 0 part
> └─sda5 8:5 0 242.3G 0 part
> sdb 8:16 0 30T 0 disk
> └─primary_backup_vg-primary_backup_lv 253:5 0 110.1T 0 lvm
> sdc 8:32 0 40T 0 disk
> └─primary_backup_vg-primary_backup_lv 253:5 0 110.1T 0 lvm
> sdd 8:48 0 40T 0 disk
> └─primary_backup_vg-primary_backup_lv 253:5 0 110.1T 0 lvm
>
> RAM 12Gb, swap around 12Gb as well. /dev/sda is a hardware RAID1,
> the rest are RAID5.
>
> I did make a cache and cache_meta on /dev/sda5. It used to be a
> partition for Bareos spool for quite some time and because after
> upgrading to 10GbBASE network I do not need that spooler any more so
> I decided to try LVM cache.
>
> > How did you setup your LVM config and your cache config? Did you
> > mirror the two SSDs using MD, then add the device into your VG and use
> > that to setup the lvcache?
> All configs are stock CentOS 7.4 at the moment (incrementally
> upgraded from 7.0 of course), so I did not customize or tried to
> make any optimization on config.
> > I ask because I'm running lvcache at home on my main file/kvm server
> > and I've never seen this problem. But! I suspect you're running a
> > much older kernel, lvm config, etc. Please post the full details of
> > your system if you can.
> 3.10.0-693.2.2.el7.x86_64
>
> CentOS 7.4, as been pointed by Xen, released about a month ago and I
> had updated about a week ago while doing planned maintenance on
> network so had a good excuse to reboot it.
>
> > Oleg> Initially I have tried the default writethrough mode and after
> > Oleg> running dd reading test with 250Gb file got system unresponsive
> > Oleg> for roughly 15min with cache allocation around 50%. Writing to
> > Oleg> disks it seems speed up the system however marginally, so around
> > Oleg> 10% on my tests and I did manage to pull more than 32Tb via
> > Oleg> backup from different hosts and once system became unresponsive
> > Oleg> to ssh and icmp requests however for a very short time.
> >
> > Can you run 'top' or 'vmstat -admt 10' on the console while you're
> > running your tests to see what the system does? How does memory look
> > on this system when you're NOT runnig lvcache?
>
> Well, it is a production system and I am not planning to cache it
> again for test however if any patches would be available then try to
> run a similar system test on spare box before converting it to
> FreeBSD with ZFS.
>
> Nonetheless I tried to run top during the dd reading test however
> with in first few minutes I did not notice any issues with RAM.
> System was using less then 2Gb of 12GB and the rest are wired
> (cache/buffers). After few minutes system became unresponsive even
> dropping ICMP ping requests and ssh session frozen and then dropped
> after time out, so no way to check top measurements.
>
> I have recovered some of SAR records and I may see the last 20
> minutes SAR did not manage to log anything from 2:40pm to 3:00pm
> before system got rebooted and back online at 3:10pm:
>
> User stat:
> 02:00:01 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
> 02:10:01 PM  all   0.22   0.00     0.08     0.05    0.00  99.64
> 02:20:35 PM  all   0.21   0.00     5.23    20.58    0.00  73.98
> 02:30:51 PM  all   0.23   0.00     0.43    31.06    0.00  68.27
> 02:40:02 PM  all   0.06   0.00     0.15    18.55    0.00  81.24
> Average:     all   0.19   0.00     1.54    17.67    0.00  80.61
>
> I/O stat:
> 02:00:01 PM tps rtps wtps bread/s bwrtn/s
> 02:10:01 PM 5.27 3.19 2.08 109.29 195.38
> 02:20:35 PM 4404.80 3841.22 563.58 971542.00 140195.66
> 02:30:51 PM 1110.49 586.67 523.83 148206.31 131721.52
> 02:40:02 PM 510.72 211.29 299.43 51321.12 76246.81
> Average: 1566.86 1214.43 352.43 306453.67 88356.03
>
> DMs:
> 02:00:01 PM  DEV           tps   rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> Average:     dev8-0     370.04     853.43  88355.91    241.08     85.32  230.56   1.61  59.54
> Average:     dev8-16      0.02       0.14      0.02      8.18      0.00    3.71   3.71   0.01
> Average:     dev8-32   1196.77  305599.78      0.04    255.35      4.26    3.56   0.09  11.28
> Average:     dev8-48      0.02       0.35      0.06     18.72      0.00   17.77  17.77   0.04
> Average:     dev253-0   151.59     118.15   1094.56      8.00     13.60   89.71   2.07  31.36
> Average:     dev253-1    15.01     722.81     53.73     51.73      3.08  204.85  28.35  42.56
> Average:     dev253-2  1259.48  218411.68      0.07    173.41      0.21    0.16   0.08   9.98
> Average:     dev253-3   681.29       1.27  87189.52    127.98    163.02  239.29   0.84  57.12
> Average:     dev253-4     3.83      11.09     18.09      7.61      0.09   22.59  10.72   4.11
> Average:     dev253-5  1940.54  305599.86      0.07    157.48      8.47    4.36   0.06  11.24
>
> dev253:2 is the cache or actually was ...
>
> Queue stat:
> 02:00:01 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
> 02:10:01 PM 1 302 0.09 0.05 0.05 0
> 02:20:35 PM 0 568 6.87 9.72 5.28 3
> 02:30:51 PM 1 569 5.46 6.83 5.83 2
> 02:40:02 PM 0 568 0.18 2.41 4.26 1
> Average: 0 502 3.15 4.75 3.85 2
>
> RAM stat:
> 02:00:01 PM  kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit  kbactive  kbinact  kbdirty
> 02:10:01 PM     256304   11866580     97.89      66860   9181100   2709288    11.10   5603576  5066808       32
> 02:20:35 PM     185160   11937724     98.47      56712     39104   2725476    11.17    299256   292604       16
> 02:30:51 PM     175220   11947664     98.55      56712     29640   2730732    11.19    113912   113552       24
> 02:40:02 PM   11195028     927856      7.65      57504     62416   2696248    11.05    119488   164076       16
> Average:       2952928    9169956     75.64      59447   2328065   2715436    11.12   1534058  1409260       22
>
> SWAP stat:
> 02:00:01 PM kbswpfree kbswpused %swpused kbswpcad %swpcad
> 02:10:01 PM 12010984 277012 2.25 71828 25.93
> 02:20:35 PM 11048040 1239956 10.09 88696 7.15
> 02:30:51 PM 10723456 1564540 12.73 38272 2.45
> 02:40:02 PM 10716884 1571112 12.79 77928 4.96
> Average: 11124841 1163155 9.47 69181 5.95

So aside from the SAR output, you don't have any system logs? Or a vmcore
of the system (assuming it crashed)? In a vmcore you could access the
kernel log (via the 'log' command in the crash utility).

More specifics on the workload would be useful. Also, more details on
the LVM cache configuration (block size? writethrough or writeback?
etc).

I'll be looking very closely for any sign of memory leaks (both with
code inspection and testing while kmemleak is enabled).

But the more info you can provide on the workload the better.
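
One quick way to read the RAM stat table with memory leaks in mind is to subtract buffers and page cache from used memory. The sketch below does that arithmetic on the four samples. It assumes that sysstat's kbmemused includes buffers and cache (kbmemfree + kbmemused is the same 12122884 kB in every row, which is consistent with that), and `non_cache_kb` is a name made up for this example.

```shell
#!/bin/sh
# Sketch: memory that is neither buffers nor page cache, per SAR sample.
# Input columns: time, kbmemused, kbbuffers, kbcached
# (values copied from the RAM stat table quoted above).
non_cache_kb() {
    awk '{ printf "%s %d\n", $1, $2 - $3 - $4 }'
}

non_cache_kb <<'EOF'
02:10:01 11866580 66860 9181100
02:20:35 11937724 56712 39104
02:30:51 11947664 56712 29640
02:40:02 927856 57504 62416
EOF
# Prints:
# 02:10:01 2618620
# 02:20:35 11841908
# 02:30:51 11861312
# 02:40:02 807936
```

Under that assumption, memory outside buffers and page cache grows from roughly 2.5 GB at 02:10 to roughly 11.3 GB at 02:20 while kbcached collapses, which may be worth comparing against kernel (slab) usage during a reproduction.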

Thanks,
Mike

p.s. RHEL7.4 has all of upstream's dm-cache code.
p.p.s.:
I've implemented parallel submission of write IO for writethrough mode.
It needs further testing and review, but so far it seems to be working.
I have yet to see a huge improvement in writethrough mode throughput, but
overall IO latencies on writes may be improved (at least closer to that
of the slow device in the cache). I haven't looked at latency yet (will
test further with fio on Monday).
Oleg Cherkasov
2017-10-21 14:10:36 UTC
On 21 Oct. 2017 04:55, Mike Snitzer wrote:
> On Thu, Oct 19 2017 at 5:59pm -0400,
> Oleg Cherkasov <***@member.fsf.org> wrote:
>
>> On 19. okt. 2017 21:09, John Stoffel wrote:
>>>
>
> So aside from SAR outout: you don't have any system logs? Or a vmcore
> of the system (assuming it crashed?) -- in it you could access the
> kernel log (via 'log' command in crash utility.

Unfortunately, no logs. I tried to recover dmesg, but had no luck: all
logs except the dmesg from the latest boot are zeroed. Of course there are
messages, secure and others, but I do not see any valuable information
there.

The system did not crash. The OOM killer was running wild, but I did manage
to send Ctrl-Alt-Del from the main console via iLO, so eventually it
rebooted with a clean disk unmount.

>
> More specifics on the workload would be useful. Also, more details on
> the LVM cache configuration (block size? writethrough or writeback?
> etc).

No extra parameters apart from specifying writethrough mode initially. The
hardware RAID1 on the cache disk uses 64k and the hardware RAID5 on the main
array uses 128k.

I followed the documentation on the RHEL doc site precisely: lvcreate, then
lvconvert to change the type, and then lvconvert to attach the cache.

Afterwards I decided to try writeback and switched the cache mode to it with
lvcache.

>
> I'll be looking very closely for any sign of memory leaks (both with
> code inspection and testing while kemmleak is enabled).
>
> But the more info you can provide on the workload the better.

SAR has no records for about 20 minutes before I rebooted, so I suspect the
SAR daemon fell victim to the OOM killer.
John Stoffel
2017-10-23 20:45:56 UTC
>>>>> "Oleg" == Oleg Cherkasov <***@member.fsf.org> writes:

Oleg> On 21. okt. 2017 04:55, Mike Snitzer wrote:
>> On Thu, Oct 19 2017 at 5:59pm -0400,
>> Oleg Cherkasov <***@member.fsf.org> wrote:
>>
>>> On 19. okt. 2017 21:09, John Stoffel wrote:
>>>>
>>
>> So aside from SAR outout: you don't have any system logs? Or a vmcore
>> of the system (assuming it crashed?) -- in it you could access the
>> kernel log (via 'log' command in crash utility.

Oleg> Unfortunately no logs. I have tried to see if I may recover dmesg
Oleg> however no luck. All logs but the latest dmesg boot are zeroed. Of
Oleg> course there are messages, secure and others however I do not see any
Oleg> valuable information there.

Oleg> System did not crash, OOM were going wind however I did manage to
Oleg> Ctrl-Alt-Del from the main console via iLO so eventually it rebooted
Oleg> with clean disk umount.

Bummer. Maybe you can set up a syslog server to send verbose kernel
logs elsewhere, including the OOM messages?

>>
>> More specifics on the workload would be useful. Also, more details on
>> the LVM cache configuration (block size? writethrough or writeback?
>> etc).

Oleg> No extra params but specifying mode writethrough initially.
Oleg> Hardware RAID1 on cache disk is 64k and on main array hardware
Oleg> RAID5 128k.

Oleg> I had followed precisely documentation from RHEL doc site so lvcreate,
Oleg> lvconvert to update type and then lvconvert to add cache.

Oleg> I have decided to try writeback after and shifted cachemode to it with
Oleg> lvcache.

>> I'll be looking very closely for any sign of memory leaks (both with
>> code inspection and testing while kemmleak is enabled).
>>
>> But the more info you can provide on the workload the better.

Oleg> According to SAR there are no records about 20min before I reboot, so I
Oleg> suspect SAR daemon failed a victim of OOM.

Maybe if you could take a snapshot of all the processes on the system
before you run the test, and then also run 'vmstat 1' to a log file
while running the test?
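
A minimal sketch of that kind of logging follows. It samples a few /proc/meminfo fields per interval and writes one line per sample. The function name `sample_meminfo`, the choice of fields and the log file names are assumptions made for this example; `vmstat 1` itself is shown only as a comment.

```shell
#!/bin/sh
# Sketch: log selected /proc/meminfo fields once per interval so the numbers
# survive even if interactive tools get killed. Function name is illustrative.
# Usage: sample_meminfo [file] [count] [interval_seconds]
sample_meminfo() {
    src="${1:-/proc/meminfo}"
    count="${2:-5}"
    interval="${3:-1}"
    i=0
    while [ "$i" -lt "$count" ]; do
        # One line per sample: timestamp followed by name=value pairs.
        awk -v ts="$(date +%T)" '
            BEGIN { printf "%s", ts }
            /^(MemFree|Cached|Dirty|Slab):/ { sub(/:$/, "", $1); printf " %s=%skB", $1, $2 }
            END { print "" }
        ' "$src"
        i=$((i + 1))
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}

# Before the test: snapshot the process list, then start logging in the background.
#   ps aux > ps-before.txt
#   sample_meminfo /proc/meminfo 3600 1 >> meminfo.log &
#   vmstat 1 >> vmstat.log &
```

Writing the log to a disk that is not behind the cached LV, or to another host, would make it more likely to survive a hang.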

As a weird thought... maybe it's the 1GB metadata LV that's causing
problems? Maybe you just need to accept the default size?

It might also be instructive to make the cache just half the size of the
SSD and see if that helps. It *might* be, as other people have
mentioned, that your SSD's performance drops off a cliff when it's
mostly full. So reducing the cache size, even to only 80% of the size
of the disk, might give it enough spare empty blocks to stay
performant?
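
As a rough worked example of those sizes: the helper below is a made-up name (`cache_size_gib`), and it takes the 242.3G size of /dev/sda5 from the lsblk output quoted earlier in the thread, rounded down to 242.

```shell
#!/bin/sh
# Sketch: size a cache LV as a percentage of the SSD partition, in whole GiB.
# cache_size_gib is an illustrative helper; integer division rounds down.
cache_size_gib() {
    total_gib="$1"
    percent="$2"
    echo $(( total_gib * percent / 100 ))
}

# /dev/sda5 is 242.3G in the lsblk output earlier in the thread; use 242 here.
echo "80%: $(cache_size_gib 242 80)G"
echo "50%: $(cache_size_gib 242 50)G"
# The result could then be passed to lvcreate, e.g.:
#   lvcreate -L 193G -n primary_backup_lv_cache primary_backup_vg /dev/sda5
```

This prints 193G for 80% and 121G for 50%.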

John
matthew patton
2017-10-20 00:12:16 UTC
> It is just a backup server,

Then caching is pointless. Furthermore any half-wit caching solution can detect streaming read/write and will deliberately bypass the cache. Furthermore DD has never been a useful benchmark for anything. And if you're not using 'odirect' it's even more pointless.

> Server has 2x SSD drives by 256Gb each

and for purposes of 'cache' should be individual VD and not waste capacity on RAID1. Your controller's battery-backed RAM is for write-back purposes if you want to play that game. Cache is disposable. You can yank the power cord out of the drive and the software will continue. Now if you were TIERing, that's a different topic and depends on the implementation whether or not you can lose a device. The good ones make sure the SSD can disappear and nothing bad happens.

> 10x 3Tb drives.  In addition there are two
> MD1200 disk arrays attached with 12x 4Tb disks each.  All

Raid5 for this size footprint is NUTs. Raid6 is the bare minimum.
Xen
2017-10-20 06:46:02 UTC
matthew patton wrote on 20-10-2017 2:12:
>> It is just a backup server,
>
> Then caching is pointless.

That's irrelevant and not up to another person to decide.

> Furthermore any half-wit caching solution
> can detect streaming read/write and will deliberately bypass the
> cache.

The problem was not performance, it was stability.

> Furthermore DD has never been a useful benchmark for anything.
> And if you're not using 'odirect' it's even more pointless.

Performance was not the issue, stability was.

>> Server has 2x SSD drives by 256Gb each
>
> and for purposes of 'cache' should be individual VD and not waste
> capacity on RAID1.

Is probably also going to be quite irrelevant to the problem at hand.

>> 10x 3Tb drives.  In addition there are two
>> MD1200 disk arrays attached with 12x 4Tb disks each.  All
>
> Raid5 for this size footprint is NUTs. Raid6 is the bare minimum.

That's also irrelevant to the problem at hand.
Oleg Cherkasov
2017-10-20 09:59:01 UTC
On 20 Oct. 2017 08:46, Xen wrote:
> matthew patton schreef op 20-10-2017 2:12:
>>> It is just a backup server,
>>
>> Then caching is pointless.
>
> That's irrelevant and not up to another person to decide.
>
>> Furthermore any half-wit caching solution
>> can detect streaming read/write and will deliberately bypass the
>> cache.
>
> The problem was not performance, it was stability.
>
>> Furthermore DD has never been a useful benchmark for anything.
>> And if you're not using 'odirect' it's even more pointless.
>
> Performance was not the issue, stability was.
>
>>> Server has 2x SSD drives by 256Gb each
>>
>> and for purposes of 'cache' should be individual VD and not waste
>> capacity on RAID1.
>
> Is probably also going to be quite irrelevant to the problem at hand.
>
>>> 10x 3Tb drives.  In addition  there are two
>>> MD1200 disk arrays attached with 12x 4Tb disks each.  All
>>
>> Raid5 for this size footprint is NUTs. Raid6 is the bare minimum.
>
> That's also irrelevant to the problem at hand.

Hi Matthew,

I mostly agree with Xen about the stability vs usability issue. I have a
stable system and an available SSD partition with 240Gb unused, so I decided
to run tests with LVM caching using different cache modes. The _test_
results are in my posts, and they show that LVM caching does have stability
issues regardless of how I set it up.

I do agree that I would need to make a separate hardware virtual disk for
the cache and most likely not mirror it. However, the performance of a
system is defined by its weakest point, so it may indeed be the slow SSD.
I would expect performance degradation because of that, but not a whole
system lockup, denial of all services and a reboot to follow.

Your assumptions about the streaming operations of _just a backup server_
are not quite right. The Bareos Director configuration on a separate server
pushes this Storage Daemon to run multiple backups in parallel, and possibly
restores at the same time. Therefore, even though there are just a few
streams going in and out, the RAID is really doing random read and write
operations.

dd is definitely not a good way to test any caching system, I do
agree. However, it is the first thing to try to see any good/bad/ugly
results before running other tests like bonnie++. In my case, the very
next command after 'lvconvert' to attach the cache and 'pvs' to check the
status was 'dd if=some_250G_file of=/dev/null bs=8M status=progress', and
that was the moment everything went completely unexpectedly, ending with an
unplanned reboot.

About RAID5 vs RAID6: as I mentioned in a separate message, there is a
logical volume built of 3 hardware RAID5 virtual disks, so it is not
30+ disks in one RAID5 or something. Besides, that server is a
front-end to an LTO-6 library, so even if the unexpected happens it would
take 3-4 days to rebuild it from the client hosts anyway. And I have a few
disks in stock, so replacing one and rebuilding a RAID5 takes no more than
12 hours. RAID5 vs RAID6 is a matter of operational efficiency: watch the
system logs with Graylog2 and Dell OpenManage/MegaRAID, have spare disks,
and do everything on time.


Cheers,
Oleg
Oleg Cherkasov
2017-10-19 10:05:45 UTC
Hi,

Recently I decided to try out the LVM cache feature on one of our Dell
NX3100 servers running CentOS 7.4.1708 with a 110Tb disk array (hardware
RAID5 with H710 and H830 Dell adapters). Two 256Gb SSD disks are
in hardware RAID1 on the H710 adapter, with primary and extended
partitions, so I decided to make a ~240Gb LVM cache to see if system I/O
could be improved. The server runs the Bareos storage daemon and, besides
sshd and Dell OpenManage monitoring, has no other services.
Unfortunately testing did not go as I expected; nonetheless, in the end
the system is up and running with no data corrupted.

Initially I tried the default writethrough mode, and after running a
dd read test with a 250Gb file the system became unresponsive for roughly
15min with cache allocation around 50%. Writing to the disks did seem to
speed up, though only marginally, by around 10% in my tests, and I did
manage to pull more than 32Tb via backup from different hosts. Once during
that time the system became unresponsive to ssh and icmp requests, but only
for a very short time.

I thought it might be something to do with the cache mode, so I switched to
writeback via lvconvert and ran the dd read test again with the 250Gb file.
That time everything went completely unexpectedly. The system started
responding slowly to simple user interactions like listing files and running
top, and then became completely unresponsive for about half an hour.
Switching to the main console via iLO, I saw a lot of OOM messages; the
kernel was trying to survive and so killed almost all processes seemingly
at random. Eventually I did manage to reboot and immediately uncached the
array.

My question is about this very strange behavior of LVM cache. I might
expect no performance boost, or even I/O degradation, but I do not
expect to run out of memory and have the OOM killer kick in. That server
has only 12Gb RAM, but it runs only sshd, the Bareos SD daemon and the
OpenManage Java-based monitoring system, so no RAM problems were noticed
over the last few years of running without LVM cache.

Any ideas what may be wrong? I have a second NX3200 server with a similar
hardware setup, and it will be switched to FreeBSD 11.1 with ZFS very
soon; however, I may try to install CentOS 7.4 on it first and see if the
problem can be reproduced.

LVM2 installed is version lvm2-2.02.171-8.el7.x86_64.


Thank you!

Oleg
lejeczek
2017-10-20 16:20:49 UTC
On 19/10/17 18:54, Oleg Cherkasov wrote:
> Hi,
>
> Recently I have decided to try out LVM cache feature on
> one of our Dell NX3100 servers running CentOS 7.4.1708
> with 110Tb disk array (hardware RAID5 with H710 and H830
> Dell adapters).  Two SSD disks each 256Gb are in hardware
> RAID1 using H710 adapter with primary and extended
> partitions so I decided to make ~240Gb LVM cache to see if
> system I/O may be improved.  The server is running Bareos
> storage daemon and beside sshd and Dell OpenManage
> monitoring does not have any other services. Unfortunately
> testing went not as I expected nonetheless at the end
> system is up and running with no data corrupted.
>
> Initially I have tried the default writethrough mode and
> after running dd reading test with 250Gb file got system
> unresponsive for roughly 15min with cache allocation
> around 50%.  Writing to disks it seems speed up the system
> however marginally, so around 10% on my tests and I did
> manage to pull more than 32Tb via backup from different
> hosts and once system became unresponsive to ssh and icmp
> requests however for a very short time.
>
> I though it may be something with cache mode so switched
> to writeback via lvconvert and run dd reading test again
> with 250Gb file however that time everything went
> completely unexpected. System started to slow responding
> for simple user interactions like list files and run top.
> And then became completely unresponsive for about half an
> hours.  Switching to main console via iLO I saw a lot of
> OOM messages and kernel tried to survive therefore
> randomly killed almost all processes.  Eventually I did
> manage to reboot and immediately uncached the array.
>
> My question is about very strange behavior of LVM cache. 
> Well, I may expect no performance boost or even I/O
> degradation however I do not expect run out of memory and
> than OOM kicks in.  That server has only 12Gb RAM however
> it does run only sshd, bareos SD daemon and OpenManange
> java based monitoring system so no RAM problems were
> notices for last few years running with our LVM cache.
>
> Any ideas what may be wrong?  I have second NX3200 server
> with similar hardware setup and it would be switch to
> FreeBSD 11.1 with ZFS very time soon however I may try to
> install CentOS 7.4 first and see if the problem may be
> reproduced.
>
> LVM2 installed is version lvm2-2.02.171-8.el7.x86_64.
>
>
> Thank you!
> Oleg
>
hi

Not much of an explanation or insight into what might be going wrong
with your setup/system; instead I will share my own
conclusions/suggestions, based on bits of my experience...

I would - if the bigger part of a storage subsystem resides in
the hardware - stick to the hardware, use CacheCade, and let the
hardware do the lot.

With LVM - similarly, stick to LVM and let LVM manage the whole
lot (you will lose ~50% of a single average core (Opteron
6376) with raid5). Use the simplest HBAs (Dell has such), no
raid, not even JBOD. If the disks are in the same enclosure, or
simply under the same HBA (even though it's just an HBA) - do
*not* mix SATA & SAS (it may work, but better not, from my
experience).

Last one: keep that freaking firmware updated everywhere
possible, disks too. (My latest experience is with Seagate 2TB
SAS disks, over a hundred of them in two enclosures: I cannot update
them, the update does not work, and Seagate's tech support off the
website is useless, so stay away from Seagate.)

I'll keep my fingers crossed for you. As for luck - there is never too
much of it.

Xen
2017-10-20 16:48:31 UTC
lejeczek schreef op 20-10-2017 16:20:

> I would - if bigger part of a storage subsystem resides in the
> hardware - stick to the hardware, use CacheCade, let the hardware do
> the lot.

In other words -- keep it simple (smart person) ;-).

Complicatedness is really the biggest reason for failure everywhere....
Bernd Eckenfels
2017-10-20 17:02:05 UTC
Not sure if this is on-topic, but there is a reason for software solutions. The days of super proprietary and faulty hardware are over. You put software like LVM on top of mass-market stupid hardware exactly to reduce complexity.

Gruss
Bernd
--
http://bernd.eckenfels.net
matthew patton
2017-10-21 16:05:32 UTC
0) what is the full DD command you are issuing? (I think we have this)

1) does your DD command work when LVM is not using caching of any kind.

2) does your DD command work if using 'direct' mode

3) are you able to write smaller chunks from NON-cached LVM volume to SSD vdev? Is there an inflection point in size where it goes haywire?

4) what is your IO elevator/scheduler set to?

5) what is value of
vm.dirty_background_ratio
vm.dirty_ratio
vm.dirty_background_bytes
vm.dirty_bytes

What do you observe in /proc/vmstat during DD?

6) run DD via strace
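
For anyone following along, the settings asked about in points 4) and 5) could be collected in one go with something like the sketch below. The helper names are made up for illustration; the script only reads the standard /proc/sys/vm and /sys/block files and assumes a Linux system where they are present.

```shell
#!/bin/sh
# Sketch: collect the settings asked about in points 4) and 5).
# The helper names are hypothetical; only standard /proc and /sys
# files are read, nothing is changed.

# Print the vm.dirty_* knobs. The optional argument overrides the
# directory to read from (default: /proc/sys/vm).
show_dirty_settings() {
    dir="${1:-/proc/sys/vm}"
    for name in dirty_background_ratio dirty_ratio \
                dirty_background_bytes dirty_bytes; do
        if [ -r "$dir/$name" ]; then
            printf 'vm.%s = %s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
}

# Print the I/O scheduler line of every block device; the active
# scheduler is the one shown in square brackets.
show_schedulers() {
    for f in /sys/block/*/queue/scheduler; do
        if [ -r "$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$f")"
        fi
    done
}

show_dirty_settings
show_schedulers
```

Running it before the dd test and pasting the output into a reply would cover both questions at once.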
Oleg Cherkasov
2017-10-24 18:09:46 UTC
Some of your questions are answered in the thread ...

On 21. okt. 2017 18:05, matthew patton wrote:
> 0) what is the full DD command you are issuing? (I think we have this)

dd if=file_250G of=/dev/null status=progress

>
> 1) does your DD command work when LVM is not using caching of any kind.

Just dd had been running.

>
> 2) does your DD command work if using 'direct' mode

nope

>
> 3) are you able to write smaller chunks from NON-cached LVM volume to SSD vdev? Is there an inflection point in size where it goes haywire?

Tried for a smaller file, system became unresponsive for few minutes,
LVM cache 51% however system survived with no reboot.

>
> 4) what is your IO elevator/scheduler set to?

deadline for all disks in LV

>
> 5) what is value of
> vm.dirty_background_ratio
> vm.dirty_ratio
> vm.dirty_background_bytes
> vm.dirty_bytes

vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
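
For scale, the two ratios above can be turned into rough sizes. This is only a back-of-envelope sketch and it assumes that all 12 GiB of RAM counts as available memory; the kernel applies the percentages to free plus reclaimable memory, so the real thresholds would be somewhat lower.

```shell
#!/bin/sh
# Rough upper bounds for the dirty-page thresholds.
# Assumption: all 12 GiB of RAM counts as "available" memory.
ram_mib=$(( 12 * 1024 ))

# $1 = percentage (vm.dirty_background_ratio or vm.dirty_ratio)
dirty_threshold_mib() {
    echo $(( ram_mib * $1 / 100 ))
}

echo "background writeback starts near $(dirty_threshold_mib 10) MiB"
echo "writers are throttled near $(dirty_threshold_mib 20) MiB"
```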

>
> What do you observe in /proc/vmstat during DD?
>
> 6) run DD via strace

Once again, the system was not responding to ICMP, so checking vmstat does
not make any sense given the denial of service to ssh or the terminal.

strace? What are you planning to see there? open() and continuous read()
system calls?
matthew patton
2017-10-21 16:08:42 UTC
>  But if it's something that only exhibits in writeback mode rather than writethrough, then I'd guess it's to do with the
> list of writeback work that the policy builds.  So check

OP is being coy about the DD command but I saw it mentioned off-hand earlier.

dd if=/250GB_file/on_LVM/H800_vdev of=/dev/null

It's a pure, streaming read. LVM cache should be doing absolutely nothing.
matthew patton
2017-10-23 21:02:35 UTC
>On Mon, 10/23/17, John Stoffel <***@stoffel.org> wrote:

SSD pathologies aside, why are we concerned about the cache layer on a streaming read?

By definition the cache shouldn't be involved at all.
Xen
2017-10-23 21:54:28 UTC
matthew patton schreef op 23-10-2017 21:02:
>> On Mon, 10/23/17, John Stoffel <***@stoffel.org> wrote:
>
> SSD pathologies aside, why are we concerned about the cache layer on a
> streaming read?
>
> By definition the cache shouldn't be involved at all.

Because whatever purpose you are using it for, it shouldn't OOM the
system.
John Stoffel
2017-10-24 02:51:26 UTC
>>>>> "matthew" == matthew patton <***@yahoo.com> writes:

>> On Mon, 10/23/17, John Stoffel <***@stoffel.org> wrote:

matthew> SSD pathologies aside, why are we concerned about the cache
matthew> layer on a streaming read?

matthew> By definition the cache shouldn't be involved at all.

Because his system is going into OOM when doing this? Yes, the cache
probably won't do anything for a streaming read; it needs to be
primed. But when the system craps out... it cries out to be figured
out.
matthew patton
2017-10-23 23:40:05 UTC
> Because whatever purpose you are using it for, it shouldn't OOM the system.

I posted a 6 point query to the list 2 days ago as to what are the various settings being used (not LVM related) and also pointed out that not using odirect was necessarily going to try to stuff the file into the linux vm system which was bound to cause all kind of grief.

Maybe I'm missing responses but I haven't seen any answers to those questions which has nothing to do with LVM. I would be very surprised this has anything to do with LVM.
Xen
2017-10-24 15:36:44 UTC
matthew patton schreef op 24-10-2017 1:40:
>> Because whatever purpose you are using it for, it shouldn't OOM the
>> system.
>
> I posted a 6 point query to the list 2 days ago as to what are the
> various settings being used (not LVM related) and also pointed out
> that not using odirect was necessarily going to try to stuff the file
> into the linux vm system which was bound to cause all kind of grief.
>
> Maybe I'm missing responses but I haven't seen any answers to those
> questions which has nothing to do with LVM. I would be very surprised
> this has anything to do with LVM.

LVM is a system that is meant to run flawlessly without extra
configuration.

You seem to be invested in not having problems solved.

I don't know.
lejeczek
2017-10-24 14:51:45 UTC
I realized that same day I replied, mailman disabled my
subscription, so in case it did not get through, again:

hi

not much of an explanation or insight as to what might be
going wrong with your setup/system, but instead my own
conclusions/suggestions from bits of my experience, which
I will share...

I would - if bigger part of a storage subsystem resides in
the hardware - stick to the hardware, use CacheCade, let the
hardware do the lot.

On LVM - similarly, stick to LVM, let LVM manage the whole
lot (you will lose ~50% of a single average core (Opteron
6376) with raid5). Use the simplest HBAs (Dell has such), no
raid, not even JBOD. If disks are in the same enclosure, or
simply under the same HBA (even though it's just an HBA) - do
*not* mix SATA & SAS (it may work, but better not, from my
experience).

Last one, keep that freaking firmware updated everywhere
possible, disks too (my latest experience is with Seagate 2TB
SAS, over a hundred of those in two enclosures - I cannot,
the update does not work, and Seagate's website tech
support was useless = stay away from Seagate).

I'll keep my fingers crossed for you - as for luck, there is
never too much of it.

matthew patton
2017-10-24 22:01:25 UTC
Oleg wrote:

>> 0) what is the full DD command you are issuing? (I think we have this)
> dd if=file_250G of=/dev/null status=progress

You do realize this is copying data to virtual memory (i.e. it's buffering data), which is pointless for both benchmark and backup/restore purposes? It also generates VM pressure and swapping until the kernel is forced to discard pages or resort to OOM.
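
For comparison, the same read can be repeated with the page cache bypassed. A minimal sketch, assuming GNU dd; the helper name read_direct and the 1 MiB block size are illustrative choices, not something taken from this thread:

```shell
#!/bin/sh
# Read a file with O_DIRECT so the data is not buffered in the
# page cache. Assumes GNU dd; bs=1M is an arbitrary example value.
# Hypothetical helper: $1 = file to read.
read_direct() {
    dd if="$1" of=/dev/null bs=1M iflag=direct
}

# Example (not executed here):
#   read_direct file_250G
```

If the direct read behaves and the buffered one does not, that would point at memory pressure from buffering rather than at the block layer underneath.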

>> 1) does your DD command work when LVM is not using caching of any kind.
> Just dd had been running.

I mean, did you strip the LVM device holding the 250GB file of all caching (lvconvert --splitcache VG/CacheLV) and otherwise remove any and all associations with the SSD virtual device?

>> 2) does your DD command work if using 'direct' mode
> nope

What command modifiers did you use precisely? And was this failure also observed with straight-up NON-cached LVM?

>> 3) are you able to write smaller chunks from NON-cached LVM volume to SSD vdev?
>> Is there an inflection point in size where it goes haywire?

> Tried for a smaller file, system became unresponsive for few minutes,
> LVM cache 51% however system survived with no reboot.

What was the size of this file that succeeded, if poorly?

How in the hell is the LVM cache being used at all? It has no business caching ANYTHING on streaming reads. Hmm, it turns out dm-cache/lvmcache really is retarded. It copies data to cache on first read and furthermore doesn't appear to detect streaming reads which have no value for caching purposes.

Somebody thought they were doing the world a favor when they clearly had insufficient real-world experience. Worse, you can't even tune away the not necessarily helpful assumptions.
https://www.mjmwired.net/kernel/Documentation/device-mapper/cache-policies.txt

If you guys over at RedHat would oblige with a Nerf clue-bat to the persons involved, being able to forcibly override the cache/promotion settings would be a very nice thing to have back. For most situations it may not have any real value, but for this pathological workload, a sysadmin should be able to intervene.

Much of what is below is beside the point now that dm-cache is stuck in permanent 'dummy mode'. I maintain that using SSD caching for your application (backup server, all streaming read/write) is a total waste of time anyway. If you still persist in wanting a modicum of caching intelligence, use BCache, (BTier?) or LSI CacheCade.

--------------------
what is output of
lvs -o+cache_policy,cache_settings VG/CacheLV

Please remove LVM caching capability from everywhere, including the origin volume, and test writing to the raw SSD virtual disk, i.e. /dev/sdxx or whatever the Dell VD is called as recognized by the SCSI layer. I suspect your SSD is crap and/or the Perc+SSD combo is crap. Please test them independently of any confounding influences of your LVM origin. Test the raw block device, not anything (filesystem or LVM) layered on top.

What brand/type SSDs are we talking about?

Unless the rules have changed, for a 250GB cache data LV you need a metadata LV of at least 250MB. Somewhere I think someone said you had a whole lot less? Or did you alloc 1GB to the metadata and I'm misremembering?

What size did you set your cache_blocks to? 256k?

What is the output of dmsetup on your LVM origin in cached mode?

What did you set read_promote_adjustment and write_promote_adjustment to?
Chris Friesen
2017-10-24 23:10:27 UTC
On 10/24/2017 04:01 PM, matthew patton wrote:

> How in the hell is the LVM cache being used at all? It has no business
> caching ANYTHING on streaming reads. Hmm, it turns out dm-cache/lvmcache
> really is retarded. It copies data to cache on first read and furthermore
> doesn't appear to detect streaming reads which have no value for caching
> purposes.

Technically it's not entirely true to say that streaming reads have no value for
caching purposes. It's conceivable to have a workload where the same file gets
read over and over, in which case it might be useful to have it cached on an SSD.

As I understand it dm-cache is using smq, which essentially uses an LRU
algorithm. So yes, it'll read the streaming data into the cache, but the
read-once/written-never data should also be the most likely to be evicted from
the cache.

For what it's worth, the Linux kernel also copies data to the page cache on
reads, which is why they introduced posix_fadvise(POSIX_FADV_DONTNEED) to allow
the application to indicate that it's done with the data and it can be dropped
from the page cache.

Chris
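
From the shell, GNU dd offers a similar hint through its nocache flag, which the coreutils manual describes as a request to discard the cached data for the file (as far as I know it is implemented with posix_fadvise(POSIX_FADV_DONTNEED)). A minimal sketch, assuming a coreutils version that supports iflag=nocache; the helper name stream_nocache and the block size are hypothetical:

```shell
#!/bin/sh
# Stream a file to /dev/null while asking the kernel to drop the
# pages that have already been read. Assumes GNU dd with support
# for iflag=nocache; bs=1M is an arbitrary example value.
# Hypothetical helper: $1 = file to read.
stream_nocache() {
    dd if="$1" of=/dev/null bs=1M iflag=nocache
}

# Example (not executed here):
#   stream_nocache file_250G
```

This only concerns the page cache. Whether dm-cache still promotes the blocks to the SSD is a separate question that this flag does not touch.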