Discussion:
[linux-lvm] Snapshot behavior on classic LVM vs ThinLVM
Gionatan Danti
2017-04-06 14:31:31 UTC
Hi all,
I'm seeking some advice for a new virtualization system (KVM) on top of
LVM. The goal is to take agentless backups via LVM snapshots.

In short: what would you suggest for snapshotting a quite big (8+ TB)
volume? Classic LVM (with the old snapshot behavior) or thinlvm (with its
new snapshot method)?

Long story:
In the past, I used classic, preallocated logical volumes directly
exported as virtual disks. In that case, I snapshot the single LV I want
to back up and copy it with dd/ddrescue.
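In practice this looked roughly like the following (volume group and LV
names here are made up):

  # take a small classic snapshot of the VM's LV
  lvcreate --snapshot --size 20G --name vm1-snap vg0/vm1
  # copy the frozen block image; ddrescue works as well
  dd if=/dev/vg0/vm1-snap of=/backup/vm1.img bs=1M conv=sparse
  # drop the snapshot as soon as the copy is done
  lvremove -f vg0/vm1-snap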

The problem is that this solution prevents any use of thin allocation or
sparse files, so I tried to replace it with something filesystem-based.
Lately I have used another approach, configuring a single thinly provisioned
LV (with no zeroing) + XFS + raw or qcow2 virtual machine images. To make
backups, I snapshotted the entire thin LV and, after mounting it,
copied the required files.
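A rough sketch of that procedure (again with made-up names; vg0/thinvol
being the XFS-formatted thin LV holding the images):

  # snapshot the whole thin LV (a thin snapshot allocates no space up front)
  lvcreate --snapshot --name thinvol-bck vg0/thinvol
  # thin snapshots carry the "activation skip" flag by default, hence -K
  lvchange -ay -K vg0/thinvol-bck
  # nouuid is needed because the XFS clone has the same UUID as the origin
  mount -o ro,nouuid /dev/vg0/thinvol-bck /mnt/bck
  cp --sparse=always /mnt/bck/vm1.qcow2 /backup/
  umount /mnt/bck
  lvremove -f vg0/thinvol-bck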

So far this second solution has worked quite well. However, before using it
in more and more installations, I wonder whether it is the correct approach
or whether something better, especially from a stability standpoint, is possible.

Given that I would like to use XFS, and that I need snapshots at the
block level, two possibilities come to mind:

1) continue to use thinlvm + thin snapshots + XFS. What do you think
about an 8+ TB thin pool/volume with relatively small (64/128 KB) chunks?
Would you be comfortable using it for production workloads? What about
power-loss protection? From my understanding, thinlvm passes flushes down
whenever the higher layers issue them, and so it should be reasonably safe
against unexpected power loss. Is this view right? (See the sketch after
point 2 below.)

2) use classic (non-thin) LVM + a normal snapshot + XFS. I know for sure
that LV size is not an issue here; however, a big snapshot size used to be
problematic: the CoW table had to be read completely before the snapshot
could be activated. Is this problem solved now, or can big snapshots still
be problematic?
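For point 1, a minimal sketch of how such a layout could be created (names,
sizes and chunk size are only illustrative):

  pvcreate /dev/sdb
  vgcreate vg0 /dev/sdb
  # 8 TB pool, 64 KB chunks, zeroing disabled as described above
  lvcreate --type thin-pool --size 8T --chunksize 64k --zero n --name pool0 vg0
  # one big thin LV on top, formatted with XFS for the image files
  lvcreate --thin --virtualsize 8T --name thinvol vg0/pool0
  mkfs.xfs /dev/vg0/thinvol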

Thank you all.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Mark Mielke
2017-04-07 08:19:24 UTC
On Thu, Apr 6, 2017 at 10:31 AM, Gionatan Danti <***@assyoma.it> wrote:

> I'm seeking some advice for a new virtualization system (KVM) on top of
> LVM. The goal is to take agentless backups via LVM snapshots.
>
> In short: what you suggest to snapshot a quite big (8+ TB) volume? Classic
> LVM (with old snapshot behavior) or thinlvm (and its new snapshot method)?
>

I found classic LVM snapshots to suffer terrible performance. I switched to
BTRFS as a result, until LVM thin pools became a real thing, and I happily
switched back.

I expect this depends on exactly what access patterns you have, how many
accesses will happen during the time the snapshot is held, and whether you
are using spindles or flash. Still, even with some attempt to be objective
and critical... I think I would basically never use classic LVM snapshots
for any purpose, ever.


--
Mark Mielke <***@gmail.com>
Gionatan Danti
2017-04-07 09:12:25 UTC
On 07-04-2017 10:19, Mark Mielke wrote:
>
> I found classic LVM snapshots to suffer terrible performance. I
> switched to BTRFS as a result, until LVM thin pools became a real
> thing, and I happily switched back.

So you are now on lvmthin? May I ask at what pool/volume/filesystem
sizes?

>
> I expect this depends on exactly what access patterns you have, how
> many accesses will happen during the time the snapshot is held, and
> whether you are using spindles or flash. Still, even with some attempt
> to be objective and critical... I think I would basically never use
> classic LVM snapshots for any purpose, ever.
>

Sure, but for nightly backups the reduced performance should not be a
problem. Moreover, increasing the snapshot chunk size (e.g. from the default
4K to 64K) gives much faster write performance.
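For example (hypothetical names), a backup snapshot with 64K chunks would be
created as:

  # -c/--chunksize sets the CoW chunk size of a classic snapshot (power of 2, 4k-512k)
  lvcreate --snapshot --size 50G --chunksize 64k --name nightly-snap vg0/bigvol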

I am more concerned about lengthy snapshot activation due to a big, linear
CoW table that must be read completely...

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
L A Walsh
2017-04-07 13:50:34 UTC
Gionatan Danti wrote:
> I am more concerned about lengthy snapshot activation due to a big,
> linear CoW table that must be read completely...
---
What is 'big'? Are you just worried about the I/O time?
If that's the case, much will depend on your HW. Are we talking
about 8T hard disks concatenated into a single volume, or a
RAID1, or what? With a HW RAID10, getting over 1 GB/s isn't
difficult for a contiguous read. So how big is the CoW table,
and how fragmented is it? Even with fragments, with enough spindles
you could likely still get enough I/O ops that I/O speed shouldn't
be a critical bottleneck...

However, regarding performance, I used to take daily snapshots
using normal LVM (before thin was available), with rsync creating
a difference volume between yesterday's snapshot and today's content.
On a 1 TB volume at ~75% full, it would take 45 minutes to 1.5 hours to
create. Multiplied by 8... backups wouldn't just be 'nightly'.
That was using about 12 data spindles.

Unfortunately, I've never benchmarked the thin volumes. Also,
they were NOT for backup purposes (those were separate, using
xfsdump). Besides performance and reliability, a main reason
to use snapshots was to provide "previous versions" of files to
Windows clients. That allowed quick recoveries from file-wiping
mistakes by opening the previous version of the file or of the
containing directory.
Gionatan Danti
2017-04-07 16:33:47 UTC
On 07-04-2017 15:50, L A Walsh wrote:
> Gionatan Danti wrote:
>> I am more concerned about lengthy snapshot activation due to a big,
>> linear CoW table that must be read completely...
> ---
> What is 'big'? Are you just worried about the IO time?
> If that's the case, much will depend on your HW. Are we talking
> using 8T hard disks concatenated into a single volume, or in a
> RAID1, or what? W/a HW-RAID10 getting over 1GB/s isn't
> difficult for a contiguous read. So how big is the CoW table
> and how fragmented is it? Even w/fragments, with enough spindles
> you could still, likely, get enough I/O Ops where I/O speed shouldn't
> be a critical bottleneck...

For the logical volume itself, I target an 8+ TB size. However, what
worries me is *not* the LV size by itself (I know that LVM can be used on
volumes much bigger than that), but rather the snapshot CoW table. In short,
from reading this list and from first-hand testing, big snapshots (20+
GB) require lengthy activation, due to inefficiencies in how classic
snapshot metadata (i.e. non-thinly-provisioned) is laid out and used.
However, I read that this was somewhat addressed lately. Do you have any insight?

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Stuart Gathman
2017-04-13 12:59:07 UTC
Using a classic snapshot for backup does not normally involve activating
a large CoW. I generally create a smallish snapshot (a few gigs) that
will not fill up during the backup process. If for some reason a
snapshot were to fill up before the backup completes, reads from the
snapshot get I/O errors (I've tested this), which raises an alarm and aborts
the backup. Yes, keeping a snapshot around and activating it at boot can be
a problem as the CoW gets large.
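The backup script can also poll the fill level itself and bail out early; a
minimal sketch, assuming a snapshot named vg0/vm1-snap:

  # Data% of a classic snapshot reports how full the CoW area is
  used=$(lvs --noheadings -o data_percent vg0/vm1-snap | tr -d ' ')
  if [ "${used%.*}" -ge 90 ]; then
      echo "snapshot nearly full, aborting backup" >&2
      exit 1
  fi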

If you are going to keep snapshots around indefinitely, the thinpools
are probably the way to go. (What happens when you fill up those?
Hopefully it "freezes" the pool rather than losing everything.)

On 04/07/2017 12:33 PM, Gionatan Danti wrote:

> For the logical volume itself, I target a 8+ TB size. However, what
> worries me is *not* LV size by itself (I know that LVM can be used on
> volume much bigger than that), rather the snapshot CoW table. In
> short, from reading this list and from first-hand testing, big
> snapshots (20+ GB) require lenghtly activation, due to inefficiency in
> how classic metadata (ie: non thinly-provided) are layed out/used.
> However, I read that this was somewhat addressed lately. Do you have
> any insight?
>
Xen
2017-04-13 13:52:17 UTC
Stuart Gathman wrote on 13-04-2017 14:59:

> If you are going to keep snapshots around indefinitely, the thinpools
> are probably the way to go. (What happens when you fill up those?
> Hopefully it "freezes" the pool rather than losing everything.)

My experience is that the system crashes.

I have not tested this with a snapshot but a general thin pool overflow
crashes the system.

Within half a minute, I think.

It is irrelevant whether the volumes have anything to do with the
operation of the system; i.e. mounted volumes that you write to but that
are otherwise unused will still crash the system.
Zdenek Kabelac
2017-04-13 14:33:30 UTC
On 13.4.2017 at 15:52, Xen wrote:
> Stuart Gathman wrote on 13-04-2017 14:59:
>
>> If you are going to keep snapshots around indefinitely, the thinpools
>> are probably the way to go. (What happens when you fill up those?
>> Hopefully it "freezes" the pool rather than losing everything.)
>
> My experience is that the system crashes.
>
> I have not tested this with a snapshot but a general thin pool overflow
> crashes the system.
>
> Within half a minute, I think.
>
> It is irrelevant whether the volumes have anything to do with the operation of
> the system; i.e. mounted volumes that you write to but that are otherwise
> unused will still crash the system.

Hello

Let's just repeat.

A full thin-pool is NOT in any way comparable to a full filesystem.

A full filesystem ALWAYS has room for its metadata - it's not pretending it's
bigger - it has 'finite' space and expects this space to just BE there.

Now when you have a thin-pool - it causes quite a lot of trouble across a
number of layers. These are solvable and being fixed.

But rule #1 still applies - do not run your thin-pool out of space - it
will not always heal easily without losing data - there is no simple,
straightforward way to fix it (especially when the user cannot ADD any new
space he promised to have).

So monitoring the pool and taking action ahead of time is always a superior
solution to any later post-mortem system restore.


Regards

Zdenek
Xen
2017-04-13 14:47:41 UTC
Zdenek Kabelac wrote on 13-04-2017 16:33:

> Hello
>
> Just let's repeat.
>
> Full thin-pool is NOT in any way comparable to full filesystem.
>
> Full filesystem has ALWAYS room for its metadata - it's not pretending
> it's bigger - it has 'finite' space and expect this space to just BE
> there.
>
> Now when you have thin-pool - it cause quite a lot of trouble across
> number of layers. There are solvable and being fixed.
>
> But as the rule #1 still applies - do not run your thin-pool out of
> space - it will not always heal easily without losing date - there is
> not a simple straighforward way how to fix it (especially when user
> cannot ADD any new space he promised to have)
>
> So monitoring pool and taking action ahead in time is always superior
> solution to any later postmortem systems restores.

Yes, that's what I said. If your thin pool runs out, your system will
crash.

Thanks for confirming that this will also happen if a thin snapshot causes
it (obviously).

Regards.
Stuart Gathman
2017-04-13 15:29:34 UTC
On 04/13/2017 10:33 AM, Zdenek Kabelac wrote:
>
>
> Now when you have thin-pool - it cause quite a lot of trouble across
> number of layers. There are solvable and being fixed.
>
> But as the rule #1 still applies - do not run your thin-pool out of
> space - it will not always heal easily without losing date - there is
> not a simple straighforward way how to fix it (especially when user
> cannot ADD any new space he promised to have)
IMO, the friendliest thing to do is to freeze the pool in read-only mode
just before running out of metadata. While still involving application
level data loss (the data it was just trying to write), and still
crashing the system (the system may be up and pingable and maybe even
sshable, but is "crashed" for normal purposes), it is simple to
understand and recover. A sysadmin could have a plain LV for the
system volume, so that logs and stuff would still be kept, and admin
logins work normally. There is no panic, as the data is there read-only.
Xen
2017-04-13 15:43:18 UTC
Stuart Gathman wrote on 13-04-2017 17:29:

> IMO, the friendliest thing to do is to freeze the pool in read-only
> mode
> just before running out of metadata.

It's not about metadata but about physical extents.

In the thin pool.

> While still involving application
> level data loss (the data it was just trying to write), and still
> crashing the system (the system may be up and pingable and maybe even
> sshable, but is "crashed" for normal purposes)

Then it's not crashed. Only some application that makes use of the
data volume may be crashed, but not the entire system.

The point is that I/O errors, on a filesystem that has errors=remount-ro,
are okay.

If a regular snapshot that is mounted fills up, the mount is dropped.

The system continues operating as normal.

> , it is simple to
> understand and recover. A sysadmin could have a plain LV for the
> system volume, so that logs and stuff would still be kept, and admin
> logins work normally. There is no panic, as the data is there
> read-only.

Yeah, a system panic in the sense of some volume becoming read-only is
perfectly acceptable.

However, the kernel going into complete mayhem is not.
Stuart D. Gathman
2017-04-13 17:26:46 UTC
On Thu, 13 Apr 2017, Xen wrote:

> Stuart Gathman wrote on 13-04-2017 17:29:
>
>> understand and recover. A sysadmin could have a plain LV for the
>> system volume, so that logs and stuff would still be kept, and admin
>> logins work normally. There is no panic, as the data is there read-only.
>
> Yeah a system panic in terms of some volume becoming read-only is perfectly
> acceptable.
>
> However the kernel going entirely mayhem, is not.

Heh. I was actually referring to *sysadmin* panic, not kernel panic.
:-)

But yeah, sysadmin panic can result in massive data loss...

--
Stuart D. Gathman <***@gathman.org>
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.
Stuart D. Gathman
2017-04-13 17:32:00 UTC
On Thu, 13 Apr 2017, Xen wrote:

> Stuart Gathman wrote on 13-04-2017 17:29:
>
>> IMO, the friendliest thing to do is to freeze the pool in read-only mode
>> just before running out of metadata.
>
> It's not about metadata but about physical extents.
>
> In the thin pool.

Ok. My understanding is that *all* the volumes in the same thin-pool would
have to be frozen when running out of extents, as writes all pull from
the same pool of physical extents.

--
Stuart D. Gathman <***@gathman.org>
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.
Xen
2017-04-14 15:17:55 UTC
Stuart D. Gathman wrote on 13-04-2017 19:32:
> On Thu, 13 Apr 2017, Xen wrote:
>
>> Stuart Gathman wrote on 13-04-2017 17:29:
>>
>>> IMO, the friendliest thing to do is to freeze the pool in read-only
>>> mode
>>> just before running out of metadata.
>>
>> It's not about metadata but about physical extents.
>>
>> In the thin pool.
>
> Ok. My understanding is that *all* the volumes in the same thin-pool
> would have to be frozen when running out of extents, as writes all
> pull from
> the same pool of physical extents.

Yes, I simply tested with a small thin pool not used for anything else.

The volumes were no more than a few hundred megabytes, so they were easy to
fill up.

When I copied a file to one of the volumes that the pool couldn't handle,
the system quickly crashed.

Upon reboot the pool was neatly filled to 100%, and I could casually remove
the volumes or whatever.
Gionatan Danti
2017-04-14 07:27:20 UTC
On 13-04-2017 16:33, Zdenek Kabelac wrote:
>
> Hello
>
> Just let's repeat.
>
> Full thin-pool is NOT in any way comparable to full filesystem.
>
> Full filesystem has ALWAYS room for its metadata - it's not pretending
> it's bigger - it has 'finite' space and expect this space to just BE
> there.
>
> Now when you have thin-pool - it cause quite a lot of trouble across
> number of layers. There are solvable and being fixed.
>
> But as the rule #1 still applies - do not run your thin-pool out of
> space - it will not always heal easily without losing date - there is
> not a simple straighforward way how to fix it (especially when user
> cannot ADD any new space he promised to have)
>
> So monitoring pool and taking action ahead in time is always superior
> solution to any later postmortem systems restores.
>

If I remember correctly, EXT4 with errors=remount-ro should freeze the
filesystem as soon as write errors are detected. Is this configuration
safer than the standard behavior? Do you know if XFS (the RHEL *default*
filesystem) supports something similar?
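For reference, the ext4 behaviour is selected per mount; an illustrative
fstab entry would be something like the lines below. As far as I know XFS has
no errors= mount option; on a fatal metadata I/O error it shuts the
filesystem down instead.

  # /etc/fstab: remount read-only on the first detected error
  /dev/vg0/thinvol  /var/lib/libvirt/images  ext4  defaults,errors=remount-ro  0  2
  # or switch an already-mounted filesystem at runtime
  mount -o remount,errors=remount-ro /var/lib/libvirt/images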

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Gionatan Danti
2017-04-14 07:23:17 UTC
On 13-04-2017 14:59, Stuart Gathman wrote:
> Using a classic snapshot for backup does not normally involve
> activating
> a large CoW. I generally create a smallish snapshot (a few gigs),
> that
> will not fill up during the backup process. If for some reason, a
> snapshot were to fill up before backup completion, reads from the
> snapshot get I/O errors (I've tested this), which alarms and aborts the
> backup. Yes, keeping a snapshot around and activating it at boot can
> be
> a problem as the CoW gets large.
>
> If you are going to keep snapshots around indefinitely, the thinpools
> are probably the way to go. (What happens when you fill up those?
> Hopefully it "freezes" the pool rather than losing everything.)
>

Hi, there is no need to keep snapshots around. If there were, the classic LVM
solution would be completely inadequate.

I simply worry that, with many virtual machines, even the temporary
backup snapshot can fill up and cause some problem. When the snapshot
fills up, apart from it being dropped, is there anything I need to be
worried about?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Xen
2017-04-14 15:23:40 UTC
Gionatan Danti wrote on 14-04-2017 9:23:
> On 13-04-2017 14:59, Stuart Gathman wrote:
>> Using a classic snapshot for backup does not normally involve
>> activating
>> a large CoW. I generally create a smallish snapshot (a few gigs),
>> that
>> will not fill up during the backup process. If for some reason, a
>> snapshot were to fill up before backup completion, reads from the
>> snapshot get I/O errors (I've tested this), which alarms and aborts
>> the
>> backup. Yes, keeping a snapshot around and activating it at boot can
>> be
>> a problem as the CoW gets large.
>>
>> If you are going to keep snapshots around indefinitely, the thinpools
>> are probably the way to go. (What happens when you fill up those?
>> Hopefully it "freezes" the pool rather than losing everything.)
>>
>
> Hi, no need to keep snapshot around. If so, the classic LVM solution
> would be completely inadequate.
>
> I simply worry that, with many virtual machines, even the temporary
> backup snapshot can fill up and cause some problem. When the snapshot
> fills, apart from it being dropped, there is anything I need to be
> worried about?

A thin snapshot won't be dropped. It is allocated with the same size as
the origin volume and hence can never fill up.

Only the pool itself can fill up, but unless you have some monitoring
software in place that can intervene in case of an anomaly and kill the
snapshot, your system will (or may) simply freeze rather than drop anything.
Gionatan Danti
2017-04-14 15:53:18 UTC
On 14-04-2017 17:23, Xen wrote:
> A thin snapshot won't be dropped. It is allocated with the same size
> as the origin volume and hence can never fill up.
>
> Only the pool itself can fill up but unless you have some monitoring
> software in place that can intervene in case of anomaly and kill the
> snapshot, your system will or may simply freeze and not drop anything.
>

Yeah, I understand that. In that sentence, I was speaking about classic
LVM snapshots.

The dilemma is:
- classic LVM snapshots have low performance (but adequate for backup
purposes) and, if they grow too much, snapshot activation can be
problematic (especially on boot);
- thin snapshots have much better performance but do not always fail
gracefully (i.e.: pool full).

For nightly backups, which of the two would you pick?
Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Stuart Gathman
2017-04-14 16:08:58 UTC
On 04/14/2017 11:53 AM, Gionatan Danti wrote:
>
> Yeah, I understand that. In that sentence, I was speaking about
> classic LVM snapshot.
>
> The dilemma is:
> - classic LVM snapshots have low performance (but adequate for backup
> purpose) and, if growing too much, snapshot activation can be
> problematic (especially on boot);
> - thin-snapshots have much better performance but does not always fail
> gracefully (ie: pool full).
>
> For nightly backups, what you would pick between the two?
You've summarized it nicely. I currently stick with classic snapshots
for nightly backups with smallish CoW (so in case backup somehow fails
to remove the snapshot, production performance doesn't suffer).

The failure model for classic snapshots is that if the CoW fills, the
snapshot is invalid (both read and write return IOerror), but otherwise
the system keeps humming along smoothly (with no more performance
penalty on the source volume).

Before putting production volumes in a thinpool, the failure model needs
to be sane. However much the admin is enjoined never to let the pool run
out of space - it *will* happen. Having the entire pool freeze in read-only
mode (without crashing the kernel) would be an acceptable failure mode.
A more complex failure mode would be to have the other volumes in the
pool keep operating until they need a new extent - at which point they
too freeze.

With a readonly frozen pool, even in the case where metadata is also
full (so you can't add new extents), you can still add new storage and
copy logical volumes to a new pool (with more generous metadata and
chunk sizes).
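A sketch of that recovery path (device names and sizes made up):

  # add the new storage to the VG and build a roomier pool on it
  vgextend vg0 /dev/sdc
  lvcreate --type thin-pool --size 10T --chunksize 128k --poolmetadatasize 8G --name pool1 vg0
  # recreate the volume in the new pool and copy the data across
  lvcreate --thin --virtualsize 8T --name thinvol-new vg0/pool1
  dd if=/dev/vg0/thinvol of=/dev/vg0/thinvol-new bs=1M conv=sparse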

It is not LVM's problem if the system crashes because a filesystem can't
handle a volume suddenly going read-only. All filesystems used in a
thinpool should be able to handle that situation.
Xen
2017-04-14 17:36:33 UTC
Gionatan Danti wrote on 14-04-2017 17:53:
> On 14-04-2017 17:23, Xen wrote:
>> A thin snapshot won't be dropped. It is allocated with the same size
>> as the origin volume and hence can never fill up.
>>
>> Only the pool itself can fill up but unless you have some monitoring
>> software in place that can intervene in case of anomaly and kill the
>> snapshot, your system will or may simply freeze and not drop anything.
>>
>
> Yeah, I understand that. In that sentence, I was speaking about
> classic LVM snapshot.
>
> The dilemma is:
> - classic LVM snapshots have low performance (but adequate for backup
> purpose) and, if growing too much, snapshot activation can be
> problematic (especially on boot);
> - thin-snapshots have much better performance but does not always fail
> gracefully (ie: pool full).
>
> For nightly backups, what you would pick between the two?
> Thanks.

Oh, I'm sorry, I didn't read your message that way.

I have a not-very-busy hobby server of sorts that creates a snapshot every
day, mounts it and exports it via NFS to a backup host that will
pull from it if everything keeps working ;-).

When I created the thing I thought that 1 GB of snapshot space would be
enough; there should not be many logs, and everything worth something is
sitting on other partitions; so this is only the root volume and the
/var/log directory, so to speak.

To my surprise, the update script regularly emails me that when it
removed the root snapshot, it was not mounted.

When I log on during the day, the snapshot is already half filled. I do
not know what causes this. I cannot find any logs or anything else that
would warrant such behaviour. But the best part of it all is that the
system never suffers.

The thing is just dismounted, apparently; I don't even know what causes
it.

The other volumes are thin. I am just very afraid of the thing filling
up due to some runaway process or an error on my part.

If I have a 30GB volume and a 30GB snapshot of that volume, and if this
volume is nearly empty and something starts filling it up, it will do
twice the writes to the thin pool. Any damage done is doubled.

The only thing that could save you (me) at this point is a process
instantly responding to some 90%-full message, and hoping it'd be in
time. Of course I don't have this monitoring in place; everything
requires work.

Here is a script someone has written for Nagios:

https://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_lvm/details

Then someone else did the same for NewRelic:

https://discuss.newrelic.com/t/lvm-thin-pool-monitoring/29295/17

My version of LVM indicates only the following:

# snapshot_library is the library used when monitoring a snapshot device.
#
# "libdevmapper-event-lvm2snapshot.so" monitors the filling of
# snapshots and emits a warning through syslog when the use of
# the snapshot exceeds 80%. The warning is repeated when 85%, 90% and
# 95% of the snapshot is filled.

snapshot_library = "libdevmapper-event-lvm2snapshot.so"

# thin_library is the library used when monitoring a thin device.
#
# "libdevmapper-event-lvm2thin.so" monitors the filling of
# pool and emits a warning through syslog when the use of
# the pool exceeds 80%. The warning is repeated when 85%, 90% and
# 95% of the pool is filled.

thin_library = "libdevmapper-event-lvm2thin.so"

I'm sorry, I was trying to discover how to use journalctl to check for
the message and it is just incredibly painful.
Gionatan Danti
2017-04-14 18:59:01 UTC
On 14-04-2017 19:36, Xen wrote:
> The thing is just dismounted apparently; I don't even know what causes
> it.
>

Maybe running "iotop -a" for some hours can point you in the right
direction?

> The other volumes are thin. I am just very afraid of the thing filling
> up due to some runaway process or an error on my part.
>
> If I have a 30GB volume and a 30GB snapshot of that volume, and if
> this volume is nearly empty and something starts filling it up, it
> will do twice the writes to the thin pool. Any damage done is doubled.
>
> The only thing that could save you (me) at this point is a process
> instantly responding to some 90% full message and hoping it'd be in
> time. Of course I don't have this monitoring in place; everything
> requires work.

There is something similar already in place: when pool utilization is
over 95%, lvmthin *should* try a (lazy) umount. Have a look here:
https://www.redhat.com/archives/linux-lvm/2016-May/msg00042.html

Monitoring is a great thing; anyway, a fail-safe policy would be *very*
nice...

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Xen
2017-04-14 19:20:53 UTC
Gionatan Danti wrote on 14-04-2017 20:59:
> On 14-04-2017 19:36, Xen wrote:
>> The thing is just dismounted apparently; I don't even know what causes
>> it.
>>
>
> Maybe running "iotop -a" for some hours can point you to the right
> direction?
>
>> The other volumes are thin. I am just very afraid of the thing filling
>> up due to some runaway process or an error on my part.
>>
>> If I have a 30GB volume and a 30GB snapshot of that volume, and if
>> this volume is nearly empty and something starts filling it up, it
>> will do twice the writes to the thin pool. Any damage done is doubled.
>>
>> The only thing that could save you (me) at this point is a process
>> instantly responding to some 90% full message and hoping it'd be in
>> time. Of course I don't have this monitoring in place; everything
>> requires work.
>
> There is something similar already in place: when pool utilization is
> over 95%, lvmthin *should* try a (lazy) umount. Give a look here:
> https://www.redhat.com/archives/linux-lvm/2016-May/msg00042.html

I even forgot about that. I have such bad memory.

Checking back, the host that I am now on uses LVM 111 (Debian 8). The
next update is to... 111 ;-).

That was almost a year ago. You were using version 130 back then. I am
still on 111 on Debian ;-).

Zdenek recommended 142 back then.

I could take it out of testing though. Version 168.


> Monitoring is a great thing; anyway, a safe fail policy would be *very*
> nice...

A lazy umount does not invalidate any handles held by processes, for example
a process that has a directory open.

I believe there was an issue with the remount -o ro call? Taking too
many resources for the daemon?

Anyway, I am very happy that the umount happens, if it happens.

I just don't feel comfortable about the system at all. I just don't want
it to crash :p.
Xen
2017-04-15 08:27:46 UTC
Xen wrote on 14-04-2017 19:36:

> I'm sorry, I was trying to discover how to use journalctl to check for
> the message and it is just incredibly painful.

So this is how you find the messages of a certain program with
journalctl:

journalctl SYSLOG_IDENTIFIER=lvm

So user friendly ;-).

Then you need to mimic the behaviour of logtail and write your own
program for it.

You can save the cursor in journalctl in order to do that:

journalctl SYSLOG_IDENTIFIER=lvm --show-cursor

and then use it with --cursor or --after-cursor, but I have no clue what
the difference is.
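(--cursor starts showing entries from the saved position itself,
--after-cursor from the entry just after it.) A small sketch of the
save/resume cycle:

  # first run: dump the lvm messages and remember where we stopped
  journalctl SYSLOG_IDENTIFIER=lvm --show-cursor > /tmp/lvm.log
  cursor=$(sed -n 's/^-- cursor: //p' /tmp/lvm.log)
  # later runs: only show what arrived since then
  journalctl SYSLOG_IDENTIFIER=lvm --after-cursor "$cursor"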

I created a script to be run as a cron job that will email root with a
pretty nice message.

http://www.xenhideout.nl/scripts/snapshot-check.sh.txt

I have only made it handle regular snapshot messages so far; I have
not yet seen fit to test the messages that
libdevmapper-event-lvm2thin.so produces.

But this is the format you can expect:


Snapshot linux/root-snap has been unmounted from /srv/root.

Log message:

Apr 14 21:11:01 perfection lvm[463]: Unmounting invalid snapshot linux-root--snap from /srv/root.

Earlier messages:

Apr 14 19:25:31 perfection lvm[463]: Snapshot linux-root--snap is now 81% full.
Apr 14 19:25:41 perfection lvm[463]: Snapshot linux-root--snap is now 86% full.
Apr 14 21:10:51 perfection lvm[463]: Snapshot linux-root--snap is now 93% full.
Apr 14 21:11:01 perfection lvm[463]: Snapshot linux-root--snap is now 97% full.


I haven't yet fully tested everything but it saves the cursor in
/run/lvm/lastlog ;-), and will not produce output when there is nothing
new. It will not produce any output when run as a cron job and it will
produce status messages when run interactively.

It has options like:
-c : clear the cursor
-u : update the cursor and nothing else
-d : dry-run
--help

There is not yet an option to send the email to stdout but if you botch
up the script or emailing fails it will put the email to stderr so cron
will pick it up.

I guess the next option is to not send an email or to send it to stdout
instead (also picked up by cron normally).

In any case I assume current LVM already has ways of running scripts,
but that depends on dmeventd....

So you could have different actions in the script as well and have it
run other scripts.

Currently the action is simply to email...

Once I have tested thin fillup I will put that in too :p.

Regards.
Xen
2017-04-15 23:35:16 UTC
Xen wrote on 15-04-2017 10:27:

> http://www.xenhideout.nl/scripts/snapshot-check.sh.txt

My script now does thin pool reporting, at least for the data volume
that I could check (tpool).

:p.

It can create messages such as this :p.



Thin is currently at 80%.

Log messages:

Apr 16 00:00:12 perfection lvm[463]: Thin linux-thin-tpool is now 80% full.
Apr 15 18:10:42 perfection lvm[463]: Thin linux-thin-tpool is now 95% full.
Apr 15 18:10:22 perfection lvm[463]: Thin linux-thin-tpool is now 92% full.

Previous messages:

Apr 15 14:38:12 perfection lvm[463]: Thin linux-thin-tpool is now 85% full.
Apr 15 14:37:12 perfection lvm[463]: Thin linux-thin-tpool is now 80% full.


The journalctl cursor was at the 85% mark; that is why an earlier
invocation would have shown the last two messages, while in this
invocation the three above were displayed and found.

(I copied an older cursor file over the cursor location).

So it shows all new messages when there is something new to be shown and
it uses the occasion to also remind you of older messages.

Still working on something better...

But this is already quite nice.

Basically it sends 3 types of emails:

- snapshot filling up
- snapshot filled up completely
- thin pool filling up

But it only responds to dmeventd messages in syslog. Of course you could
take the opportunity to give much more detailed information, which is
what I am working on, but this does require invocations of lvs etc.
Xen
2017-04-17 12:33:35 UTC
Xen wrote on 15-04-2017 10:27:

> I created a script to be run as a cron job that will email root in a
> pretty nice message.
>
> http://www.xenhideout.nl/scripts/snapshot-check.sh.txt

I was just so happy. I guess I can still improve the email, but:


Snapshot linux/root-snap has been unmounted from /srv/root because it
filled up to 100%.

Log message:

Apr 17 14:08:38 perfection lvm[463]: Unmounting invalid snapshot linux-root--snap from /srv/root.

Earlier messages:

Apr 17 14:08:21 perfection lvm[463]: Snapshot linux-root--snap is now 96% full.
Apr 17 14:08:01 perfection lvm[463]: Snapshot linux-root--snap is now 91% full.
Apr 17 14:07:51 perfection lvm[463]: Snapshot linux-root--snap is now 86% full.
Apr 17 14:07:31 perfection lvm[463]: Snapshot linux-root--snap is now 81% full.
-------------------------------------------------------------------------------
Apr 14 21:11:01 perfection lvm[463]: Snapshot linux-root--snap is now 97% full.
Apr 14 21:10:51 perfection lvm[463]: Snapshot linux-root--snap is now 93% full.
Apr 14 19:25:41 perfection lvm[463]: Snapshot linux-root--snap is now 86% full.
Apr 14 19:25:31 perfection lvm[463]: Snapshot linux-root--snap is now 81% full.


I was just upgrading packages, hence the snapshot filled up quickly.


The system works well. I don't get instant reports, but if something happens
within the space of 5 minutes it is too late anyway.

The only downside is that thin messages get repeated whenever snapshots are
(re)created. So lvmetad will output a new message for me at every 0:00. So
if thin usage is > 80%, every day (for me) there is a new message, for no
real reason in that sense.
Xen
2017-04-15 21:22:58 UTC
Gionatan Danti wrote on 14-04-2017 20:59:
> On 14-04-2017 19:36, Xen wrote:
>> The thing is just dismounted apparently; I don't even know what causes
>> it.
>>
>
> Maybe running "iotop -a" for some hours can point you to the right
> direction?

I actually think it is enough if 225 extents get written. The snapshot
is 924 MB, or 250-25 extents.

I think it only needs to write to 225 different places on the disk (225
different 4 MB extents) to fill the snapshot up.

Because there is no way in hell that an actual 924 MB would be written;
the entire system is not more than 5 GB and the entire systemd
journal is not more than maybe 28 MB :p.
Xen
2017-04-15 21:49:57 UTC
Xen wrote on 15-04-2017 23:22:

> I actually think it is enough if 225 extents get written. The snapshot
> is 924 MB or 250-25 extents.

Erm, that's 256 - 25 = 231. My math is good today :p.
Xen
2017-04-15 21:48:32 UTC
Gionatan Danti wrote on 14-04-2017 20:59:

> There is something similar already in place: when pool utilization is
> over 95%, lvmthin *should* try a (lazy) umount. Give a look here:
> https://www.redhat.com/archives/linux-lvm/2016-May/msg00042.html
>
> Monitoring is a great thing; anyway, a safe fail policy would be *very*
> nice...

This is the idea I had back then:

- reserve space for calamities.

- when running out of space, start informing the filesystem(s).

- communicate individual unusable blocks, or simply a number of
unavailable blocks, through some inter-layer communication system.

But it was said that such channels do not exist, or that the concept of a
block device (a logical addressing space) suddenly having trouble
delivering its blocks would be a conflicting concept.

If the concept of a filesystem needing to deal with disappearing space
were to be made real,

what you would get is

that some hidden block of unusable space starts to grow.

Suppose that you have 3 volumes of sizes X, Y and Z.

With the constraint that currently each volume, individually, is capable
of using all the space it wants,

now volume X starts to use up more space and the available remaining
space is no longer enough for Z.

The space available to all volumes is equivalent and is only constrained
by their own virtual sizes.

So, saying that for each volume the available space = min( own filesystem
space, available thin space ),

any consumption by one volume will be seen as a reduction of the
available space, by the same amount, for the other volumes.

For the consuming volume this is to be expected; for the other volumes this
is strange.

Each consumption turns into a reduction for all the other volumes,
including its own.

This reduction of space is therefore a single number that pertains to
all volumes, and only comes into effect if the real available space is
less than the (filesystem-oriented, but really LVM-determined) virtual
space the volume thought it had.

For all volumes that are affected, there is now a discrepancy between
virtual available space and real available space.

This differs per volume but is really just a subtraction. However, LVM
should be able to know this number, since it is just about the number
of extents available and 'needed'.

Zdenek said that this information is not available in a live fashion
because the algorithms that find a new free extent need to go look for
it first.

Regardless, if this information were available it could be communicated to
the logical volume, which could communicate it to the filesystem.

There are two ways: polling a number through some block device command,
or telling the filesystem through a daemon.

Remounting the filesystem read-only is one such "through a daemon"
command.

Zdenek said that dmeventd plugins cannot issue a remount request because
the system call is too big.

But it would be important for the filesystem to have a feature for dealing
with unavailable space, for example by forcing it to reserve a certain amount
of space in a live or dynamic fashion.
Zdenek Kabelac
2017-04-18 10:17:09 UTC
On 15.4.2017 at 23:48, Xen wrote:
> Gionatan Danti wrote on 14-04-2017 20:59:
>
>> There is something similar already in place: when pool utilization is
>> over 95%, lvmthin *should* try a (lazy) umount. Give a look here:
>> https://www.redhat.com/archives/linux-lvm/2016-May/msg00042.html
>>
>> Monitoring is a great thing; anyway, a safe fail policy would be *very* nice...
>
> This is the idea I had back then:
>
> - reserve space for calamities.
>
> - when running out of space, start informing the filesystem(s).
>
> - communicate individual unusable blocks or simple a number of unavailable
> blocks through some inter-layer communication system.
>
> But it was said such channels do not exist or that the concept of a block
> device (a logical addressing space) suddenly having trouble delivering the
> blocks would be a conflicting concept.
>
> If the concept of a filesystem needing to deal with disappearing space were to
> be made live,
>
> what you would get was.
>
> that there starts to grow some hidden block of unusable space.
>
> Supposing that you have 3 volumes of sizes X Y and Z.
>
> With the constraint that currently individually each volume is capable of
> using all space it wants,
>
> now volume X starts to use up more space and the available remaining space is
> no longer enough for Z.
>
> The space available to all volumes is equivalent and is only constrained by
> their own virtual sizes.
>
> So saying that for each volume the available space = min( own filesystem
> space, available thin space )
>
> any consumption by any of the other volumes will see a reduction of the
> available space by the same amount for the other volumes.
>
> For the using volume this is to be expected, for the other volumes this is
> strange.
>
> each consumption turns into a lessening for all the other volumes including
> the own.
>
> this reduction of space is therefore a single number that pertains to all
> volumes and only comes to be in any kind of effect if the real available space
> is less than the (filesystem oriented, but rather LVM determined) virtual
> space the volume thought it had.
>
> for all volumes that are effected, there is now a discrepancy between virtual
> available space and real available space.
>
> this differs per volume but is really just a substraction. However LVM should
> be able to know about this number since it is just about a number of extents
> available and 'needed'.
>
> Zdenek said that this information is not available in a live fashion because
> the algorithms that find a new free extent need to go look for it first.

I already got lost in lots of posts.

But there is a tool, 'thin_ls', which can be used for detailed info about the
space used by every single thin volume.

It's not supported directly by the 'lvm2' command (so not yet presented in a
shiny cool way via 'lvs -a') - but a user can relatively easily run this
command on his own on a live pool.


See usage of:


dmsetup message /dev/mapper/pool 0
[ reserve_metadata_snap | release_metadata_snap ]

and 'man thin_ls'


Just don't forget to release the snapshot of the thin-pool kernel metadata
once it's not needed...
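A possible invocation, assuming a pool vg0/pool0 (the -tpool/_tmeta device
names follow the usual device-mapper convention):

  # snapshot the pool's kernel metadata so thin_ls can read it while the pool is live
  dmsetup message /dev/mapper/vg0-pool0-tpool 0 reserve_metadata_snap
  thin_ls --metadata-snap /dev/mapper/vg0-pool0_tmeta
  # and release it again afterwards
  dmsetup message /dev/mapper/vg0-pool0-tpool 0 release_metadata_snap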

> There are two ways: polling a number through some block device command or
> telling the filesystem through a daemon.
>
> Remounting the filesystem read-only is one such "through a daemon" command.
>

The automatic unmount on thin-pool overflow has been dropped from upstream
versions >169. It's now delegated to a user script executed at % checkpoints
(see 'man dmeventd').
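For example, assuming the thin_command hook in the dmeventd section of
lvm.conf and lvm2 >= 169 (names and threshold purely illustrative):

  # lvm.conf (dmeventd section):
  #     thin_command = "/usr/local/sbin/thin-watch.sh"
  # and a minimal /usr/local/sbin/thin-watch.sh:
  #!/bin/sh
  # query the pool directly rather than relying on whatever dmeventd passes in
  pct=$(lvs --noheadings -o data_percent vg0/pool0 | tr -d ' ')
  # e.g. drop a disposable backup snapshot once the pool is above 95%
  [ "${pct%.*}" -ge 95 ] && lvremove -f vg0/thinvol-bck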

Regards

Zdenek
Gionatan Danti
2017-04-18 13:23:50 UTC
On 18/04/2017 12:17, Zdenek Kabelac wrote:
> Unmount of thin-pool has been dropped from upstream version >169.
> It's now delegated to user script executed on % checkpoints
> (see 'man dmeventd')

Hi Zdenek,
I missed that; thanks.

Any thoughts on the original question? For snapshots with a relatively big
CoW table, from a stability standpoint, how do you feel about classic
vs thin-pool snapshots?

Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Stuart D. Gathman
2017-04-18 14:32:34 UTC
On Tue, 18 Apr 2017, Gionatan Danti wrote:

> Any thoughts on the original question? For snapshot with relatively big CoW
> table, from a stability standpoint, how do you feel about classical vs
> thin-pool snapshot?

Classic snapshots are rock solid. There is no risk to the origin
volume. If the snapshot CoW fills up, all reads and all writes to the
*snapshot* return IOError. The origin is unaffected.

If a classic snapshot exists across a reboot, then the entire CoW table
(but not the data chunks) must be loaded into memory when the snapshot
(or origin) is activated. This can greatly delay boot for a large CoW.

For the common purpose of temporary snapshots for consistent backups,
this is not an issue.

--
Stuart D. Gathman <***@gathman.org>
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.
Xen
2017-04-19 07:22:53 UTC
Zdenek Kabelac wrote on 18-04-2017 12:17:

> Already got lost in lots of posts.
>
> But there is tool 'thin_ls' which can be used for detailed info
> about used space by every single thin volume.
>
> It's not support directly by 'lvm2' command (so not yet presented in
> shiny cool way via 'lvs -a') - but user can relatively easily run this
> command
> on his own on life pool.
>
>
> See for usage of
>
>
> dmsetup message /dev/mapper/pool 0
> [ reserve_metadata_snap | release_metadata_snap ]
>
> and 'man thin_ls'
>
>
> Just don't forget to release snapshot of thin-pool kernel metadata
> once it's not needed...
>
>> There are two ways: polling a number through some block device command
>> or telling the filesystem through a daemon.
>>
>> Remounting the filesystem read-only is one such "through a daemon"
>> command.
>>
>
> Unmount of thin-pool has been dropped from upstream version >169.
> It's now delegated to user script executed on % checkpoints
> (see 'man dmeventd')

So I wrote something useless again ;-).

Always this issue with versions...

So let's see: Debian Unstable (Sid) still has version 168, as does
Testing (Stretch).
Ubuntu Zesty Zapus (17.04) has 167.

So for the foreseeable future neither of those distributions will have that
feature, at least.

I heard you speak of those scripts, yes, but I did not know when or what
yet, thanks.

I guess my script could be run directly from the script execution in the
future then.

Thanks for responding though, much obliged.
Gionatan Danti
2017-04-08 11:56:50 UTC
On 08-04-2017 00:24, Mark Mielke wrote:
>
> We use lvmthin in many areas... from Docker's dm-thinp driver, to XFS
> file systems for PostgreSQL or other data that need multiple
> snapshots, including point-in-time backup of certain snapshots. Then,
> multiple sizes. I don't know that we have 8 TB anywhere right this
> second, but we are using it in a variety of ranges from 20 GB to 4 TB.
>

Very interesting, this is the exact information I hoped to get. Thank
you for reporting.

>
> When you say "nightly", my experience is that processes are writing
> data all of the time. If the backup takes 30 minutes to complete, then
> this is 30 minutes of writes that get accumulated, and subsequent
> performance overhead of these writes.
>
> But, we usually keep multiple hourly snapshots and multiply daily
> snapshots, because we want the option to recover to different points
> in time. With the classic LVM snapshot capability, I believe this is
> essentially non-functional. While it can work with "1 short lived
> snapshot", I don't think it works at all well for "3 hourly + 3 daily
> snapshots". Remember that each write to an area will require that
> area to be replicated multiple times under classic LVM snapshots,
> before the original write can be completed. Every additional snapshot
> is an additional cost.

Right. For such a setup, classic LVM snapshot overhead would be
enormous, grinding everything to a halt.

>
>> I am more concerned about lengthy snapshot activation due to a big,
>> linear CoW table that must be read completely...
>
> I suspect this is a pre-optimization concern, in that you are
> concerned, and you are theorizing about impact, but perhaps you
> haven't measured it yourself, and if you did, you would find there was
> no reason to be concerned. :-)

For classic (non-thinly provisioned) LVM snapshots, relatively big metadata
size was a known problem. Many discussions happened on this list on this very
topic. Basically, when the snapshot metadata size increased above a
certain point (measured in some GB), snapshot activation failed due to
timeouts on LVM commands. This, in turn, was because the legacy snapshot
behavior was not really tuned for long-lived, multi-gigabyte snapshots,
but rather for create-backup-remove usage.

>
> If you absolutely need a contiguous sequence of blocks for your
> drives, because your I/O patterns benefit from this, or because your
> hardware has poor seek performance (such as, perhaps a tape drive? :-)
> ), then classic LVM snapshots would retain this ordering for the live
> copy, and the snapshot could be as short lived as possible to minimize
> overhead to only that time period.
>
> But, in practice - I think the LVM authors of the thinpool solution
> selected a default block size that would exhibit good behaviour on
> most common storage solutions. You can adjust it, but in most cases I
> think I don't bother, and just use the default. There is also the
> behaviour of the systems in general to take into account in that even
> if you had a purely contiguous sequence of blocks, your file system
> probably allocates files all over the drive anyways. With XFS, I
> believe they do this for concurrency, in that two different kernel
> threads can allocate new files without blocking each other, because
> they schedule the writes to two different areas of the disk, with
> separate inode tables.
>
> So, I don't believe the contiguous sequence of blocks is normally a
> real thing. Perhaps a security camera that is recording a 1+ TB video
> stream might allocate contiguous, but basically nothing else does
> this.

True.

>
> To me, LVM thin volumes is the right answer to this problem. It's not
> particularly new or novel either. Most "Enterprise" level storage
> systems have had this capability for many years. At work, we use
> NetApp and they take this to another level with their WAFL =
> Write-Anywhere-File-Layout. For our private cloud solution based upon
> NetApp AFF 8080EX today, we have disk shelves filled with flash
> drives, and NetApp is writing everything "forwards", which extends the
> life of the flash drives, and allows us to keep many snapshots of the
> data. But, it doesn't have to be flash to take advantage of this. We
> also have large NetApp FAS 8080EX or 8060 with all spindles, including
> 3.5" SATA disks. I was very happy to see this type of technology make
> it back into LVM. I think this breathed new life into LVM, and made it
> a practical solution for many new use cases beyond being just a more
> flexible partition manager.
>
> --
>
> Mark Mielke <***@gmail.com>

Yeah, CoW-enabled filesystems are really cool ;) Too bad BTRFS has very
low performance when used as a VM backing store...

Thank you very much Mark, I really appreciate the information you
provided.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Tomas Dalebjörk
2017-04-07 18:21:36 UTC
Hi

Agentless snapshots of the VM server might be an issue for applications
running in the VM guest OS.
Especially as there are no VSS-like features on Linux.

Perhaps someone can introduce a udev listener that could be used for this?

On 6 Apr 2017 16:32, "Gionatan Danti" <***@assyoma.it> wrote:

> Hi all,
> I'm seeking some advice for a new virtualization system (KVM) on top of
> LVM. The goal is to take agentless backups via LVM snapshots.
>
> In short: what you suggest to snapshot a quite big (8+ TB) volume? Classic
> LVM (with old snapshot behavior) or thinlvm (and its new snapshot method)?
>
> Long story:
> In the past, I used classical, preallocated logical volumes directly
> exported as virtual disks. In this case, I snapshot the single LV I want to
> backup and, using dd/ddrescue, I copy it.
>
> Problem is this solution prevents any use of thin allocation or sparse
> files, so I tried to replace it with something filesystem-based. Lately I
> used another approach, configuring a single thinly provisioned LV (with no
> zeroing) + XFS + raw or qcow2 virtual machine images. To make backups, I
> snapshotted the entire thin LV and, after mounting it, I copied the
> required files.
>
> So far this second solution worked quite well. However, before using it in
> more and more installations, I wonder if it is the correct approach or if
> something better, especially from a stability standpoint, is possible.
>
> Given that I would like to use XFS, and that I need snapshots at the block
> level, two possibilities came to mind:
>
> 1) continue to use thinlvm + thin snapshots + XFS. What do you think about
> a 8+ TB thin pool/volume with relatively small (64/128KB) chunks? Would you
> be comfortable using it in production workloads? What about powerloss
> protection? From my understanding, thinlvm passes flushes anytime the
> higher layers issue them and so should be reasonable safe against
> unexpected powerloss. Is this view right?
>
> 2) use a classic (non-thin) LVM + normal snapshot + XFS. I know for sure
> that LV size is not an issue here, however big snapshot size used to be
> problematic: the CoW table had to be read completely before the snapshot
> can be activated. Is this problem a solved one? Or big snapshot can be
> problematic?
>
> Thank you all.
>
> --
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: ***@assyoma.it - ***@assyoma.it
> GPG public key ID: FF5F32A8
>
> _______________________________________________
> linux-lvm mailing list
> linux-***@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
>
Gionatan Danti
2017-04-13 10:20:10 UTC
On 06/04/2017 16:31, Gionatan Danti wrote:
> Hi all,
> I'm seeking some advice for a new virtualization system (KVM) on top of
> LVM. The goal is to take agentless backups via LVM snapshots.
>
> In short: what you suggest to snapshot a quite big (8+ TB) volume?
> Classic LVM (with old snapshot behavior) or thinlvm (and its new
> snapshot method)?
>
> Long story:
> In the past, I used classical, preallocated logical volumes directly
> exported as virtual disks. In this case, I snapshot the single LV I want
> to backup and, using dd/ddrescue, I copy it.
>
> Problem is this solution prevents any use of thin allocation or sparse
> files, so I tried to replace it with something filesystem-based. Lately
> I used another approach, configuring a single thinly provisioned LV
> (with no zeroing) + XFS + raw or qcow2 virtual machine images. To make
> backups, I snapshotted the entire thin LV and, after mounting it, I
> copied the required files.
>
> So far this second solution worked quite well. However, before using it
> in more and more installations, I wonder if it is the correct approach
> or if something better, especially from a stability standpoint, is
> possible.
>
> Given that I would like to use XFS, and that I need snapshots at the
> block level, two possibilities came to mind:
>
> 1) continue to use thinlvm + thin snapshots + XFS. What do you think
> about a 8+ TB thin pool/volume with relatively small (64/128KB) chunks?
> Would you be comfortable using it in production workloads? What about
> powerloss protection? From my understanding, thinlvm passes flushes
> anytime the higher layers issue them and so should be reasonable safe
> against unexpected powerloss. Is this view right?
>
> 2) use a classic (non-thin) LVM + normal snapshot + XFS. I know for sure
> that LV size is not an issue here, however big snapshot size used to be
> problematic: the CoW table had to be read completely before the snapshot
> can be activated. Is this problem a solved one? Or big snapshot can be
> problematic?
>
> Thank you all.
>

Hi,
anyone with other thoughts on the matter?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Xen
2017-04-13 12:41:45 UTC
Gionatan Danti wrote on 13-04-2017 12:20:

> Hi,
> anyone with other thoughts on the matter?

I wondered why a single thin LV does work for you, in terms of not
wasting space or being able to make more efficient use of "volumes" or
client volumes or whatever,

but a multitude of thin volumes won't.

See, you only compared multiple non-thin volumes with a single thin one.

So my question is:

did you consider multiple thin volumes?
Gionatan Danti
2017-04-14 07:20:14 UTC
On 13-04-2017 14:41, Xen wrote:
>
> See, you only compared multiple non-thin with a single-thin.
>
> So my question is:
>
> did you consider multiple thin volumes?
>

Hi, the multiple-thin-volume solution, while being very flexible, is not
well understood by libvirt and virt-manager. So I need to pass on that
(for the moment at least).

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2017-04-14 08:24:10 UTC
On 14.4.2017 at 09:20, Gionatan Danti wrote:
> On 13-04-2017 14:41, Xen wrote:
>>
>> See, you only compared multiple non-thin with a single-thin.
>>
>> So my question is:
>>
>> did you consider multiple thin volumes?
>>
>
> Hi, the multiple-thin-volume solution, while being very flexible, is not well
> understood by libvirt and virt-manager. So I need to pass on that (for the
> moment at least).
>


Well, recent versions of lvm2 (>=169, even though they are marked as
experimental) do support executing a script/command for easier maintenance
when a thin-pool fills above some percentage.

So it should be 'relatively' easy to set up a solution where you can fill
your pool to e.g. 90%, and if it gets above that - kill your surrounding
libvirt and resolve the missing resources (deleting virt machines..)

But you currently cannot expect to fill the thin-pool to full capacity
and have everything continue to run smoothly - this is not going to
happen.

However, there are many different solutions for different problems - and
with the current script execution a user may build his own solution - i.e.
call 'dmsetup remove -f' on the running thin volumes, so all instances get
an 'error' device when the pool rises above some threshold setting (just
like the old 'snapshot' invalidation worked). This way the user only kills
the tasks using the thin volumes, but still keeps the thin-pool usable for
easy maintenance.
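
As an illustration only - a minimal sketch of such a policy script (the
VG/pool names, the 90% threshold and the volume selection are placeholders,
not lvm2 defaults; with lvm2 >=169 it could be hooked in via the
'thin_command' setting in lvm.conf, on older versions it can simply be run
from cron):

#!/bin/sh
# sketch: when thin-pool data usage crosses a threshold, replace all its
# active thin LVs with 'error' targets so their users fail instead of hanging
VG=vg                       # placeholder volume group name
POOL=pool                   # placeholder thin-pool name
THRESHOLD=90
DATA=$(lvs --noheadings -o data_percent "$VG/$POOL" | cut -d. -f1 | tr -d ' ')
if [ "$DATA" -ge "$THRESHOLD" ]; then
    for lv in $(lvs --noheadings -o lv_name -S "pool_lv=$POOL" "$VG"); do
        dmsetup remove -f "$VG-$lv"   # note: '-' in VG/LV names is doubled in dm names
    done
fi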

Regards

Zdenek
Gionatan Danti
2017-04-14 09:07:53 UTC
Permalink
Il 14-04-2017 10:24 Zdenek Kabelac ha scritto:
>
> But it's currently impossible to expect you will fill the thin-pool to
> full capacity and everything will continue to run smoothly - this is
> not going to happen.

Even with EXT4 and errors=remount-ro?

>
> However there are many different solutions for different problems -
> and with current script execution - user may build his own solution -
> i.e. call
> 'dmsetup remove -f' for running thin volumes - so all instances get
> 'error' device when pool is above some threshold setting (just like
> old 'snapshot' invalidation worked) - this way user will just kill
> thin volume user task, but will still keep thin-pool usable for easy
> maintenance.
>

Interesting. However, the main problem with libvirt is that its
pool/volume management falls apart when used on thin-pools. Basically,
libvirt does not understand that a thinpool is a container for thin
volumes (i.e.:
https://www.redhat.com/archives/libvirt-users/2014-August/msg00010.html)

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2017-04-14 09:37:37 UTC
Permalink
Dne 14.4.2017 v 11:07 Gionatan Danti napsal(a):
> Il 14-04-2017 10:24 Zdenek Kabelac ha scritto:
>>
>> But it's currently impossible to expect you will fill the thin-pool to
>> full capacity and everything will continue to run smoothly - this is
>> not going to happen.
>
> Even with EXT4 and errors=remount-ro?

While 'remount-ro' may prevent any significant damage to the filesystem as
such - since the 1st problem detected by ext4 stops further writes - it's
still not quite trivial to proceed from there.

The problem is not 'stopping' access - but gaining the access back.

So in this case you need to run 'fsck' - and this fsck usually needs more
space - and the complexity starts with where to get this space.

In the 'most trivial' case you have free space in the 'VG' - you just extend
the thin-pool, run 'fsck', and it works.

But then there are a number of cases, ending with the one where you run out
of metadata space, which has a maximal size of ~16G - so you can't even
extend it, simply because using any bigger size is unsupported.
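
For the 'most trivial' case above, the recovery would look roughly like this
(vg/pool and thinvol are placeholder names, the sizes are arbitrary):

lvextend -L +10G vg/pool                  # grow the data device of the thin-pool
lvextend --poolmetadatasize +1G vg/pool   # grow metadata too if needed (up to the ~16G limit)
fsck.ext4 -f /dev/vg/thinvol              # then run fsck on the filesystem on the thin LV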

So while every case has some way forward, none of them can be easily
automated.

And it's so much easier to monitor and prevent this from happening than to
solve these things later.

So all that is needed is that the user is aware of what he is using and
takes the proper action at the proper time.

>
>>
>> However there are many different solutions for different problems -
>> and with current script execution - user may build his own solution -
>> i.e. call
>> 'dmsetup remove -f' for running thin volumes - so all instances get
>> 'error' device when pool is above some threshold setting (just like
>> old 'snapshot' invalidation worked) - this way user will just kill
>> thin volume user task, but will still keep thin-pool usable for easy
>> maintenance.
>>
>
> Interesting. However, the main problem with libvirt is that its pool/volume
> management fall apart when used on thin-pools. Basically, libvirt does not
> understand that a thinpool is a container for thin volumes (ie:
> https://www.redhat.com/archives/libvirt-users/2014-August/msg00010.html)

Well lvm2 provides the low-level tooling here....

Zdenek
Gionatan Danti
2017-04-14 09:55:21 UTC
Permalink
Il 14-04-2017 11:37 Zdenek Kabelac ha scritto:
> The problem is not with 'stopping' access - but to gain the access
> back.
>
> So in this case - you need to run 'fsck' - and this fsck usually needs
> more space - and the complexity starts with - where to get this space.
>
> In the the 'most trivial' case - you have the space in 'VG' - you just
> extend thin-pool and you run 'fsck' and it works.
>
> But then there is number of cases ending with the case that you run
> out of metadata space that has the maximal size of ~16G so you can't
> even extend it, simply because it's unsupported to use any bigger
> size.
>
> So while every case has some way forward how to proceed - none of them
> could be easily automated.

To better understand: what would be the (manual) solution here, if
metadata is full and cannot be extended due to the hard 16 GB limit?

> And it's so much easier to monitor and prevent this to happen compared
> with solving these thing later.
>
> So all is needed is - user is aware what he is using and does proper
> action and proper time.
>

Absolutely. However, monitoring can also fail - a safe failure model is
a really important thing.

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Xen
2017-04-22 16:32:11 UTC
Permalink
Gionatan Danti schreef op 22-04-2017 9:14:
> Il 14-04-2017 10:24 Zdenek Kabelac ha scritto:
>> However there are many different solutions for different problems -
>> and with current script execution - user may build his own solution -
>> i.e. call
>> 'dmsetup remove -f' for running thin volumes - so all instances get
>> 'error' device when pool is above some threshold setting (just like
>> old 'snapshot' invalidation worked) - this way user will just kill
>> thin volume user task, but will still keep thin-pool usable for easy
>> maintenance.
>>
>
> This is a very good idea - I tried it and it indeed works.

So a user script can execute dmsetup remove -f on the thin pool?

Oh no, for all volumes.

That is awesome, that means an errors=remount-ro mount will cause a
remount, right?

> However, it is not very clear to me what is the best method to monitor
> the allocated space and trigger an appropriate user script (I
> understand that versione > .169 has %checkpoint scripts, but current
> RHEL 7.3 is on .166).
>
> I had the following ideas:
> 1) monitor the syslog for the "WARNING pool is dd.dd% full" message;

This is what my script is doing of course. It is a bit ugly and a bit
messy by now, but I could still clean it up :p.

However it does not follow syslog, but checks periodically. You can also
follow with -f.

It does not allow for user specified actions yet.

In that case it would fulfill the same purpose as > 169, only a bit more
poorly.
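
Roughly something like this (a stripped-down sketch, not my actual script;
the log path, pool name pattern and mail address are placeholders):

#!/bin/sh
# periodically grep syslog for the dmeventd warning and mail the last one found;
# a real script would remember which lines were already reported, or cross-check
# current usage with 'lvs -o data_percent' before sending anything
LAST=$(grep "lvm\[.*WARNING.*pool.*data is now.*full" /var/log/syslog | tail -n 1)
if [ -n "$LAST" ]; then
    echo "$LAST" | mail -s "thin-pool warning" root
fi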

> One more thing: from device-mapper docs (and indeed as observerd in my
> tests), the "pool is dd.dd% full" message is raised one single time:
> if a message is raised, the pool is emptied and refilled, no new
> messages are generated. The only method I found to let the system
> re-generate the message is to deactiveate and reactivate the thin pool
> itself.

This is not my experience on LVM 111 from Debian.

For me new messages are generated when:

- the pool reaches any threshold again
- I remove and recreate any thin volume.

Because my system regenerates snapshots, I now get an email from my
script when the pool is > 80%, every day.

So if I keep the pool above 80%, every day at 0:00 I get an email about
it :p. Because syslog gets a new entry for it. This is why I know :p.

> And now the most burning question ... ;)
> Given that thin-pool is under monitor and never allowed to fill
> data/metadata space, as do you consider its overall stability vs
> classical thick LVM?
>
> Thanks.
Gionatan Danti
2017-04-22 20:58:10 UTC
Permalink
Il 22-04-2017 18:32 Xen ha scritto:
> This is not my experience on LVM 111 from Debian.
>
> For me new messages are generated when:
>
> - the pool reaches any threshold again
> - I remove and recreate any thin volume.
>
> Because my system regenerates snapshots, I now get an email from my
> script when the pool is > 80%, every day.
>
> So if I keep the pool above 80%, every day at 0:00 I get an email
> about it :p. Because syslog gets a new entry for it. This is why I
> know :p.
>

Interesting, I had to try that ;)
Thanks for suggesting.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2017-04-22 21:17:42 UTC
Permalink
Dne 22.4.2017 v 18:32 Xen napsal(a):
> Gionatan Danti schreef op 22-04-2017 9:14:
>> Il 14-04-2017 10:24 Zdenek Kabelac ha scritto:
>>> However there are many different solutions for different problems -
>>> and with current script execution - user may build his own solution -
>>> i.e. call
>>> 'dmsetup remove -f' for running thin volumes - so all instances get
>>> 'error' device when pool is above some threshold setting (just like
>>> old 'snapshot' invalidation worked) - this way user will just kill
>>> thin volume user task, but will still keep thin-pool usable for easy
>>> maintenance.
>>>
>>
>> This is a very good idea - I tried it and it indeed works.
>
> So a user script can execute dmsetup remove -f on the thin pool?
>
> Oh no, for all volumes.
>
> That is awesome, that means a errors=remount-ro mount will cause a remount right?

Well, 'remount-ro' will fail, but you will not be able to read anything
from the volume either.

So as said - many users, many different solutions needed...

Currently lvm2 can't support that much variety and complexity...


>
>> However, it is not very clear to me what is the best method to monitor
>> the allocated space and trigger an appropriate user script (I
>> understand that versione > .169 has %checkpoint scripts, but current
>> RHEL 7.3 is on .166).
>>
>> I had the following ideas:
>> 1) monitor the syslog for the "WARNING pool is dd.dd% full" message;
>
> This is what my script is doing of course. It is a bit ugly and a bit messy by
> now, but I could still clean it up :p.
>
> However it does not follow syslog, but checks periodically. You can also
> follow with -f.
>
> It does not allow for user specified actions yet.
>
> In that case it would fulfill the same purpose as > 169 only a bit more poverly.
>
>> One more thing: from device-mapper docs (and indeed as observerd in my
>> tests), the "pool is dd.dd% full" message is raised one single time:
>> if a message is raised, the pool is emptied and refilled, no new
>> messages are generated. The only method I found to let the system
>> re-generate the message is to deactiveate and reactivate the thin pool
>> itself.
>
> This is not my experience on LVM 111 from Debian.
>
> For me new messages are generated when:
>
> - the pool reaches any threshold again
> - I remove and recreate any thin volume.
>
> Because my system regenerates snapshots, I now get an email from my script
> when the pool is > 80%, every day.
>
> So if I keep the pool above 80%, every day at 0:00 I get an email about it :p.
> Because syslog gets a new entry for it. This is why I know :p.

The explanation here is simple - when you create a new thinLV there is
currently a full suspend - before the 'suspend' the pool is 'unmonitored'
and after the resume it is monitored again - so you get your warning
logged again.


Zdenek
Xen
2017-04-23 05:29:32 UTC
Permalink
Zdenek Kabelac schreef op 22-04-2017 23:17:

>> That is awesome, that means a errors=remount-ro mount will cause a
>> remount right?
>
> Well 'remount-ro' will fail but you will not be able to read anything
> from volume as well.

Well that is still preferable to anything else.

It is preferable to a system crash, I mean.

So if there is no other last resort, I think this is really the only one
that exists?

Or maybe one of the other things Gionatan suggested.

> Currently lvm2 can't support that much variety and complexity...

I think it's simpler but okay, sure...

I think pretty much anyone would prefer a volume-read-errors system
rather than a kernel-hang system.

It is just not of the same magnitude of disaster :p.

> The explanation here is simple - when you create a new thinLV - there
> is currently full suspend - and before 'suspend' pool is 'unmonitored'
> after resume again monitored - and you get your warning logged again.

Right, yes, that's what syslog says.

It does make it a bit annoying to be watching for messages but I guess
it means filtering for the monitoring messages too.

If you want to filter out the recurring message, or check current thin
pool usage before you send anything.
Zdenek Kabelac
2017-04-23 09:26:43 UTC
Permalink
Dne 23.4.2017 v 07:29 Xen napsal(a):
> Zdenek Kabelac schreef op 22-04-2017 23:17:
>
>>> That is awesome, that means a errors=remount-ro mount will cause a remount
>>> right?
>>
>> Well 'remount-ro' will fail but you will not be able to read anything
>> from volume as well.
>
> Well that is still preferable to anything else.
>
> It is preferable to a system crash, I mean.
>
> So if there is no other last rather, I think this is really the only last
> resort that exists?
>
> Or maybe one of the other things Gionatan suggested.
>
>> Currently lvm2 can't support that much variety and complexity...
>
> I think it's simpler but okay, sure...
>
> I think pretty much anyone would prefer a volume-read-errors system rather
> than a kernel-hang system.

I'm just curious - what do you think will happen when you have
root_LV as a thin LV and the thin pool runs out of space - so 'root_LV'
is replaced with an 'error' target.

How do you think this will be ANY different from hanging your system?


> It is just not of the same magnitude of disaster :p.

IMHO a reboot is still quite a fair solution in such a case.

Regards

Zdenek
Xen
2017-04-24 21:02:36 UTC
Permalink
Zdenek Kabelac schreef op 23-04-2017 11:26:

> I'm just currious - what the you think will happen when you have
> root_LV as thin LV and thin pool runs out of space - so 'root_LV'
> is replaced with 'error' target.

Why do you suppose Root LV is on thin?

Why not just stick to the common scenario when thin is used for extra
volumes or data?

I mean to say that you are raising an exceptional situation as an
argument against something that I would consider quite common, which
doesn't quite work that way: you can't prove that most people would not
want something by raising something most people wouldn't use.

I mean to say let's just look at the most common denominator here.

Root LV on thin is not that.

I have tried it, yes. It gives trouble with Grub, requires the thin package
to be installed on all systems, and makes it harder to install a system
too.

Thin root LV is not the idea for most people.

So again, don't you think having data volumes produce errors is
preferable to having the entire system hang?

> How do you think this will be ANY different from hanging your system ?

Doesn't happen cause you're not using that.

You're smarter than that.

So it doesn't happen and it's not a use case here.

> IMHO reboot is still quite fair solution in such case.

That's irrelevant; if the thin pool is full you need to mitigate it,
rebooting won't help with that.

And if your root is on thin, rebooting won't do you much good either. So
you had best keep a running system in which you can mitigate it live,
instead of rebooting to no avail.

That's just my opinion and a lot more commonsensical than what you just
said, I think.

But to each his own.
Zdenek Kabelac
2017-04-24 21:59:06 UTC
Permalink
Dne 24.4.2017 v 23:02 Xen napsal(a):
> Zdenek Kabelac schreef op 23-04-2017 11:26:
>
>> I'm just currious - what the you think will happen when you have
>> root_LV as thin LV and thin pool runs out of space - so 'root_LV'
>> is replaced with 'error' target.
>
> Why do you suppose Root LV is on thin?
>
> Why not just stick to the common scenario when thin is used for extra volumes
> or data?
>
> I mean to say that you are raising an exceptional situation as an argument
> against something that I would consider quite common, which doesn't quite work
> that way: you can't prove that most people would not want something by raising
> something most people wouldn't use.
>
> I mean to say let's just look at the most common denominator here.
>
> Root LV on thin is not that.

Well then you might be surprised - there are users using exactly this.

When you have rootLV on thinLV - you could easily snapshot it before doing any
upgrade and revert back in case something fails on upgrade.
See also projects like snapper...

>
> I have tried it, yes. Gives troubles with Grub and requires thin package to be
> installed on all systems and makes it harder to install a system too.

lvm2 is cooking some better boot support atm....



> Thin root LV is not the idea for most people.
>
> So again, don't you think having data volumes produce errors is not preferable
> to having the entire system hang?

Not sure why you insist the system hangs.

If the system hangs - and you have a recent kernel & lvm2 - you should file a bug.

If you set '--errorwhenfull y' - it should instantly fail.

There should not be any hanging..
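
E.g. (placeholder names; without this, writes needing new space are queued
for the kernel's no_space_timeout, 60 seconds by default, before erroring):

lvchange --errorwhenfull y vg/pool    # error immediately on writes that need new allocation
lvs -o+lv_when_full vg/pool           # verify - should report 'error' instead of 'queue'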


> That's irrelevant; if the thin pool is full you need to mitigate it, rebooting
> won't help with that.

well, it's really the admin's task to solve the problem after the panic call
(adding new space).

Thin users can't overload the system in crazy ways and expect the system
will easily do something magical to restore all data.

Regards

Zdenek
Gionatan Danti
2017-04-26 07:26:36 UTC
Permalink
Il 24-04-2017 23:59 Zdenek Kabelac ha scritto:
> If you set '--errorwhenfull y' - it should instantly fail.

It's my understanding that "--errorwhenfull y" will instantly fail
writes which imply new allocation requests, but writes to
already-allocated space will be completed.

Is it possible, without messing directly with device mapper (via
dmsetup), to configure a strict "read-only" policy, where *all* writes
(both to allocated and unallocated space) will fail?

If it is not possible via the lvm tools, what device-mapper target
should be used, and how?
Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2017-04-26 07:42:46 UTC
Permalink
Dne 26.4.2017 v 09:26 Gionatan Danti napsal(a):
> Il 24-04-2017 23:59 Zdenek Kabelac ha scritto:
>> If you set '--errorwhenfull y' - it should instantly fail.
>
> It's my understanding that "--errorwhenfull y" will instantly fail writes
> which imply new allocation requests, but writes to already-allocated space
> will be completed.

yes you understand it properly.

>
> It is possible, without messing directly with device mapper (via dmsetup), to
> configure a strict "read-only" policy, where *all* writes (both to allocated
> or not allocated space) will fail?

Nope it's not.

>
> It is not possible to do via lvm tools, what/how device-mapper target should
> be used?

At this moment it's not possible.
I do have some plans/ideas how to work around this in user-space, but it's
non-trivial - especially on the recovery path.

It would be possible to 'reroute' the thin to dm-delay, and then point the
write path to error while leaving the read path as is - but that adds many
new states to handle, so ATM it's in the queue...

Using 'ext4' with remount-ro is fairly easy to set up and gets you exactly
this logic.
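
E.g. (device and mountpoint are placeholders):

mount -o errors=remount-ro /dev/vg/thinvol /mnt/data
# or persistently in /etc/fstab:
# /dev/vg/thinvol  /mnt/data  ext4  defaults,errors=remount-ro  0  2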
Gionatan Danti
2017-04-26 08:10:24 UTC
Permalink
Il 26-04-2017 09:42 Zdenek Kabelac ha scritto:
> At this moment it's not possible.
> I do have some plans/idea how to workaround this in user-space but
> it's non-trivial - especially on recovery path.
>
> It would be possible to 'reroute' thin to dm-delay and then write path
> to error and read path leave as is - but it's adding many new states
> to handle,
> to ATM it's in queue...

Good to know. Thank you.

> Using 'ext4' with remount-ro is fairly easy to setup and get exactly
> this
> logic.

I'm not sure this is sufficient. In my testing, ext4 will *not*
remount-ro on any error, rather only on erroneous metadata updates. For
example, on a thinpool with "--errorwhenfull y", trying to overcommit
data with a simple "dd if=/dev/zero of=/mnt/thinvol bs=1M count=1024
oflag=sync" will cause I/O errors (as shown by dmesg), but the
filesystem is *not* immediately remounted read-only. Rather, after some
time, a failed journal update will remount it read-only.

XFS should behave similarly, with the exception that it will shutdown
the entire filesystem (ie: not even reads are allowed) when metadata
errors are detected (see note n.1).

The problem is that, as the filesystem often writes its own metadata to
already-allocated disk space, the out-of-space condition (and the resulting
filesystem shutdown) will take some time to be recognized.

Note n.1
From RED HAT STORAGE ADMINISTRATION GUIDE
(https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Storage_Administration_Guide/ch06s09.html#idp17392328):

Metadata error behavior
The ext3/4 file system has configurable behavior when metadata errors
are encountered, with the default being to simply continue. When XFS
encounters a metadata error that is not recoverable it will shut down
the file system and return a EFSCORRUPTED error. The system logs will
contain details of the error encountered and will recommend running
xfs_repair if necessary.


--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2017-04-26 11:23:44 UTC
Permalink
Dne 26.4.2017 v 10:10 Gionatan Danti napsal(a):
>
> I'm not sure this is sufficient. In my testing, ext4 will *not* remount-ro on
> any error, rather only on erroneous metadata updates. For example, on a
> thinpool with "--errorwhenfull y", trying to overcommit data with a simple "dd
> if=/dev/zero of=/mnt/thinvol bs=1M count=1024 oflag=sync" will cause I/O
> errors (as shown by dmesg), but the filesystem is *not* immediately remounted
> read-only. Rather, after some time, a failed journal update will remount it
> read-only.

You need to use 'direct' write mode - otherwise you are just witnessing
issues related to 'page-cache' flushing.

Every update of a file means an update of the journal - so you surely can
lose some data in-flight - but any good software needs to flush before doing
the next transaction - so with correctly working transactional software no
data should be lost.

>
> XFS should behave similarly, with the exception that it will shutdown the
> entire filesystem (ie: not even reads are allowed) when metadata errors are
> detected (see note n.1).

Yep - XFS is slightly different - but it is getting improved; however, some
new features are not enabled by default and the user needs to enable them.

Regards

Zdenek
Gionatan Danti
2017-04-26 13:37:37 UTC
Permalink
On 26/04/2017 13:23, Zdenek Kabelac wrote:
>
> You need to use 'direct' write more - otherwise you are just witnessing
> issues related with 'page-cache' flushing.
>
> Every update of file means update of journal - so you surely can lose
> some data in-flight - but every good software needs to the flush before
> doing next transaction - so with correctly working transaction software
> no data could be lost.

I used "oflag=sync" for this very reason - to avoid async writes,
However, let's retry with "oflat=direct,sync".

This is the thinpool before filling:

[***@blackhole mnt]# lvs
  LV       VG        Attr       LSize  Pool     Origin Data%  Meta%  Move Log Cpy%Sync Convert
  thinpool vg_kvm    twi-aot---  1.00g                  87.66  12.01
  thinvol  vg_kvm    Vwi-aot---  2.00g thinpool         43.83
  root     vg_system -wi-ao---- 50.00g
  swap     vg_system -wi-ao----  7.62g

[***@blackhole storage]# mount | grep thinvol
/dev/mapper/vg_kvm-thinvol on /mnt/storage type ext4
(rw,relatime,seclabel,errors=remount-ro,stripe=32,data=ordered)


Fill the thin volume (note that errors are raised immediately due to
--errorwhenfull=y):

[***@blackhole mnt]# dd if=/dev/zero of=/mnt/storage/test.2 bs=1M
count=300 oflag=direct,sync
dd: error writing ‘/mnt/storage/test.2’: Input/output error
127+0 records in
126+0 records out
132120576 bytes (132 MB) copied, 14.2165 s, 9.3 MB/s

From syslog:

Apr 26 15:26:24 localhost lvm[897]: WARNING: Thin pool
vg_kvm-thinpool-tpool data is now 96.84% full.
Apr 26 15:26:27 localhost kernel: device-mapper: thin: 253:4: reached
low water mark for data device: sending event.
Apr 26 15:26:27 localhost kernel: device-mapper: thin: 253:4: switching
pool to out-of-data-space (error IO) mode
Apr 26 15:26:34 localhost lvm[897]: WARNING: Thin pool
vg_kvm-thinpool-tpool data is now 100.00% full.

Despite write errors, the filesystem is not in read-only mode:

[***@blackhole mnt]# touch /mnt/storage/test.txt; sync; ls -al
/mnt/storage
total 948248
drwxr-xr-x. 3 root root 4096 26 apr 15.27 .
drwxr-xr-x. 6 root root 51 20 apr 15.23 ..
drwx------. 2 root root 16384 26 apr 15.24 lost+found
-rw-r--r--. 1 root root 838860800 26 apr 15.25 test.1
-rw-r--r--. 1 root root 132120576 26 apr 15.26 test.2
-rw-r--r--. 1 root root 0 26 apr 15.27 test.txt

I can even recover free space via fstrim:

[***@blackhole mnt]# rm /mnt/storage/test.1; sync
rm: remove regular file ‘/mnt/storage/test.1’? y
[***@blackhole mnt]# fstrim -v /mnt/storage/
/mnt/storage/: 828 MiB (868204544 bytes) trimmed
[***@blackhole mnt]# lvs
  LV       VG        Attr       LSize  Pool     Origin Data%  Meta%  Move Log Cpy%Sync Convert
  thinpool vg_kvm    twi-aot---  1.00g                  21.83   3.71
  thinvol  vg_kvm    Vwi-aot---  2.00g thinpool         10.92
  root     vg_system -wi-ao---- 50.00g
  swap     vg_system -wi-ao----  7.62g

From syslog:
Apr 26 15:34:15 localhost kernel: device-mapper: thin: 253:4: switching
pool to write mode

To me, it seems that the metadata updates completed because they hit
already-allocated disk space, so they did not trigger the remount-ro code.
Am I missing something?

Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2017-04-26 14:33:15 UTC
Permalink
Dne 26.4.2017 v 15:37 Gionatan Danti napsal(a):
>
> On 26/04/2017 13:23, Zdenek Kabelac wrote:
>>
>> You need to use 'direct' write more - otherwise you are just witnessing
>> issues related with 'page-cache' flushing.
>>
>> Every update of file means update of journal - so you surely can lose
>> some data in-flight - but every good software needs to the flush before
>> doing next transaction - so with correctly working transaction software
>> no data could be lost.
>
> I used "oflag=sync" for this very reason - to avoid async writes, However,
> let's retry with "oflat=direct,sync".
>
> This is the thinpool before filling:
>
> [***@blackhole mnt]# lvs
> LV VG Attr LSize Pool Origin Data% Meta% Move Log
> Cpy%Sync Convert
> thinpool vg_kvm twi-aot--- 1.00g 87.66 12.01
> thinvol vg_kvm Vwi-aot--- 2.00g thinpool 43.83
> root vg_system -wi-ao---- 50.00g
> swap vg_system -wi-ao---- 7.62g
>
> [***@blackhole storage]# mount | grep thinvol
> /dev/mapper/vg_kvm-thinvol on /mnt/storage type ext4
> (rw,relatime,seclabel,errors=remount-ro,stripe=32,data=ordered)
>
>
> Fill the thin volume (note that errors are raised immediately due to
> --errorwhenfull=y):
>
> [***@blackhole mnt]# dd if=/dev/zero of=/mnt/storage/test.2 bs=1M count=300
> oflag=direct,sync
> dd: error writing ‘/mnt/storage/test.2’: Input/output error
> 127+0 records in
> 126+0 records out
> 132120576 bytes (132 MB) copied, 14.2165 s, 9.3 MB/s
>
> From syslog:
>
> Apr 26 15:26:24 localhost lvm[897]: WARNING: Thin pool vg_kvm-thinpool-tpool
> data is now 96.84% full.
> Apr 26 15:26:27 localhost kernel: device-mapper: thin: 253:4: reached low
> water mark for data device: sending event.
> Apr 26 15:26:27 localhost kernel: device-mapper: thin: 253:4: switching pool
> to out-of-data-space (error IO) mode
> Apr 26 15:26:34 localhost lvm[897]: WARNING: Thin pool vg_kvm-thinpool-tpool
> data is now 100.00% full.
>
> Despite write errors, the filesystem is not in read-only mode:


But you get a correct 'write' error - so from the application's POV you get
a failing transaction update/write - so the app knows the 'data' were lost
and should not proceed with the next transaction - so it's in line with 'no
data is lost', and the filesystem is not damaged and is in a correct state
(mountable).


Zdenek
Gionatan Danti
2017-04-26 16:37:37 UTC
Permalink
Il 26-04-2017 16:33 Zdenek Kabelac ha scritto:
> But you get correct 'write' error - so from application POV - you get
> failing
> transaction update/write - so app knows 'data' were lost and should
> not proceed with next transaction - so it's in line with 'no data is
> lost' and filesystem is not damaged and is in correct state
> (mountable).

True, but the case exists that, even on a full pool, an application with
multiple outstanding writes will have some of them completed/committed
while others get an I/O error, as writes to already-allocated space are
permitted while writes to non-allocated space are failed. If, for
example, I overwrite some already-allocated files, the writes will be
committed even if the pool is completely full.

In past discussions, I had the impression that the only filesystem you
feel safe with on a thinpool is ext4 + remount-ro, on the assumption that
*any* failed write will trigger the read-only mode. But from my test it
seems that only *failed metadata updates* trigger the read-only mode. If
this is really the case, remount-ro really is a mandatory option.
However, as metadata can reside on already-allocated blocks, even on a
full pool they have a chance to be committed, without triggering the
remount-ro.

At the same time, I thought that you consider the thinpool + xfs combo
somewhat "risky", as xfs does not have a remount-ro option. Actually,
xfs seems to *always* shut down the filesystem in case of a failed
metadata update.

Maybe I misunderstood some of your messages; in that case, sorry for that.

Anyway, I think (and maybe I am wrong...) that the better solution is to
fail *all* writes to a full pool, even the ones directed to allocated
space. This will effectively "freeze" the pool and avoid any
long-standing inconsistencies.

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Stuart Gathman
2017-04-26 18:32:49 UTC
Permalink
On 04/26/2017 12:37 PM, Gionatan Danti wrote:
>
> Anyway, I think (and maybe I am wrong...) that the better solution is
> to fail *all* writes to a full pool, even the ones directed to
> allocated space. This will effectively "freeze" the pool and avoid any
> long-standing inconsistencies.
+1 This is what I have been advocating also
Stuart Gathman
2017-04-26 19:24:54 UTC
Permalink
On 04/26/2017 12:37 PM, Gionatan Danti wrote:
>
> Anyway, I think (and maybe I am wrong...) that the better solution is
> to fail *all* writes to a full pool, even the ones directed to
> allocated space. This will effectively "freeze" the pool and avoid any
> long-standing inconsistencies.
Or slightly better: fail *all* writes to a full pool after the *first*
write to an unallocated area. That way, operation can continue a little
longer without risking inconsistency so long as all writes are to
allocated areas.
Gionatan Danti
2017-05-02 11:00:37 UTC
Permalink
On 26/04/2017 18:37, Gionatan Danti wrote:
> True, but the case exists that, even on a full pool, an application with
> multiple outstanding writes will have some of them completed/commited
> while other get I/O error, as writes to already allocated space are
> permitted while writes to non-allocated space are failed. If, for
> example, I overwrite some already-allocated files, writes will be
> committed even if the pool is completely full.
>
> In past discussion, I had the impression that the only filesystem you
> feel safe with thinpool is ext4 + remount-ro, on the assumption that
> *any* failed writes will trigger the read-only mode. But from my test it
> seems that only *failed metadata updates* trigger the read-only mode. If
> this is really the case, remount-ro really is a mandatory option.
> However, as metadata can reside on alredy-allocated blocks, even of a
> full pool they have a chance to be committed, without triggering the
> remount-ro.
>
> At the same time, I thought that you consider the thinpool + xfs combo
> somewhat "risky", as xfs does not have a remount-ro option. Actually,
> xfs seems to *always* shutdown the filesystem in case of failed metadata
> update.
>
> Maybe I misunderstood some yours message; in this case, sorry for that.
>
> Anyway, I think (and maybe I am wrong...) that the better solution is to
> fail *all* writes to a full pool, even the ones directed to allocated
> space. This will effectively "freeze" the pool and avoid any
> long-standing inconsistencies.
>
> Thanks.
>

Hi Zdenek, I would *really* like to hear back from you on these questions.
Can we consider thinlvm + xfs as safe as thinlvm + ext4?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Gionatan Danti
2017-05-12 13:02:58 UTC
Permalink
On 02/05/2017 13:00, Gionatan Danti wrote:
>
>
> On 26/04/2017 18:37, Gionatan Danti wrote:
>> True, but the case exists that, even on a full pool, an application with
>> multiple outstanding writes will have some of them completed/commited
>> while other get I/O error, as writes to already allocated space are
>> permitted while writes to non-allocated space are failed. If, for
>> example, I overwrite some already-allocated files, writes will be
>> committed even if the pool is completely full.
>>
>> In past discussion, I had the impression that the only filesystem you
>> feel safe with thinpool is ext4 + remount-ro, on the assumption that
>> *any* failed writes will trigger the read-only mode. But from my test it
>> seems that only *failed metadata updates* trigger the read-only mode. If
>> this is really the case, remount-ro really is a mandatory option.
>> However, as metadata can reside on alredy-allocated blocks, even of a
>> full pool they have a chance to be committed, without triggering the
>> remount-ro.
>>
>> At the same time, I thought that you consider the thinpool + xfs combo
>> somewhat "risky", as xfs does not have a remount-ro option. Actually,
>> xfs seems to *always* shutdown the filesystem in case of failed metadata
>> update.
>>
>> Maybe I misunderstood some yours message; in this case, sorry for that.
>>
>> Anyway, I think (and maybe I am wrong...) that the better solution is to
>> fail *all* writes to a full pool, even the ones directed to allocated
>> space. This will effectively "freeze" the pool and avoid any
>> long-standing inconsistencies.
>>
>> Thanks.
>>
>
> Hi Zdeneck, I would *really* to hear back you on these questions.
> Can we consider thinlvm + xfs as safe as thinlvm + ext4 ?
>
> Thanks.
>

Hi all and sorry for the bump...
Anyone with some comments on these questions?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Joe Thornber
2017-05-12 13:42:02 UTC
Permalink
On Fri, May 12, 2017 at 03:02:58PM +0200, Gionatan Danti wrote:
> On 02/05/2017 13:00, Gionatan Danti wrote:
> >>Anyway, I think (and maybe I am wrong...) that the better solution is to
> >>fail *all* writes to a full pool, even the ones directed to allocated
> >>space. This will effectively "freeze" the pool and avoid any
> >>long-standing inconsistencies.

I think dm-thin behaviour is fine given the semantics of write
and flush IOs.

A block device can complete a write even if it hasn't hit the physical
media; a flush request needs to come in at a later time, which means
'flush all IOs that you've previously completed'. So any software using
a block device (fs, database etc), tends to generate batches of writes,
followed by a flush to commit the changes. For example if there was a
power failure between the batch of write io completing and the flush
completing you do not know how much of the writes will be visible when
the machine comes back.

When a pool is full it will allow writes to provisioned areas of a thin to
succeed. But if any writes failed due to inability to provision then a
REQ_FLUSH io to that thin device will *not* succeed.
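
For example, on a thin LV whose pool is already full and set to error when
full (paths and sizes are placeholders), the observable behaviour is roughly:

# overwriting blocks that are already provisioned still succeeds, flush included:
dd if=/dev/zero of=/mnt/thin/existing-file bs=1M count=4 conv=notrunc,fsync
# writing data that needs new provisioning fails, and so does any later flush
# covering the failed IO - a correct application treats the whole batch as not durable:
dd if=/dev/zero of=/mnt/thin/new-file bs=1M count=4 oflag=direct conv=fsync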

- Joe
Gionatan Danti
2017-05-14 20:39:21 UTC
Permalink
Il 12-05-2017 15:42 Joe Thornber ha scritto:
> On Fri, May 12, 2017 at 03:02:58PM +0200, Gionatan Danti wrote:
>> On 02/05/2017 13:00, Gionatan Danti wrote:
>> >>Anyway, I think (and maybe I am wrong...) that the better solution is to
>> >>fail *all* writes to a full pool, even the ones directed to allocated
>> >>space. This will effectively "freeze" the pool and avoid any
>> >>long-standing inconsistencies.
>
> I think dm-thin behaviour is fine given the semantics of write
> and flush IOs.
>
> A block device can complete a write even if it hasn't hit the physical
> media, a flush request needs to come in at a later time which means
> 'flush all IOs that you've previously completed'. So any software
> using
> a block device (fs, database etc), tends to generate batches of writes,
> followed by a flush to commit the changes. For example if there was a
> power failure between the batch of write io completing and the flush
> completing you do not know how much of the writes will be visible when
> the machine comes back.
>
> When a pool is full it will allow writes to provisioned areas of a thin
> to
> succeed. But if any writes failed due to inability to provision then a
> REQ_FLUSH io to that thin device will *not* succeed.
>
> - Joe

True, but the real problem is that most of the failed flushes will *not*
bring the filesystem read-only, as both ext4 and xfs seem to go
read-only only when *metadata* updates fail. As this very same list
recommends using ext4 with errors=remount-ro on the basis that putting
the filesystem in a read-only state after any error is the right thing, I
was somewhat alarmed to find that, as far as I can tell, ext4 goes
read-only on metadata errors only.

So, let me reiterate: can we consider thinlvm + xfs as safe as thinlvm +
ext4 + errors=remount-ro?
Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2017-05-15 12:50:52 UTC
Permalink
Dne 14.5.2017 v 22:39 Gionatan Danti napsal(a):
> Il 12-05-2017 15:42 Joe Thornber ha scritto:
>> On Fri, May 12, 2017 at 03:02:58PM +0200, Gionatan Danti wrote:
>>> On 02/05/2017 13:00, Gionatan Danti wrote:
>>> >>Anyway, I think (and maybe I am wrong...) that the better solution is to
>>> >>fail *all* writes to a full pool, even the ones directed to allocated
>>> >>space. This will effectively "freeze" the pool and avoid any
>>> >>long-standing inconsistencies.
>>
>> I think dm-thin behaviour is fine given the semantics of write
>> and flush IOs.
>>
>> A block device can complete a write even if it hasn't hit the physical
>> media, a flush request needs to come in at a later time which means
>> 'flush all IOs that you've previously completed'. So any software using
>> a block device (fs, database etc), tends to generate batches of writes,
>> followed by a flush to commit the changes. For example if there was a
>> power failure between the batch of write io completing and the flush
>> completing you do not know how much of the writes will be visible when
>> the machine comes back.
>>
>> When a pool is full it will allow writes to provisioned areas of a thin to
>> succeed. But if any writes failed due to inability to provision then a
>> REQ_FLUSH io to that thin device will *not* succeed.
>>
>> - Joe
>
> True, but the real problem is that most of the failed flushes will *not* bring
> the filesystem read-only, as both ext4 and xfs seems to go read-only only when
> *metadata* updates fail. As this very same list recommend using ext4 with
> errors=remount-ro on the basis that putting the filesystem in a read-only
> state after any error I the right thing, I was somewhat alarmed to find that,
> as far I can tell, ext4 goes read-only on metadata errors only.
>
> So, let me reiterate: can we consider thinlvm + xfs as safe as thinlvm + ext4
> + errors=remount-ro?


Hi

I still think you are mixing apples & oranges together and expecting the
answer '42' :)

There is simply NO simple answer. Every case has its pros & cons.

There are simply cases where XFS beats Ext4, and there are opposite
situations as well.

Also, you WILL always get a WRITE error - if your application doesn't care
about write errors, why do you expect any block-device logic to rescue you??

An out-of-space thin-pool is simply a device which looks like a seriously
damaged disk, where you can always read without any problem but writes fail
here and there.

IMHO both filesystems, XFS & Ext4, do work well on recent kernels - but no
one can say there are no problems at all.

Things are getting better - but planning your thin-pool usage around
'recovering' an overfilled pool is simply BAD planning. You should plan your
thin-pool usage so that it does NOT run out of space.

And as a last comment, as I always say - a full thin-pool is not similar to
a full filesystem, where you drop some 'large' file and you are happily
working again - it does not work this way - and if someone hoped for this,
he needs to use something else ATM.


Regards


Zdenek
Gionatan Danti
2017-05-15 14:48:17 UTC
Permalink
On 15/05/2017 14:50, Zdenek Kabelac wrote:
> Hi
>
> I still think you are mixing apples & oranges together and you expecting
> answer '42' :)

'42' would be the optimal answer :p

> There is simply NO simple answer. Every case has its pros & cons.
>
> There is simply cases where XFS beats Ext4 and there are opposite
> situations as well.

Maybe I'm too naive, but I have a hard time grasping all the
implications of this sentence.

I fully understand that, currently, a full thinp is basically a "damaged
disk", where some writes can complete (good/provisioned zones) and some
fail (bad/unprovisioned zones). I also read the device-mapper docs and I
understand that, currently, a "fail all writes but let reads succeed"
target does not exist.

What I do not understand is how XFS and EXT4 differ when a thinp is
full. From a previous reply of yours, after I asked how to put a thinp in
read-only mode when full:

"Using 'ext4' with remount-ro is fairly easy to setup and get exactly
this logic."

My naive interpretation is that when EXT4 detects *any* I/O error, it
will set the filesystem in read-only mode. Except that my tests show
that only failed *metadata* updates put the filesystem in this state. The
bad thing is that, when not using "remount-ro", even failed metadata
updates will *not* trigger any read-only response.

In short, am I right in saying that EXT4 should *always* be used with
"remount-ro" when stacked on top of a thinp?

On the other hand, XFS has no such option, but by default it ensures
that failed *metadata* updates will stop the filesystem. Even reads are
not allowed (to regain read access, you need to repair the filesystem or
mount it with "ro,norecovery").

So, it should be even safer than EXT4, right? Or do you feel it is the
other way around? If so, why?

> Things are getting better - but planning usage of thin-pool to
> 'recover' overfilled pool is simple BAD planning. You should plan your
> thin-pool usage to NOT run out-of-space.

Sure, and I am *not* planning for it. But as bad things always happen,
I'm preparing for them ;)

> And last comment I always say - full thin-pool it not similar to full
> filesystem where you drop some 'large' file and you are happily working
> again - it's not working this way - and if someone hoped into this - he
> needs to use something else ATM.

Absolutely.

Sorry if I seem pedantic, I am genuinely trying to understand.
Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2017-05-15 15:33:19 UTC
Permalink
Dne 15.5.2017 v 16:48 Gionatan Danti napsal(a):
> On 15/05/2017 14:50, Zdenek Kabelac wrote> Hi
>>
> What I does not understand is how XFS and EXT4 differs when a thinp is full.
> From a previous your reply, after I asked how to put thinp in read only mode
> when full:
>
> "Using 'ext4' with remount-ro is fairly easy to setup and get exactly this
> logic."
>
> My naive interpretation is that when EXT4 detects *any* I/O error, it will set
> the filesystem in read-only mode. Except that my tests show that only failed
> *metadata* update put the filesystem in this state. The bad thingh is that,
> when not using "remount-ro", even failed metadata updates will *not* trigger
> any read-only response.


Ever tested this:

mount -o errors=remount-ro,data=journal ?

Everything has its price - if you also want 'safe' data - well, you have
to pay the price.


> On the other hand, XFS has not such options but it, by default, ensures that
> failed *metadata* updates will stop the filesystem. Even reads are not allowed
> (to regain read access, you need to repair the filesystem or mount it with
> "ro,norecovery").
>
> So, it should be even safer than EXT4, right? Or do you feel that is the other
> way around? If so, why?

I prefer 'remount-ro' as the FS is still at least accessible/usable in some way.



>
>> Things are getting better - but planning usage of thin-pool to 'recover'
>> overfilled pool is simple BAD planning. You should plan your thin-pool usage
>> to NOT run out-of-space.
>
> Sure, and I am *not* planning for it. But as bad things always happen, I'm
> preparing for them ;)

When you have extra space you can add for recovery - it's usually easy.
But you will have a much harder time doing recovery without extra space.

So again - all has its price....

Regards

Zdenek
Gionatan Danti
2017-05-16 07:53:33 UTC
Permalink
On 15/05/2017 17:33, Zdenek Kabelac wrote:
> Ever tested this:
>
> mount -o errors=remount-ro,data=journal ?

Yes, I tested it - same behavior: a full thinpool does *not* immediately
put the filesystem in a read-only state, even when using sync/fsync and
"errorwhenfull=y".

So, it seems EXT4 remounts in read-only mode only when *metadata*
updates fail.

> I prefer 'remount-ro' as the FS is still at least accessible/usable in
> some way.

Fair enough.

>>> Things are getting better

Can you give an example?

> So again - all has its price....

True ;)

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2017-05-16 10:54:01 UTC
Permalink
Dne 16.5.2017 v 09:53 Gionatan Danti napsal(a):
> On 15/05/2017 17:33, Zdenek Kabelac wrote:> Ever tested this:
>>
>> mount -o errors=remount-ro,data=journal ?
>
> Yes, I tested it - same behavior: a full thinpool does *not* immediately put
> the filesystem in a read-only state, even when using sync/fsync and
> "errorwhenfull=y".

Hi

Somehow I think you've rather made a mistake during your test (or you have a
buggy kernel). Can you take a full log of your test showing that all options
are properly applied,

i.e. the dmesg log + a /proc/self/mountinfo report showing all options used
for the mountpoint, and the kernel version in use?

IMHO you should get something like this in dmesg once your pool gets out of
space and starts to return error on write:

----
Aborting journal on device dm-4-8.
EXT4-fs error (device dm-4): ext4_journal_check_start:60: Detected aborted journal
EXT4-fs (dm-4): Remounting filesystem read-only
----


Clearly, when you specify 'data=journal', even a data write failure will
cause a journal error and thus the remount-ro reaction (at least it does on
my box) - but such usage is noticeably slower compared with 'ordered' mode.


Regards

Zdenek
Gionatan Danti
2017-05-16 13:38:56 UTC
Permalink
On 16/05/2017 12:54, Zdenek Kabelac wrote:
>
> Hi
>
> Somehow I think you've rather made a mistake during your test (or you
> have buggy kernel). Can you take full log of your test show all options
> are
> properly applied
>
> i.e. dmesg log + /proc/self/mountinfo report showing all options used
> for mountpoint and kernel version in use.
>
> IMHO you should get something like this in dmesg once your pool gets out
> of space and starts to return error on write:
>
> ----
> Aborting journal on device dm-4-8.
> EXT4-fs error (device dm-4): ext4_journal_check_start:60: Detected
> aborted journal
> EXT4-fs (dm-4): Remounting filesystem read-only
> ----
>
>
> Clearly when you specify 'data=journal' even write failure of data will
> cause journal error and thus remount-ro reaction (it least on my box
> does it) - but such usage is noticeable slower compared with 'ordered'
> mode.

Zdenek, you are right: re-executing the test, I now see the following
dmesg entries:

[ 1873.677882] Aborting journal on device dm-6-8.
[ 1873.757170] EXT4-fs error (device dm-6): ext4_journal_check_start:56:
Detected aborted journal
[ 1873.757184] EXT4-fs (dm-6): Remounting filesystem read-only

At the same time, looking at bash history and /var/log/messages it
*seems* that I did nothing wrong with previous tests. I'll do more tests
and post here if I find something relevant.

Thanks for your time and patience.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Xen
2018-02-27 18:39:44 UTC
Permalink
Zdenek Kabelac schreef op 24-04-2017 23:59:

>>> I'm just currious - what the you think will happen when you have
>>> root_LV as thin LV and thin pool runs out of space - so 'root_LV'
>>> is replaced with 'error' target.
>>
>> Why do you suppose Root LV is on thin?
>>
>> Why not just stick to the common scenario when thin is used for extra
>> volumes or data?
>>
>> I mean to say that you are raising an exceptional situation as an
>> argument against something that I would consider quite common, which
>> doesn't quite work that way: you can't prove that most people would
>> not want something by raising something most people wouldn't use.
>>
>> I mean to say let's just look at the most common denominator here.
>>
>> Root LV on thin is not that.
>
> Well then you might be surprised - there are user using exactly this.

I am sorry, this is a long time ago.

I was concerned with thin full behaviour and I guess I was concerned
with being able to limit thin snapshot sizes.

I said that application failure was acceptable, but system failure not.

Then you brought up root on thin as a way of "upping the ante".

I contended that this is a bigger problem to tackle, but it shouldn't
mean you shouldn't tackle the smaller problems.

(The smaller problem being data volumes).

Even if root is on thin and you are using it for snapshotting, it would
be extremely unwise to overprovision such a thing or to depend on
"additional space" being added by the admin; root filesystems are not
meant to be expandable.

If on the other hand you do count on overprovisioning (due to snapshots)
then being able to limit snapshot size becomes even more important.

> When you have rootLV on thinLV - you could easily snapshot it before
> doing any upgrade and revert back in case something fails on upgrade.
> See also projects like snapper...

True enough, but if you risk filling your pool because you don't have
full room for a full snapshot, that would be extremely unwise. I'm also
not sure write performance for a single snapshot is very much different
between thin and non-thin?

They are both CoW. E.g. if you write to an existing block it has to be
duplicated; only for non-allocated writes is thin faster, right?

I simply cannot reconcile an attitude that thin-full-risk is acceptable
and the admin's job while at the same time advocating it for root
filesystems.

Now, for most of this thread I was under the impression that "SYSTEM HANGS"
were the norm, because that's the only thing I ever experienced (kernel
3.x and kernel 4.4 back then); however, you said that this was fixed in
later kernels.

So given that, some of the disagreement here was void as apparently no
one advocated that these hangs were acceptable ;-).

:).


>> I have tried it, yes. Gives troubles with Grub and requires thin
>> package to be installed on all systems and makes it harder to install
>> a system too.
>
> lvm2 is cooking some better boot support atm....

Grub-probe couldn't find the root volume so I had to maintain my own
grub.cfg.

Regardless, if I ever used this again I would take care to never
overprovision, or to only overprovision at low risk with respect to
snapshots.

Ie. you could thin provision root + var or something similar but I would
always put data volumes (home etc) elsewhere.

Ie. not share the same pool.

Currently I was using a regular snapshot but I allocated it too small
and it always got dropped much faster than I anticipated.

(A 1GB snapshot constantly filling up with even minor upgrade
operations).



>> Thin root LV is not the idea for most people.
>>
>> So again, don't you think having data volumes produce errors is not
>> preferable to having the entire system hang?
>
> Not sure why you insist system hangs.
>
> If system hangs - and you have recent kernel & lvm2 - you should fill
> bug.
>
> If you set '--errorwhenfull y' - it should instantly fail.
>
> There should not be any hanging..

Right well Debian Jessie and Ubuntu Xenial just experienced that.



>> That's irrelevant; if the thin pool is full you need to mitigate it,
>> rebooting won't help with that.
>
> well it's really admins task to solve the problem after panic call.
> (adding new space).

That's a lot easier if your root filesystem doesn't lock up.

;-).

Good luck booting to some rescue environment on a VPS or with some boot
stick on a PC; the Ubuntu rescue environment for instance has been
abysmal since SystemD.

You can't actually use the rescue environment because there is some
weird interaction with systemd spewing messages and causing weird
behaviour on the TTY you are supposed to work on.

Initrd works, yes, but the "full rescue" systemd target doesn't.

My point with this thread was.....




When my root snapshot fills up and gets dropped, I lose my undo history,
but at least my root filesystem won't lock up.

I just calculated the size too small and I am sure I can also put a
snapshot IN a thin pool for a non-thin root volume?

Haven't tried.

However, I don't have the space for a full copy of every filesystem, so
if I snapshot, I will automatically overprovision.

My snapshots are indeed meant for backups (of data volumes) - not for
rollback - and for rollback only in the case of the root filesystem.

So: my thin snapshots are meant for backup,
my root snapshot (non-thin) is meant for rollback.

But, if any application really misbehaved... previously the entire
system would crash (kernel 3.x).

So, the only defense is constant monitoring and emails, or even tty/pty
broadcasts, because sometimes it is just human error where you copy the
wrong thing to the wrong place.

Because I cannot limit my (backup) snapshots in size.

With sufficient monitoring I guess that is not much of an issue.

> Thin users can't expect to overload system in crazy way and expect the
> system will easily do something magical to restore all data.

That was never asked.

My problem was system hangs, but my question was about limiting snapshot
size on thin.

However userspace response scripts were obviously possible.....

Including those that would prioritize dropping thin snapshots over other
measures.
Zdenek Kabelac
2018-02-28 09:26:44 UTC
Permalink
Dne 27.2.2018 v 19:39 Xen napsal(a):
> Zdenek Kabelac schreef op 24-04-2017 23:59:
>
>>>> I'm just currious -  what the you think will happen when you have
>>>> root_LV as thin LV and thin pool runs out of space - so 'root_LV'
>>>> is replaced with 'error' target.
>>>
>>> Why do you suppose Root LV is on thin?
>>>
>>> Why not just stick to the common scenario when thin is used for extra
>>> volumes or data?
>>>
>>> I mean to say that you are raising an exceptional situation as an argument
>>> against something that I would consider quite common, which doesn't quite
>>> work that way: you can't prove that most people would not want something by
>>> raising something most people wouldn't use.
>>>
>>> I mean to say let's just look at the most common denominator here.
>>>
>>> Root LV on thin is not that.
>>
>> Well then you might be surprised - there are user using exactly this.
>
> I am sorry, this is a long time ago.
>
> I was concerned with thin full behaviour and I guess I was concerned with
> being able to limit thin snapshot sizes.
>
> I said that application failure was acceptable, but system failure not.

Hi

I'll probably repeat myself again, but thin provisioning can't be responsible
for all kernel failures. There is no way the DM team can fix all the related
paths on this road.

If you don't plan to help resolve those issues - there is no point in
complaining over and over again - we are already well aware of these issues...

The admin needs to be aware of the 'pros & cons' and has to use thin
technology in the right place for the right task.

If the admin can't stand a failing system, he can't use thin-p.

Overprovisioning at the DEVICE level simply IS NOT equivalent to a full
filesystem, as you would like to see it all the time here, and it has already
been explained to you many times that filesystems are simply not ready for
this - fixes are ongoing, but they will take their time, and it's really
pointless to exercise this on 2-3 year old kernels...

Thin provisioning has its use cases and it expects the admin to be well aware
of the possible problems.

If you are aiming for a magic box that always works right - stay away from
thin-p - that is the best advice....

> Even if root is on thin and you are using it for snapshotting, it would be
> extremely unwise to overprovision such a thing or to depend on "additional
> space" being added by the admin; root filesystems are not meant to be expandable.

Do NOT take a thin snapshot of your root filesystem and you will avoid the
thin-pool overprovisioning problem.

> True enough, but if you risk filling your pool because you don't have full
> room for a full snapshot, that would be extremely unwise. I'm also not sure
> write performance for a single snapshot is very much different between thin
> and non-thin?

Rule #1:

The thin-pool was never targeted at 'regular' usage of a full thin-pool.
A full thin-pool is a serious ERROR condition with bad/ill effects on systems.
The thin-pool was designed to 'delay/postpone' real space usage - i.e. you can
use more 'virtual' space with the promise that you deliver real storage later.

So if you have different goals - like having some kind of equivalence to a
full filesystem - you need to write a different target....


> I simply cannot reconcile an attitude that thin-full-risk is acceptable and
> the admin's job while at the same time advocating it for root filesystems.

Do NOT use thin-provisioning - as it does not meet your requirements.

> Now most of this thread I was under the impression that "SYSTEM HANGS" where
> the norm because that's the only thing I ever experienced (kernel 3.x and
> kernel 4.4 back then), however you said that this was fixed in later kernels.

Big news - we are at ~4.16 kernel upstream - so no one is really taking much
care about 4.4 troubles here - sorry about that....

Speaking of 4.4 - I'd generally advise jumping to a higher kernel version
ASAP - since 4.4 has some known bad behavior in the case where the thin-pool
'metadata' gets overfilled.


>> lvm2 is cooking some better boot support atm....
>
> Grub-probe couldn't find the root volume so I had to maintain my own grub.cfg.

There is an ongoing 'BOOM' project - check it out, please....


>> There should not be any hanging..
>
> Right well Debian Jessie and Ubuntu Xenial just experienced that.

There is not much point in commenting on support for some old distros, other
than that you really should try harder with your distro maintainers....

>>> That's irrelevant; if the thin pool is full you need to mitigate it,
>>> rebooting won't help with that.
>>
>> well it's really admins task to solve the problem after panic call.
>> (adding new space).
>
> That's a lot easier if your root filesystem doesn't lock up.

- this is not really a fault of the dm thin-provisioning kernel part.
- ongoing fixes to filesystems are being pushed upstream (for years).
- fixes will not appear in years-old kernels, as such patches are usually
invasive, so unless you pay someone to do the backporting job, the easiest
way forward is to use a newer, improved kernel..

> When my root snapshot fills up and gets dropped, I lose my undo history, but
> at least my root filesystem won't lock up.

lvm2 fully supports these snapshots as well as thin snapshots.
The admin has to choose 'the best fit'.

ATM the thin-pool can't deliver equivalent logic - just like old-style
snapshots can't deliver the thin-pool logic.

> However, I don't have the space for a full copy of every filesystem, so if I
> snapshot, I will automatically overprovision.

Back to rule #1 - thin-p is about 'delaying' the delivery of real space.
If you already plan to never deliver the promised space - you need to live
with the consequences....


> My snapshots are indeed meant for backups (of data volumes) ---- not for
> rollback ----- and for rollback ----- but only for the root filesystem.

There is a more fundamental problem here:

!SNAPSHOTS ARE NOT BACKUPS!

This is the key problem with your thinking here (unfortunately you are not
'alone' in this thinking)


> With sufficient monitoring I guess that is not much of an issue.

We do provide quite good 'scripting' support for this case - but again, if
the system must not crash, you can't use a thin-pool for your root LV, or you
can't use over-provisioning.


> My problem was system hangs, but my question was about limiting snapshot size
> on thin.

Well, your problem is primarily the use of a too-old system....

Sorry to say this - but if you insist on sticking with an old system - ask
your distro maintainers to do all the backporting work for you - this is
nothing lvm2 can help with...


Regards

Zdenek
Gionatan Danti
2018-02-28 19:07:08 UTC
Permalink
Hi all,

On 28-02-2018 10:26 Zdenek Kabelac wrote:
> Overprovisioning on DEVICE level simply IS NOT equivalent to full
> filesystem like you would like to see all the time here and you've
> been already many times explained that filesystems are simply not
> there ready - fixes are on going but it will take its time and it's
> really pointless to exercise this on 2-3 year old kernels...

this was really beaten to death in the past months/threads. I generally
agree with Zdenek.

To recap (Zdenek, correct me if I am wrong): the main problem is that, on a
full pool, async writes will more-or-less silently fail (with errors shown in
dmesg, but nothing more). Another possible cause of problems is that, even on
a full pool, *some* writes will complete correctly (the ones to
already-allocated chunks).

In the past it was argued that putting the entire pool in read-only mode
(where *all* writes fail, but reads are permitted to complete) would be a
better fail-safe mechanism; however, it was stated that no current dm target
permits that.

Two (good) solutions were given, both relying on scripting (see the
"thin_command" option in lvm.conf):
- fsfreeze on a nearly full pool (ie: >=98%);
- replace the dm-thin target with the error target (using dmsetup).

I really think that with the good scripting infrastructure currently built
into lvm this is a more-or-less solved problem.
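
For example, a minimal, untested sketch of such a hook could look like the
following (the VG/pool/mountpoint names and the 98% threshold are only
placeholders, and I am assuming dmeventd really exports the pool data usage
in DMEVENTD_THIN_POOL_DATA, as the lvm.conf description suggests):

    #!/bin/sh
    # invoked via the thin_command hook in lvm.conf
    VG=vg0
    POOL=thinpool
    MNT=/srv/thin-fs
    DATA=${DMEVENTD_THIN_POOL_DATA%%.*}     # data usage, percent

    if [ "${DATA:-0}" -ge 98 ]; then
        # approach 1: freeze the filesystem sitting on the pool
        fsfreeze --freeze "$MNT"

        # approach 2 (instead of the freeze): swap the pool mapping for the
        # error target so *every* further write fails
        # DEV=${VG}-${POOL}-tpool           # name layout may differ, see 'dmsetup ls'
        # SECTORS=$(blockdev --getsz /dev/mapper/$DEV)
        # dmsetup suspend $DEV && dmsetup load $DEV --table "0 $SECTORS error" && dmsetup resume $DEV
    fi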

> Do NOT take thin snapshot of your root filesystem so you will avoid
> thin-pool overprovisioning problem.

But is someone *really* pushing thinp for the root filesystem? I always used
it for data partitions only... Sure, rollback capability on root is nice, but
it is the data that is *really* important.

> Thin-pool was never targeted for 'regular' usage of full thin-pool.
> Full thin-pool is serious ERROR condition with bad/ill effects on
> systems.
> Thin-pool was designed to 'delay/postpone' real space usage - aka you
> can use more 'virtual' space with the promise you deliver real storage
> later.

In stress testing, I never saw a system crash on a full thin pool, but I was
not using it on the root filesystem. Are there any ill effects on system
stability which I need to know about?

>> When my root snapshot fills up and gets dropped, I lose my undo
>> history, but at least my root filesystem won't lock up.

We discussed that in the past also, but as snapshot volumes really are
*regular*, writable volumes (with a 'k' flag to skip activation by default),
the LVM team takes the "safe" stance of not automatically dropping any
volume.

The solution is to use scripting/thin_command with lvm tags. For example:
- tag all snapshots with a "snap" tag;
- when usage is dangerously high, drop all volumes with the "snap" tag.
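
Something along these lines should do (vg0 and the tag name are only
examples; the @tag notation lets lvm commands address every LV carrying
that tag):

    # tag the snapshot when creating it
    lvcreate -s vg0/data -n data_snap --addtag snap

    # later, from the monitoring script, drop everything carrying the tag
    lvremove -y @snap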

>> However, I don't have the space for a full copy of every filesystem,
>> so if I snapshot, I will automatically overprovision.
>
> Back to rule #1 - thin-p is about 'delaying' deliverance of real space.
> If you already have plan to never deliver promised space - you need to
> live with consequences....

I am not sure I 100% agree with that. Thinp is not only about "delaying"
space provisioning; it clearly is also (mostly?) about fast, modern, usable
snapshots. Docker, snapper, stratis, etc. all use thinp mainly for its fast,
efficient snapshot capability. Denying that is not so useful and leads to
"overwarning" (ie: when snapshotting a volume on a virtually-fillable thin
pool).

>
> !SNAPSHOTS ARE NOT BACKUPS!
>
> This is the key problem with your thinking here (unfortunately you are
> not 'alone' with this thinking)

Snapshots are not backups, as they do not protect from hardware problems
(and denying that would be lame); however, they are an invaluable *part* of a
successful backup strategy. Having multiple rollback targets, even on the
same machine, is a very useful tool.

> We do provide quite good 'scripting' support for this case - but again
> if
> the system can't crash - you can't use thin-pool for your root LV or
> you can't use over-provisioning.

Again, I don't understand why we are speaking about system crashes. With
root *not* using thinp, I never saw a system crash due to a full data
pool.

Oh, and I use thinp on RHEL/CentOS only (Debian/Ubuntu backports are way
too limited).

Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2018-02-28 21:43:26 UTC
Permalink
On 28.2.2018 at 20:07, Gionatan Danti wrote:
> Hi all,
>
> Il 28-02-2018 10:26 Zdenek Kabelac ha scritto:
>> Overprovisioning on DEVICE level simply IS NOT equivalent to full
>> filesystem like you would like to see all the time here and you've
>> been already many times explained that filesystems are simply not
>> there ready - fixes are on going but it will take its time and it's
>> really pointless to exercise this on 2-3 year old kernels...
>
> this was really beaten to death in the past months/threads. I generally agree
> with Zedenk.
>
> To recap (Zdeneck, correct me if I am wrong): the main problem is that, on a
> full pool, async writes will more-or-less silenty fail (with errors shown on
> dmesg, but nothing more). Another possible cause of problem is that, even on a
> full pool, *some* writes will complete correctly (the one on already allocated
> chunks).

By default, a full pool starts to 'error' all 'writes' after 60 seconds.
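
For reference, these are roughly the knobs involved (if I recall the module
parameter name correctly; the pool name is only an example):

    # make the pool error out immediately instead of queuing writes
    lvcreate -L 100G -T vg0/pool --errorwhenfull y
    lvchange --errorwhenfull y vg0/pool     # or switch an existing pool

    # the 60-second queuing limit is a dm-thin module parameter (seconds)
    cat /sys/module/dm_thin_pool/parameters/no_space_timeout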

>
> In the past was argued that putting the entire pool in read-only mode (where
> *all* writes fail, but read are permitted to complete) would be a better
> fail-safe mechanism; however, it was stated that no current dmtarget permit that.

Yep - I'd probably like to see a slightly different mechanism - where all
ongoing writes would fail - as it is now, some 'writes' will pass (those to
already-provisioned areas) and some will fail (those to unprovisioned ones).

The main problem is that - after a reboot - this 'missing/unprovisioned'
space may provide some old data...

>
> Two (good) solution where given, both relying on scripting (see "thin_command"
> option on lvm.conf):
> - fsfreeze on a nearly full pool (ie: >=98%);
> - replace the dmthinp target with the error target (using dmsetup).

Yep - this can all happen via 'monitoring'.
The key is to do it early, before the disaster happens.

> I really think that with the good scripting infrastructure currently built in
> lvm this is a more-or-less solved problem.

It still depends - there is always some sort of 'race' - unless you are
willing to 'give-up' too early to be always sure, considering there are
technologies that may write many GB/s...

>> Do NOT take thin snapshot of your root filesystem so you will avoid
>> thin-pool overprovisioning problem.
>
> But is someone *really* pushing thinp for root filesystem? I always used it

You can use the rootfs with thinp - it's very fast for testing i.e. upgrades
and quickly reverting back - there just should be enough free space.
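
A rough sketch of that workflow, assuming a thin root LV and an lvm2 version
that supports merging thin snapshots (names are only examples):

    # before the upgrade: cheap thin snapshot of the root LV
    lvcreate -s vg0/root -n root_pre_upgrade

    # upgrade went wrong: merge the snapshot back into the origin
    # (the merge completes on the next activation, e.g. after a reboot)
    lvconvert --merge vg0/root_pre_upgrade

    # upgrade went fine: drop the snapshot to release the space
    lvremove vg0/root_pre_upgrade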

> In stress testing, I never saw a system crash on a full thin pool, but I was
> not using it on root filesystem. There are any ill effect on system stability
> which I need to know?

Depends on the version of the kernel and the filesystem in use.

Note the RHEL/CentOS kernel has lots of backports even when it looks quite old.


> The solution is to use scripting/thin_command with lvm tags. For example:
> - tag all snapshot with a "snap" tag;
> - when usage is dangerously high, drop all volumes with "snap" tag.

Yep - every user has different plans in mind - scripting gives the user the
freedom to adapt this logic to local needs...

>>> However, I don't have the space for a full copy of every filesystem, so if
>>> I snapshot, I will automatically overprovision.

As long as the responsible admin controls the space in the thin-pool and
takes action long before the thin-pool runs out of space, all is fine.

If the admin hopes for some kind of magic to happen - we have a problem....
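
In practice that just means watching the pool and growing it in time,
something like this (names and sizes are only examples):

    # watch data and metadata usage of the pool
    lvs -o lv_name,data_percent,metadata_percent vg0/pool

    # grow the pool (and, if needed, its metadata) while there is still headroom
    lvextend -L +50G vg0/pool
    lvextend --poolmetadatasize +1G vg0/pool

    # or let dmeventd do it automatically via lvm.conf:
    #   activation/thin_pool_autoextend_threshold and
    #   activation/thin_pool_autoextend_percent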


>>
>> Back to rule #1 - thin-p is about 'delaying' deliverance of real space.
>> If you already have plan to never deliver promised space - you need to
>> live with consequences....
>
> I am not sure to 100% agree on that. Thinp is not only about "delaying" space
> provisioning; it clearly is also (mostly?) about fast, modern, usable
> snapshots. Docker, snapper, stratis, etc. all use thinp mainly for its fast,
> efficent snapshot capability. Denying that is not so useful and led to
> "overwarning" (ie: when snapshotting a volume on a virtually-fillable thin pool).

Snapshots use space - with the hope that if you 'really' need that space,
you either add the space to your system - or you drop the snapshots.

Still the same logic applies....

>> !SNAPSHOTS ARE NOT BACKUPS!
>>
>> This is the key problem with your thinking here (unfortunately you are
>> not 'alone' with this thinking)
>
> Snapshot are not backups, as they do not protect from hardware problems (and
> denying that would be lame); however, they are an invaluable *part* of a
> successfull backup strategy. Having multiple rollaback target, even on the
> same machine, is a very usefull tool.

Backups primarily sit on completely different storage.

If you keep a backup of data in the same pool:

1.)
an error in a single chunk shared by all your backups + the origin means
total data loss - especially in the case where the filesystem uses 'B-trees'
and some 'root node' is lost - it can easily render your origin + all backups
completely useless.

2.)
problems in the thin-pool metadata can turn all your origin+backups into just
an unordered mess of chunks.


> Again, I don't understand by we are speaking about system crashes. On root
> *not* using thinp, I never saw a system crash due to full data pool. >
> Oh, and I use thinp on RHEL/CentOS only (Debian/Ubuntu backports are way too
> limited).

Yep - this case is known to be pretty stable.

But as said - with today's 'rush' of development and load of updates - users
do want to try a 'new distro upgrade' - if it works, all is fine - if it
doesn't, let's have a quick road back - so using a thin volume for the rootfs
is a pretty wanted use case.

The trouble is there are quite a lot of issues that are non-trivial to solve.

There are also some ongoing ideas/projects - one of them was to have thin
LVs with priority to be always fully provisioned - so such a thin LV could
never be the one to have unprovisioned chunks....
Another was a better integration of filesystems with 'provisioned' volumes.


Zdenek
Gionatan Danti
2018-03-01 07:14:14 UTC
Permalink
On 28-02-2018 22:43 Zdenek Kabelac wrote:
> On default - full pool starts to 'error' all 'writes' in 60 seconds.

Based on what I remember, and what you wrote below, I think "all writes"
in the context above means "writes to unallocated areas", right? Because
even full pool can write to already-provisioned areas.

> The main problem is - after reboot - this 'missing/unprovisioned'
> space may provide some old data...

Can you elaborate on this point? Are you referring to current behavior
or to an hypothetical "full read-only" mode?

> It still depends - there is always some sort of 'race' - unless you
> are willing to 'give-up' too early to be always sure, considering
> there are technologies that may write many GB/s...

Sure - this was the "more-or-less" part in my sentence.

> You can use rootfs with thinp - it's very fast for testing i.e.
> upgrades
> and quickly revert back - just there should be enough free space.

For testing, sure. However for a production machine I would rarely use
root on thinp. Maybe my reasoning is skewed by the fact that I mostly
work with virtual machines, so test/heavy upgrades are *not* done on the
host itself, rather on the guest VM.

>
> Depends on version of kernel and filesystem in use.
>
> Note RHEL/Centos kernel has lots of backport even when it's look quite
> old.

Sure, and this is one of the key reason why I use RHEL/CentOS rather
than Debian/Ubuntu.

> Backups primarily sits on completely different storage.
>
> If you keep backup of data in same pool:
>
> 1.)
> error on this in single chunk shared by all your backup + origin -
> means it's total data loss - especially in case where filesystem are
> using 'BTrees' and some 'root node' is lost - can easily render you
> origin + all backups completely useless.
>
> 2.)
> problems in thin-pool metadata can make all your origin+backups just
> an unordered mess of chunks.

True, but this does not disprove the main point: snapshots are an invaluable
tool in building your backup strategy. Obviously, if the thin-pool metadata
volume has a problem, then all volumes (snapshot or not) become invalid. Do
you have any recovery strategy in this case? For example, the ZFS uberblock
is written at *both* the device start and end. Does something similar exist
for thinp?

>
> There are also some on going ideas/projects - one of them was to have
> thinLVs with priority to be always fully provisioned - so such thinLV
> could never be the one to have unprovisioned chunks....
> Other was a better integration of filesystem with 'provisioned'
> volumes.

Interesting. Can you provide some more information on these projects?
Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2018-03-01 08:31:02 UTC
Permalink
On 1.3.2018 at 08:14, Gionatan Danti wrote:
> On 28-02-2018 22:43 Zdenek Kabelac wrote:
>> On default - full pool starts to 'error' all 'writes' in 60 seconds.
>
> Based on what I remember, and what you wrote below, I think "all writes" in
> the context above means "writes to unallocated areas", right? Because even
> full pool can write to already-provisioned areas.

yes

>
>> The main problem is - after reboot - this 'missing/unprovisioned'
>> space may provide some old data...
>
> Can you elaborate on this point? Are you referring to current behavior or to
> an hypothetical "full read-only" mode?

If the tool wanted to write 1 sector to a 256K chunk that needed
provisioning, and provisioning was not possible - after a reboot - you will
still see the 'old' content.

In the case of a filesystem that does not stop upon the 1st failing write,
you can then see a potential problem, since the fs could issue writes where
half of them were possibly written and the other half errored - then you
reboot, and that 'errored' half is actually returning 'some old data', which
can make the filesystem seriously confused...
Fortunately both ext4 & xfs now have correct logic here for journaling,
although IMHO it is still not optimal.

> True, but this not disprove the main point: snapshots are a invaluable tool in
> building your backup strategy. Obviously, if thin-pool meta volume has a
> problem, than all volumes (snapshot or not) become invalid. Do you have any
> recovery strategy in this case? For example, the root ZFS uberblock is written
> on *both* device start and end. Does something similar exists for thinp?

Unfortunately, losing the root blocks of the thin-pool metadata is a big
problem. That's why the metadata should rather be on some resilient, fast
storage. The write logic should not let data become corrupted (barring a
broken kernel).

But yes - there is quite some room for improvement in the thin_repair tool....

>> There are also some on going ideas/projects - one of them was to have
>> thinLVs with priority to be always fully provisioned - so such thinLV
>> could never be the one to have unprovisioned chunks....
>> Other was a better integration of filesystem with 'provisioned' volumes.
>
> Interesting. Can you provide some more information on these projects?

Likely by watching Joe's pages (the main thin-pool creator) and whatever the
XFS group is working on....

Also note - we are going to integrate VDO support - which will be a 2nd way
of doing thin provisioning, with a different set of features - missing
snapshots, but having compression & deduplication....

Regards

Zdenek
Gionatan Danti
2018-03-01 09:52:10 UTC
Permalink
On 01/03/2018 09:31, Zdenek Kabelac wrote:
> If the tool wanted to write  1sector  to 256K chunk that needed
> provisioning,
> and provisioning was not possible - after reboot - you will still see
> the 'old' content. >
> In case of filesystem, that does not stop upon 1st. failing write you
> then can see a potential problem since  fs could issue writes - where
> halve of them
> were possibly written and other halve was errored - then you reboot,
> and that 'error' halve is actually returning 'some old data' and this
> can make filesystem seriously confused...
> Fortunately both ext4 & xfs both have now correct logic here for
> journaling,
> although IMHO still not optimal.

Ah ok, we are speaking about the current "can write to allocated chunks only
when full" behavior. This is why I would greatly appreciate a "total
read-only mode" on a full pool.

Any insight on what ext4 and xfs changed to mitigate the problem? Even a
mailing list link would be very useful ;)

> Unfortunately losing root blocks on thin-pool metadata is a big problem.
> That's why metadata should be rather on some resilient fast storage.
> Logic of writing should not let data corrupt (% broken kernel).
>
> But yes - there is quite some room for improvement in thin_repair tool....

In the past, I fiddled with thin_dump to create backups of the metadata
device. Do you think it is a good idea? What somewhat scares me is that, for
thin_dump to work, the metadata device has to be manually put into "snapshot"
mode and, after the dump, it has to be unfrozen. What will happen if I forget
to unfreeze it?
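
For reference, the sequence I was using was roughly this (pool and device
names are only examples, and it needs the thin-provisioning-tools package):

    # freeze a consistent view of the pool metadata
    dmsetup message vg0-pool-tpool 0 reserve_metadata_snap

    # dump it from the live metadata device through that snapshot
    thin_dump --metadata-snap /dev/mapper/vg0-pool_tmeta > /root/pool-metadata.xml

    # release it again - this is the step I am worried about forgetting
    dmsetup message vg0-pool-tpool 0 release_metadata_snap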

> Likely watching Joe's pages (main thin-pool creator) and whatever XFS
> groups is working on....

Again, do you have any links for quick sharing?

> Also note - we are going to integrate VDO support - which will be a 2nd.
> way for thin-provisioning with different set of features - missing
> snapshots, but having compression & deduplication....

I thought compression, deduplication, send/receive, etc. were being worked
on in the framework of Stratis. What do you mean by "VDO support"?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2018-03-01 11:23:44 UTC
Permalink
On 1.3.2018 at 10:52, Gionatan Danti wrote:
> On 01/03/2018 09:31, Zdenek Kabelac wrote:
>> If the tool wanted to write  1sector  to 256K chunk that needed provisioning,
>> and provisioning was not possible - after reboot - you will still see
>> the 'old' content. >
>> In case of filesystem, that does not stop upon 1st. failing write you then
>> can see a potential problem since  fs could issue writes - where halve of them
>> were possibly written and other halve was errored - then you reboot,
>> and that 'error' halve is actually returning 'some old data' and this can
>> make filesystem seriously confused...
>> Fortunately both ext4 & xfs both have now correct logic here for journaling,
>> although IMHO still not optimal.
>
> Ah ok, we are speaking about current "can write to allocated chunks only when
> full" behavior. This is why I would greatly appreciate a "total read only
> mode" on full pool.
>
> Any insight on what ext4 and xfs changed to mitigate the problem? Even a
> mailing list link would be very useful ;)

In general - for extX it's remount read-only upon error - which works for
journaled metadata - if you want the same protection for 'data' you need to
switch to the rather expensive data-journaling mode.

For XFS there is now similar logic where a write error on the journal stops
filesystem usage - look for some older messages (even here in this list);
it's been mentioned a few times already, I guess...
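
Concretely, the knobs I have in mind are roughly these (the XFS sysfs tree
appeared around kernel 4.7, and 'dm-3' is only an example device name):

    # ext4: go read-only on metadata errors; data needs the slower data=journal
    mount -o errors=remount-ro /dev/vg0/thinlv /mnt

    # XFS: react to ENOSPC from the device during metadata writeback
    # (0 = fail immediately instead of retrying forever)
    echo 0 > /sys/fs/xfs/dm-3/error/metadata/ENOSPC/max_retries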

>> Unfortunately losing root blocks on thin-pool metadata is a big problem.
>> That's why metadata should be rather on some resilient fast storage.
>> Logic of writing should not let data corrupt (% broken kernel).
>>
>> But yes - there is quite some room for improvement in thin_repair tool....
>
> In the past, I fiddled with thin_dump to create backups of the metadata
> device. Do you think it is a good idea? What somewhat scares me is that, for

Depends on the use case - if you take snapshots of your thin volume, this
likely will not help you with recovery at all.

If your thin volumes are rather standalone, only occasionally modified,
'growing' fs images (so no trimming ;)) - then with this metadata backup
there is some small chance you would be able to obtain some 'usable'
mappings of chunks to the block device layout...

Personally, I'd not recommend using this at all unless you know the rather
low-level details of how this whole thing works....

> thind_dump to work, the metadata device should be manually put in "snapshot"
> mode and, after the dump, it had to be unfreezed. What will happen if I forget
> to unfreeze it?

A filesystem that has not been unfrozen is simply not usable...

>> Likely watching Joe's pages (main thin-pool creator) and whatever XFS groups
>> is working on....
>
> Again, do you have any links for quick sharing?

https://github.com/jthornber

>> Also note - we are going to integrate VDO support - which will be a 2nd. way
>> for thin-provisioning with different set of features - missing snapshots,
>> but having compression & deduplication....
>
> I thought compression, deduplication, send/receive, etc. where worked on the
> framework of stratis. What do you mean with "VDO support"?

Clearly Stratis is not a topic for lvm2 at all ;) that's all I'm going to say
about this....

Regards

Zdenek
Gionatan Danti
2018-03-01 12:48:09 UTC
Permalink
On 01/03/2018 12:23, Zdenek Kabelac wrote:
> In general - for extX  it's remount read-only upon error - which works
> for journaled metadata - if you want same protection for 'data' you need
> to switch to rather expensive data journaling mode.
>
> For XFS there is now similar logic where write error on journal stops
> filesystem usage - look far some older message (even here in this list)
> it's been mentioned already few times I guess...

Yes, we discussed the issue here. If I recall correctly, the XFS journal is
a circular buffer which is always written to already-allocated chunks. From
my tests (June 2017) it was clear that failing async writes, even with
errorwhenfull=y, did not always trigger a prompt XFS shutdown (but the
filesystem eventually shut down after some more writes/minutes).

> Depends on use-case - if you take snapshots of your thin volume, this
> likely has will not help you with recovery at all.
>
> If your thin-volumes are rather standalone only occasionally modified
> 'growing' fs  images (so no trimming ;)) - then with this metadata
> backup there can be some small chance you would be able to obtain some
> 'usable' mappings of chunks to block device layout...
>
> Personally I'd not recommend to use this at all unless you know rather
> low-level details how this whole thing works....

Ok, I realized that and stopped using it for anything but testing.

> Unfreezed filesystem is simply not usable...

I was speaking about an unreleased thin metadata snapshot - ie: a
reserve_metadata_snap *without* a corresponding release_metadata_snap.
Will that cause problems?


> Clearly  Startis is not a topic for lvm2 at all ;) that's all I'm going
> to say about this....

OK :p

I think VDO is a fruit of the Permabit acquisition, right? As it implements
its own thin provisioning, will thinlvm migrate to VDO or will it continue to
use the current dm target?


--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2018-03-01 16:00:17 UTC
Permalink
On 1.3.2018 at 13:48, Gionatan Danti wrote:
>
> On 01/03/2018 12:23, Zdenek Kabelac wrote:
>> In general - for extX  it's remount read-only upon error - which works for
>> journaled metadata - if you want same protection for 'data' you need to
>> switch to rather expensive data journaling mode.
>>
>> For XFS there is now similar logic where write error on journal stops
>> filesystem usage - look far some older message (even here in this list) it's
>> been mentioned already few times I guess...
>

There is quite 'detailed' config for XFS - just not all settings
are probably tuned in the best way for provisioning.

See:


https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/xfs-error-behavior


>
>> Unfreezed filesystem is simply not usable...
>
> I was speaking about unfreezed thin metadata snapshot - ie:
> reserve_metadata_snap *without* a corresponding release_metadata_snap. Will
> that cause problems?
>

A metadata snapshot 'just consumes' thin-pool metadata space;
at any time there can be only 1 such snapshot - so before the next usage
you have to drop the existing one.

So IMHO it should have no other effects unless you hit some bugs...

> I think VDO is a fruit of Permabit acquisition, right? As it implements it's
> own thin provisioning, will thinlvm migrate to VDO or it will continue to use
> the current dmtarget?


The thin-pool target has different goals than VDO,
so both targets will likely live together.

Possibly thin-pool might be tested on top of a VDO data volume, if that makes
any sense...

Regards

Zdenek
Gionatan Danti
2018-03-01 16:26:29 UTC
Permalink
On 01/03/2018 17:00, Zdenek Kabelac wrote:
> metadata snapshot 'just consumes' thin-pool metadata space,
> at any time there can be only 1 snapshot - so before next usage
> you have to drop the existing one.
>
> So IMHO it should have no other effects unless you hit some bugs...

Mmm... does it mean that a not-released metadata snapshot will lead to
increased metadata volume usage (possibly filling it faster)?

> thin-pool  target is having different goals then VDO
> so both targets will likely live together.
>
> Possibly thin-pool might be tested for using VDO data volume if it makes
> any sense...

Great. Thank you for the very informative discussion.
Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Gianluca Cecchi
2018-03-01 09:43:02 UTC
Permalink
On Thu, Mar 1, 2018 at 9:31 AM, Zdenek Kabelac <***@redhat.com> wrote:

>
>
> Also note - we are going to integrate VDO support - which will be a 2nd.
> way for thin-provisioning with different set of features - missing
> snapshots, but having compression & deduplication....
>
> Regards
>
> Zdenek
>
>
Interesting.
I would have expected to find it already upstream, e.g. inside Fedora 27, to
begin trying it, but it seems it is not there.
I found this for the upcoming RHEL 7.5:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/vdo

but nothing in updates-testing for f27 either.
Is the one below the only and correct source to test on Fedora:
https://github.com/dm-vdo/vdo
?

Thanks,
Gianluca
Zdenek Kabelac
2018-03-01 11:10:14 UTC
Permalink
On 1.3.2018 at 10:43, Gianluca Cecchi wrote:
> On Thu, Mar 1, 2018 at 9:31 AM, Zdenek Kabelac <***@redhat.com
> <mailto:***@redhat.com>> wrote:
>
>
>
> Also note - we are going to integrate VDO support - which will be a 2nd.
> way for thin-provisioning with different set of features - missing
> snapshots, but having compression & deduplication....
>
> Regards
>
> Zdenek
>
>
> Interesting.
> I would have expected to find it already upstream, eg inside Fedora 27 to
> begin to try, but it seems not here.
> I found this for upcoming RH EL 7.5:
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/vdo
>
> but nothing neither in updates-testing for f27
> Is this one below the only and correct source to test on Fedora:
> https://github.com/dm-vdo/vdo
> ?
>

There is a COPR repository available ATM for certain f27 kernels.

For a regular Fedora component, the VDO target needs to go into the upstream
kernel first - but this needs some code changes in the module - so stay
tuned....

Note - the current model is 'standalone' usage of VDO devices - while we do
plan to integrate support for VDO as another segtype.

Regards

Zdenek
Xen
2018-03-03 18:32:25 UTC
Permalink
Zdenek Kabelac wrote on 28-02-2018 22:43:

> It still depends - there is always some sort of 'race' - unless you
> are willing to 'give-up' too early to be always sure, considering
> there are technologies that may write many GB/s...

That's why I think it is only possible for snapshots.

> You can use rootfs with thinp - it's very fast for testing i.e.
> upgrades
> and quickly revert back - just there should be enough free space.

That's also possible with non-thin.

> Snapshot are using space - with hope that if you will 'really' need
> that space
> you either add this space to you system - or you drop snapshots.

And I was saying back then that it would be quite easy to have a script
that would drop bigger snapshots first (of larger volumes) given that
those are most likely less important and more likely to prevent thin
pool fillup, and you can save more smaller snapshots this way.

So basically I mean this gives your snapshots a "quotum" that I was
asking about.

Lol now I remember.

You could easily give (by script) every snapshot a quotum of 20% of the full
volume size; then, when the thin pool reaches 90%, you start dropping the
volumes with the largest quotum first, or something.

Idk, something more meaningful than that, but you get the idea.

You can calculate the "own" blocks of each snapshot and, when the pool is
nearly full, check for snapshots that have surpassed their quotum, and drop
first the ones that are past their quotas by the largest amounts.
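
Just to make it concrete, a dumb version of that (skipping the "own blocks"
refinement, and assuming the snapshots carry a "snap" tag as suggested
earlier; vg0 is only an example) could be:

    #!/bin/sh
    # naive "largest tagged snapshot first" cleanup
    lvs --noheadings --units b --nosuffix -o lv_size,lv_name,lv_tags vg0 |
      awk '$3 ~ /(^|,)snap(,|$)/ {print $1, $2}' |   # keep only tagged snapshots
      sort -rn | head -1 |                           # biggest one first
      while read size name; do
          lvremove -y "vg0/$name"
      done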

> But as said - with today 'rush' of development and load of updates -
> user do want to try 'new disto upgrade' - if it works - all is fine -
> if it doesn't let's have a quick road back - so using thin volume for
> rootfs is pretty wanted case.

But again, a regular snapshot of sufficient size does the same thing; you
just have to allocate for it in advance, but for root this is not really a
problem.

Then there is no more issue with the thin-full problem.

I agree it is less convenient, and a slight bit slower, but not by much for
this special use case.

> There are also some on going ideas/projects - one of them was to have
> thinLVs with priority to be always fully provisioned - so such thinLV
> could never be the one to have unprovisioned chunks....

That's what ZFS does... ;-).

> Other was a better integration of filesystem with 'provisioned'
> volumes.

That's what I was talking about back then...............
Zdenek Kabelac
2018-03-04 20:34:54 UTC
Permalink
On 3.3.2018 at 19:32, Xen wrote:
> Zdenek Kabelac wrote on 28-02-2018 22:43:
>
>> It still depends - there is always some sort of 'race' - unless you
>> are willing to 'give-up' too early to be always sure, considering
>> there are technologies that may write many GB/s...
>
> That's why I think it is only possible for snapshots.
>
>> You can use rootfs with thinp - it's very fast for testing i.e. upgrades
>> and quickly revert back - just there should be enough free space.
>
> That's also possible with non-thin.
>
>> Snapshot are using space - with hope that if you will 'really' need that space
>> you either add this space to you system - or you drop snapshots.
>
> And I was saying back then that it would be quite easy to have a script that
> would drop bigger snapshots first (of larger volumes) given that those are
> most likely less important and more likely to prevent thin pool fillup, and
> you can save more smaller snapshots this way.
>
> So basically I mean this gives your snapshots a "quotum" that I was asking about.
>
> Lol now I remember.
>
> You could easily give (by script) every snapshot a quotum of 20% of full
> volume size, then when 90% thin target is reached, you start dropping volumes
> with the largest quotum first, or something.
>
> Idk, something more meaningful than that, but you get the idea.
>
> You can calculate the "own" blocks of the snapshot and when the pool is full
> you check for snapshots that have surpassed their quotum, and the ones that
> are past their quotas in the largest numbers you drop first.

I hope it's finally getting through to you that all your wishes CAN be
implemented. It's up to you to decide what kind of reaction happens and when.

It's really only up to 'you' to use all the available tooling to build your
own 'dreamed-of' setup; lvm2 & the kernel target provide the tooling.

If, however, you hope lvm2 will ship a 'script' perfectly tuned for Xen's
system, it's up to you to write and send a patch...

>
>> But as said - with today 'rush' of development and load of updates -
>> user do want to try 'new disto upgrade' - if it works - all is fine -
>> if it doesn't let's have a quick road back -  so using thin volume for
>> rootfs is pretty wanted case.
>
> But again, regular snapshot of sufficient size does the same thing, you just
> have to allocate for it in advance, but for root this is not really a problem.
>
> Then no more issue with thin-full problem.
>
> I agree, less convenient, and a slight bit slower, but not by much for this
> special use case.

I've no idea what you mean by this...

>> There are also some on going ideas/projects - one of them was to have
>> thinLVs with priority to be always fully provisioned - so such thinLV
>> could never be the one to have unprovisioned chunks....
>
> That's what ZFS does... ;-).

ZFS is a 'single' filesystem.

The thin-pool is a multi-volume target.

It's approximately as if you placed your XFS/ext4 rootfs on a ZFS ZVOL
device - if you can provide an example where such a system works more stably
& better & faster than a thin-pool, it's a clear bug in thin-pool - and you
should open a bugzilla for it.

Regards

Zdenek
Xen
2018-03-03 18:17:11 UTC
Permalink
Gionatan Danti wrote on 28-02-2018 20:07:

> To recap (Zdeneck, correct me if I am wrong): the main problem is
> that, on a full pool, async writes will more-or-less silenty fail
> (with errors shown on dmesg, but nothing more).

Yes I know you were writing about that in the later emails.

> Another possible cause
> of problem is that, even on a full pool, *some* writes will complete
> correctly (the one on already allocated chunks).

Idem.

> In the past was argued that putting the entire pool in read-only mode
> (where *all* writes fail, but read are permitted to complete) would be
> a better fail-safe mechanism; however, it was stated that no current
> dmtarget permit that.

Right. Don't forget my main problem was system hangs due to older
kernels, not the stuff you write about now.

> Two (good) solution where given, both relying on scripting (see
> "thin_command" option on lvm.conf):
> - fsfreeze on a nearly full pool (ie: >=98%);
> - replace the dmthinp target with the error target (using dmsetup).
>
> I really think that with the good scripting infrastructure currently
> built in lvm this is a more-or-less solved problem.

I agree in practical terms. Doesn't make for good target design, but
it's good enough, I guess.

>> Do NOT take thin snapshot of your root filesystem so you will avoid
>> thin-pool overprovisioning problem.
>
> But is someone *really* pushing thinp for root filesystem? I always
> used it for data partition only... Sure, rollback capability on root
> is nice, but it is on data which they are *really* important.

No, Zdenek thought my system hangs resulted from something else and then
in order to defend against that (being the fault of current DM design)
he tried to raise the ante by claiming that root-on-thin would cause
system failure anyway with a full pool.

I never suggested root on thin.

> In stress testing, I never saw a system crash on a full thin pool

That's good to know, I was just using Jessie and Xenial.

> We discussed that in the past also, but as snapshot volumes really are
> *regular*, writable volumes (which a 'k' flag to skip activation by
> default), the LVM team take the "safe" stance to not automatically
> drop any volume.

Sure I guess any application logic would have to be programmed outside
of any (device mapper module) anyway.

> The solution is to use scripting/thin_command with lvm tags. For
> example:
> - tag all snapshot with a "snap" tag;
> - when usage is dangerously high, drop all volumes with "snap" tag.

Yes, now I remember.

I was envisioning some other tag that would allow a quotum to be set for
every volume (for example as a %); the script would then drop the volumes
with the larger quotas first (thus the larger snapshots), so as to protect
the smaller volumes, which are probably more important, and you can save more
of them. I am ashamed to admit I had forgotten about that completely ;-).

>> Back to rule #1 - thin-p is about 'delaying' deliverance of real
>> space.
>> If you already have plan to never deliver promised space - you need to
>> live with consequences....
>
> I am not sure to 100% agree on that.

When Zdenek says "thin-p" he might mean "thin-pool" but not generally
"thin-provisioning".

I mean to say that the very special use case of an always auto-expanding
system is a special use case of thin provisioning in general.

And I would agree, of course, that the other uses are also legit.

> Thinp is not only about
> "delaying" space provisioning; it clearly is also (mostly?) about
> fast, modern, usable snapshots. Docker, snapper, stratis, etc. all use
> thinp mainly for its fast, efficent snapshot capability.

Thank you for bringing that in.

> Denying that
> is not so useful and led to "overwarning" (ie: when snapshotting a
> volume on a virtually-fillable thin pool).

Aye.

>> !SNAPSHOTS ARE NOT BACKUPS!
>
> Snapshot are not backups, as they do not protect from hardware
> problems (and denying that would be lame)

I was really saying that I was using them to run backups off of.

> however, they are an
> invaluable *part* of a successfull backup strategy. Having multiple
> rollaback target, even on the same machine, is a very usefull tool.

Even more you can backup running systems, but I thought that would be
obvious.

> Again, I don't understand by we are speaking about system crashes. On
> root *not* using thinp, I never saw a system crash due to full data
> pool.

I had it on 3.18 and 4.4, that's all.

> Oh, and I use thinp on RHEL/CentOS only (Debian/Ubuntu backports are
> way too limited).

That could be it too.
Zdenek Kabelac
2018-03-04 20:53:17 UTC
Permalink
On 3.3.2018 at 19:17, Xen wrote:

>> In the past was argued that putting the entire pool in read-only mode
>> (where *all* writes fail, but read are permitted to complete) would be
>> a better fail-safe mechanism; however, it was stated that no current
>> dmtarget permit that.
>
> Right. Don't forget my main problem was system hangs due to older kernels, not
> the stuff you write about now.
>
>> Two (good) solution where given, both relying on scripting (see
>> "thin_command" option on lvm.conf):
>> - fsfreeze on a nearly full pool (ie: >=98%);
>> - replace the dmthinp target with the error target (using dmsetup).
>>
>> I really think that with the good scripting infrastructure currently
>> built in lvm this is a more-or-less solved problem.
>
> I agree in practical terms. Doesn't make for good target design, but it's good
> enough, I guess.

Sometimes you have to settle for a good compromise.

There are various limitations coming from the way the Linux kernel works.

You probably still have the 'vision' that the block device KNOWS where a
block comes from, i.e. you probably think the thin device is aware that a
block is some 'write' from 'gimp' made by user 'adam'. The plain fact is
that the block layer only knows that some 'pages' of some sizes need to be
written at some location on the device - and that's all.

On the other hand, all common filesystems in Linux were always written to
work on a device where the space is simply always there. So the core
algorithms simply never counted on something like 'thin provisioning' - this
is almost 'fine', since thin provisioning should be almost invisible - but
the problem starts to become visible under over-provisioned conditions.

Unfortunately, the majority of filesystems never really tested well all those
'weird' conditions which are suddenly easy to trigger with a thin-pool but
which likely almost never happen on a real hdd....

So as said - the situation gets better all the time; bugs are fixed as soon
as the problematic pattern/use case is discovered - that's why it's really
important that users open bugzillas and report their problems with a detailed
description of how to hit them - this really DOES help a lot.

On the other hand, it's really hard to do something for users who are just
saying 'goodbye to LVM'....


>> But is someone *really* pushing thinp for root filesystem? I always
>> used it for data partition only... Sure, rollback capability on root
>> is nice, but it is on data which they are *really* important.
>
> No, Zdenek thought my system hangs resulted from something else and then in
> order to defend against that (being the fault of current DM design) he tried
> to raise the ante by claiming that root-on-thin would cause system failure
> anyway with a full pool.

Yes - this is still true.
It's how the core logic of the Linux kernel and page caching works.

And that's why it's important to take action *BEFORE*, rather than trying to
solve the case *AFTER* and hoping the deadlock will not happen...


> I was envisioning some other tag that would allow a quotum to be set for every
> volume (for example as a %) and the script would then drop the volumes with
> the larger quotas first (thus the larger snapshots) so as to protect smaller
> volumes which are probably more important and you can save more of them. I am
> ashared to admit I had forgotten about that completely ;-).

Every user has quite different logic in mind - so really - we do provide the
tooling and the user has to choose what fits best...

Regards

Zdenek
Gionatan Danti
2018-03-05 09:42:26 UTC
Permalink
On 04-03-2018 21:53 Zdenek Kabelac wrote:
> On the other hand all common filesystem in linux were always written
> to work on a device where the space is simply always there. So all
> core algorithms simple never counted with something like
> 'thin-provisioning' - this is almost 'fine' since thin-provisioning
> should be almost invisible - but the problem starts to be visible on
> this over-provisioned conditions.
>
> Unfortunately majority of filesystem never really tested well all
> those 'weird' conditions which are suddenly easy to trigger with
> thin-pool, but likely almost never happens on real hdd....

Hi Zdenek, I'm a little confused by that statement.
Sure, it is 100% true for EXT3/4-based filesystems; however, asking on the
XFS mailing list about that, I got the definitive answer that XFS was
adapted to cope well with thin provisioning ages ago. Is that the case?

Anyway, a more direct question: what prevented the device mapper team from
implementing a full-read-only/fail-all-writes target? I feel that *many*
filesystem problems would be bypassed with full-read-only pools... Am I
wrong?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2018-03-05 10:18:01 UTC
Permalink
On 5.3.2018 at 10:42, Gionatan Danti wrote:
> On 04-03-2018 21:53 Zdenek Kabelac wrote:
>> On the other hand all common filesystem in linux were always written
>> to work on a device where the space is simply always there. So all
>> core algorithms simple never counted with something like
>> 'thin-provisioning' - this is almost 'fine' since thin-provisioning
>> should be almost invisible - but the problem starts to be visible on
>> this over-provisioned conditions.
>>
>> Unfortunately majority of filesystem never really tested well all
>> those 'weird' conditions which are suddenly easy to trigger with
>> thin-pool, but likely almost never happens on real hdd....
>
> Hi Zdenek, I'm a little confused by that statement.
> Sure, it is 100% true for EXT3/4-based filesystem; however, asking on XFS
> mailing list about that, I get the definive answer that XFS was adapted to
> cope well with thin provisioning ages ago. Is it the case?

Yes - it has been updated/improved/fixed - and I've already given you a link
to where you can configure the behavior of XFS when e.g. the device reports
ENOSPC to the filesystem.

What needs to be understood here is that filesystems were not originally
designed to ever see such kinds of errors - when you created a filesystem in
the past, the space was simply meant to be there all the time.

> Anyway, a more direct question: what prevented the device mapper team to
> implement a full-read-only/fail-all-writes target? I feel that *many*
> filesystem problems should be bypassed with full-read-only pools... Am I wrong?

Well, complexity - it might look 'easy' to do at first sight, but in reality
it would impact all the hot/fast paths with a number of checks, and it would
have a rather dramatic performance impact.

The second point is that, while for lots of filesystems it might look like
the best thing, it's not always true - there are cases where it's more
desirable to still have a working device with 'several' failing pieces in
it...

And the 3rd point is that it's unclear, from the kernel's POV, where this
'full pool' moment actually happens - i.e. imagine a 'write' operation
running on one thin device and a 'trim/discard' operation running on a 2nd
device.

So it's been left to user-space to solve the case in the best way -
i.e. user-space can initiate an 'fstrim' itself when the full-pool case
happens, or get the space back in a number of other ways...

Regards

Zdenek
Gionatan Danti
2018-03-05 14:27:09 UTC
Permalink
On 05/03/2018 11:18, Zdenek Kabelac wrote:
> Yes - it has been updated/improved/fixed - and I've already given you a
> link where you can configure the behavior of XFS when i.e. device
> reports  ENOSPC to the filesystem.

Sure - I already studied it months ago during my testing. I simply was under
the impression that the dm & xfs teams had different points of view regarding
the actual status. I'm happy to know that isn't the case :)

> Well complexity - it might look 'easy' to do on the first sight, but in
> reality it's impacting all hot/fast paths with number of checks and it
> would have rather dramatic performance impact.
>
> The other case is, while for lots of filesystems it might look like best
> thing - it's not always true - so there are case where it's more desired
> to have still working device with 'several' failing piece in it...
>
> And 3rd moment is - it's unclear from kernel POV - where this 'full'
> pool moment actually happens - i.e. imagine running  'write' operation
> on one thin device and 'trim/discard' operation running on 2nd. device.
>
> So it's been left on user-space to solve the case the best way -
> i.e. user-space can initiate  'fstrim' itself when full pool case
> happens or get the space by number of other ways...

Ok, I see.
Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Xen
2018-03-03 17:52:22 UTC
Permalink
I did not rewrite this entire message, please excuse the parts where I
am a little more "on the attack".



Zdenek Kabelac wrote on 28-02-2018 10:26:

> I'll probably repeat my self again, but thin provision can't be
> responsible for all kernel failures. There is no way DM team can fix
> all the related paths on this road.

Are you saying there are kernel bugs presently?

> If you don't plan to help resolving those issue - there is not point
> in complaining over and over again - we are already well aware of this
> issues...

I'm not aware of any issues, what are they?

I was responding here to an earlier thread I couldn't respond to back then;
the topic was whether it was possible to limit thin snapshot sizes, you said
it wasn't, and I was just recapping that thread.

> If the admin can't stand failing system, he can't use thin-p.

That just sounds like a blanket excuse for any kind of failure.

> Overprovisioning on DEVICE level simply IS NOT equivalent to full
> filesystem like you would like to see all the time here and you've
> been already many times explained that filesystems are simply not
> there ready - fixes are on going but it will take its time and it's
> really pointless to exercise this on 2-3 year old kernels...

Pardon me, but your position has typically been that it is fundamentally
impossible, not that "we're not there yet".

My questions have always been about fundamental possibilities, to which
you always answer in the negative.

If something is fundamentally impossible, don't be surprised if you then
don't get any help in getting there: you always close off all paths
leading towards it.

You shut off any interest, any discussion, and any development effort along
paths about which, a long time later, you then say "we're working on it",
whereas before you always said "it's impossible".

This happened before where first you say "It's not a problem, it's admin
error" and then a year later you say "Oh yeah, it's fixed now".

Which is it?

My interest has always been, at least philosophically, or concerning
principle abilities, in development and design, but you shut it off
saying it's impossible.

Now you complain you are not getting any help.

> Thin provisioning has it's use case and it expects admin is well aware
> of possible problems.

That's a blanket statement once more that says nothing about actual
possibilities or impossibilities.

> If you are aiming for a magic box working always right - stay away
> from thin-p - the best advice....

Another blanket statement excusing any and all mistakes or errors or
failures the system could ever have.

> Do NOT take thin snapshot of your root filesystem so you will avoid
> thin-pool overprovisioning problem.

Zdenek, could you please make up your mind?

You brought up thin snapshotting as a reason for putting root on thin,
as a way of saying that thin failure would lead to system failure and
not just application failure,

whereas I maintained that application failure was acceptable.

I tried to make the distinction between application level failure (due
to filesystem errors) and system instability caused by thin.

You then tried to make those equivalent by saying that you can also put
root on thin, in which case application failure becomes system failure.

I never wanted root on thin, so don't tell me not to snapshot it, that
was your idea.


> Rule #1:
>
> Thin-pool was never targeted for 'regular' usage of full thin-pool.

All you are asked is to design for error conditions.

You want only to take care of the special use case where nothing bad
happens.

Why not just take care of the general use case where bad things can
happen?

You know, real life?

In any development process you first don't take care of all error
conditions, you just can't be bothered with them yet. Eventually, you
do.

It seems you are trying to avoid dealing with the glaring error conditions
that have always existed, while also avoiding any responsibility for them by
saying that they were not part of the design.

To make this clearer, Zdenek: your implementation does not cater to
the general use case of thin provisioning, but only to the special use
case where full thin pools never happen.

That's a glaring omission in any design. You can go on and on about how
thin-p was not "targeted" at that "use case", but that's like saying
you built a car engine that was not "targeted" at "running out of
fuel".

Then when the engine breaks down you say it's the user's fault.

Maybe retarget your design?

Running out of fuel is not a use case.

It's a failure condition that you have to design for.

> Full thin-pool is serious ERROR condition with bad/ill effects on
> systems.

Yes and your job as a systems designer is to design for those error
conditions and make sure they are handled gracefully.

You just default on your responsibility there.

The reason you brought up root on thin was to elevate application
failure to the level of system failure so as to make them equivalent and
then to say that you can't do anything about system failure.

This is a false analogy: we only care about application failure in the
general use case of things that are allowed to happen, and we
distinguish that from system failure, which is not allowed to happen.

Yes, Zdenek, system failure is your responsibility as the designer; it's
not the admin's job, except when he has a car that breaks down when the
fuel runs out.

But that, clearly, would be seen as a failure on the part of the one who
designed the engine.

You are responding so defensively when I have barely said anything that
it is clear you feel extremely guilty about this.


> Thin-pool was designed to 'delay/postpone' real space usage - aka you
> can use more 'virtual' space with the promise you deliver real storage
> later.

That doesn't cover the full spectrum of what we consider to be "thin
provisioning".

You only designed for a very special use case in which the fuel never
runs out.

The aeroplane that doesn't need landing gear because it was designed to
never run out of fuel.

In the kingdom of birds, only swallows do that. Most other birds are
designed for landing and taking off too.

You built a system that only works if certain conditions are met.

I'm just saying you could expand your design and cover the error
conditions as well.

So yes: I hear you, you didn't design for the error condition.

That's all I've been saying.

> So if you have different goals - like having some kind of full
> equivalency logic to full filesystem - you need to write different
> target....

Maybe I could, but I still question why it was not designed into
thin-p, and I also doubt that it couldn't be redesigned into it.

I mean, I doubt that it would require a huge rewrite; I think that if I
were to do that, I could start from thin-p just fine.

Certainly interesting, and a worthy goal.

There are a billion fun projects I would like to take on, but generally
I am just inquiring; sometimes I am angry about stuff not working.

But when I ask "could this be possible?", you can assume I am talking
from a development perspective, and you don't have to constantly defend
the fact that the thing doesn't currently exist.

Sometimes I am just asking about possibilities.

"Yes, it doesn't exist, and it would require that and that and that"
would be a sufficient answer.

I don't always need to hear all of the excuses as to why it isn't so;
sometimes I just wonder how it could be done.



>> I simply cannot reconcile an attitude that thin-full-risk is
>> acceptable and the admin's job while at the same time advocating it
>> for root filesystems.
>
> Do NOT use thin-provisioning - as it's not meeting your requirements.

It was your suggestion to use thin for root as a way of artificially
increasing those requirements and then saying that they can't be met.

> Big news - we are at ~4.16 kernel upstream - so no one is really
> taking much care about 4.4 troubles here - sorry about that....

I said "back then".

You don't really listen, do you...

> Speaking of 4.4 - I'd generally advise jumping to a higher kernel
> version ASAP - since 4.4 has some known bad behavior in the case the
> thin-pool 'metadata' gets overfilled.

I never said that I was using 4.4; if you took care to read, you would
see that I was speaking about the past.

Xenial is at 4.13 right now.

> There is an ongoing 'BOOM' project - check it out please....

Okay...

> There is not much point in commenting on support for some old distros
> other than that you really should try harder with your distro
> maintainers....

I was just explaining why I was experiencing hangs and you didn't know
what I was talking about, causing some slight confusion in our threads.

>> That's a lot easier if your root filesystem doesn't lock up.
>
> - this is not really a fault of dm thin-provisioning kernel part.

I was saying, Zdenek, that your suggestion to use root on thin was
rather unwise.

I don't know what you're defending against, I never said anything other
than that.

> - ongoing fixes to file systems are being pushed upstream (for years).
> - fixes will not appear in years-old kernels, as such patches are
> usually invasive, so unless you pay someone to do the backporting
> job, the easiest way forward is to use a newer, improved kernel..

I understand that you are mixing up my system hangs with the above
problems you would have by using root on a full thin pool; I have
already accepted that the system hangs are fixed in later kernels.

> ATM thin-pool can't deliver equivalent logic - just like old-snaps
> can't deliver thin-pool logic.

Sure, but my question was never about "ATM".

I asked about potential, not status quo.

Please, if you keep responding to development inquiries with status quo
answers, you will never find any help in getting there.

The "what is" and the "what is to be" don't have to be the same, but you
are always responding defensively as to the "what is", not understanding
the questions.

Those system hangs, sure, status quo. Those snapshots? Development
interest.

>> However, I don't have the space for a full copy of every filesystem,
>> so if I snapshot, I will automatically overprovision.
>
> Back to rule #1 - thin-p is about 'delaying' delivery of real space.
> If you already have a plan to never deliver the promised space - you
> need to live with the consequences....

Like I said, I was just INQUIRING about the possibility of limiting the
size of a thin snapshot.
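
Purely to illustrate the overprovisioning point quoted above (the volume
group, LV names and sizes below are made up), a single thin volume plus
one snapshot is already enough to overcommit a pool:

  lvcreate -L 100G -T vg/pool            # thin pool with 100G of real space
  lvcreate -V 80G  -T vg/pool -n data    # thin LV with 80G of virtual space
  lvcreate -s vg/data -n data-snap       # thin snapshot: another 80G virtual

That is 160G of virtual space against 100G of real space, so the pool can
run full as soon as the unique data of origin plus snapshot exceeds 100G.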

The fact that you respond so defensively with respect to thin pools
overflowing means you feel, and are, guilty about not taking care of
that situation.

I was inquiring about a way to prevent thin pool overflow.

If you then suggest that the only valid use case is to have some
auto-expanding pool, then either you are not content with just giving
the answer to that question, or you feel it's your fault that something
isn't possible and you try to avoid that by putting the blame on the
user for "using it in a wrong way".

I asked a technical question. You respond like a guy who is asked why he
didn't clean the bathroom according to schedule.

Easy now, I just asked whether it was possible or not.

I didn't ask you to explain why it hasn't been done.

Or where to put the blame for that.

I would say you feel rather guilty, and to every insinuation that there
is a missing feature you respond with great noise about why the feature
isn't actually missing.

So if I say "Is this possible?" you respond with "YOU ARE USING IT THE
WRONG WAY" as if to feel rather uneasy to say that something isn't
possible.

Which again, leads, of course, to bad design.

Your uneasiness Zdenek is the biggest signpost here.

Sorry to be so liberal here.


>> My snapshots are indeed meant for backups (of data volumes) -- not
>> for rollback -- and for rollback, but only for the root filesystem.
>
> There is more fundamental problem here:
>
> !SNAPSHOTS ARE NOT BACKUPS!

Can you please stop screaming?

Do I have to spell out that I use the snapshot to make the backup and
then discard the snapshot?
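
To spell it out anyway, as a minimal sketch (assuming the data lives on a
thin LV called vg/data; names and paths are illustrative only):

  lvcreate -s vg/data -n data-backup                # thin snapshot of the origin
  lvchange -ay -K vg/data-backup                    # -K: thin snapshots skip activation by default
  mkdir -p /mnt/snap
  mount -o ro,nouuid /dev/vg/data-backup /mnt/snap  # nouuid matters for XFS clones
  rsync -a /mnt/snap/ /backup/data/                 # copy the point-in-time state out
  umount /mnt/snap
  lvremove -y vg/data-backup                        # discard the snapshot

The snapshot only exists for the duration of the copy; the backup lives
elsewhere.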

> This is the key problem with your thinking here (unfortunately you are
> not 'alone' with this thinking)

Yeah, maybe you should stop jumping to conclusions and learn to read
better.

>> My problem was system hangs, but my question was about limiting
>> snapshot size on thin.
>
> Well your problem primarily is usage of too old system....

I said "was", learn to read, Zdenek.

> Sorry to say this - but if you insist to stick with old system

Where did I say that? I said that, back then, I had a 4.4 system that
experienced these issues.

> - ask
> your distro maintainers to do all the backporting work for you - this
> is nothing lvm2 can help with...

I explained to you that our confusion back then was due to my using the
then-current release of Ubuntu Xenial, which had these problems.

I was just responding to an old thread with these conclusions:

1) Our confusion with respect to those "system hangs" was due to the
fact that you didn't know what I was talking about; thus I thought you
were excusing them, when you weren't.

2) My only inquiry had been about preventing snapshot overflow.
Zdenek Kabelac
2017-04-22 21:22:03 UTC
Permalink
On 22.4.2017 at 09:14, Gionatan Danti wrote:
> On 14-04-2017 10:24, Zdenek Kabelac wrote:
>> However there are many different solutions for different problems -
>> and with the current script execution, the user may build his own
>> solution - i.e. call 'dmsetup remove -f' for the running thin volumes,
>> so all instances get an 'error' device when the pool is above some
>> threshold setting (just like old 'snapshot' invalidation worked) -
>> this way the user just kills the thin-volume user task, but still
>> keeps the thin-pool usable for easy maintenance.
>>
>
> This is a very good idea - I tried it and it indeed works.
>
> However, it is not very clear to me what is the best method to monitor
> the allocated space and trigger an appropriate user script (I understand
> that versions > .169 have %checkpoint scripts, but current RHEL 7.3 is
> on .166).
>
> I had the following ideas:
> 1) monitor the syslog for the "WARNING pool is dd.dd% full" message;
> 2) set a higher-than-0 low_water_mark and catch the dmesg/syslog
> "out-of-data" message;
> 3) register with device mapper to be notified.
>
> What do you think is the better approach? If trying to register with device
> mapper, how can I accomplish that?
>
> One more thing: from the device-mapper docs (and indeed as observed in my
> tests), the "pool is dd.dd% full" message is raised one single time: once a
> message has been raised, even if the pool is emptied and refilled, no new
> messages are generated. The only method I found to let the system
> re-generate the message is to deactivate and reactivate the thin pool itself.


ATM there is even bug for 169 & 170 - dmeventd should generate message
at 80,85,90,95,100 - but it does it only once - will be fixed soon...
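
To make the 'dmsetup remove -f' idea quoted above concrete, a rough sketch
of a periodically run check could look like this (VG, pool and thin LV
names are invented, and both the threshold and the error-out policy are
choices the admin has to make):

  #!/bin/sh
  # Poll the thin-pool data usage and replace the thin volume's mapping
  # with an error target once a threshold is crossed (similar to how old
  # snapshot invalidation behaved). The pool itself stays intact for
  # later maintenance.
  THRESHOLD=95
  USED=$(lvs --noheadings -o data_percent vg/pool | tr -d ' ' | cut -d. -f1)
  if [ "${USED:-0}" -ge "$THRESHOLD" ]; then
      dmsetup remove -f vg-thinvol
  fi

On releases that have the threshold scripts mentioned in the quoted mail,
dmeventd can run such a command for you instead of polling.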

>> ~16G so you can't even extend it, simply because it's
>> unsupported to use any bigger size
>
> Just out of curiosity, in such a case, how does one proceed to regain
> access to the data?
>
> And now the most burning question ... ;)
> Given that the thin-pool is monitored and never allowed to fill its
> data/metadata space, how do you consider its overall stability vs
> classical thick LVM?

Not seen metadata error for quite long time...
Since all the updates are CRC32 protected it's quite solid.
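
As a practical aside (a sketch only, with illustrative names): those
checksums are what the thin-provisioning-tools verify, so a pool whose
metadata is suspected to be damaged is normally handled offline along
these lines:

  lvchange -an vg/pool          # the pool must be inactive
  lvconvert --repair vg/pool    # runs thin_repair onto fresh metadata and
                                # keeps the old metadata LV around for inspection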

Regards

Zdenek
Gionatan Danti
2017-04-24 13:49:58 UTC
Permalink
On 22/04/2017 23:22, Zdenek Kabelac wrote:
> ATM there is even bug for 169 & 170 - dmeventd should generate message
> at 80,85,90,95,100 - but it does it only once - will be fixed soon...

Mmm... quite a bug, considering how important monitoring is. All things
considered, what do you feel is the better approach to monitoring? Is it
possible to register for dmevents?
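
For what it is worth, the usual way to 'register' is indirect: let
dmeventd watch the pool and configure an autoextend policy in lvm.conf.
A sketch of the relevant settings (values are arbitrary; check the
lvm.conf shipped with your release):

  # /etc/lvm/lvm.conf (excerpt)
  activation {
      monitoring = 1
      # extend the pool automatically once it is 70% full...
      thin_pool_autoextend_threshold = 70
      # ...growing it by 20% of its current size each time
      thin_pool_autoextend_percent = 20
  }

Anything beyond that (e.g. alerting) still has to be scripted around lvs
or the syslog warnings.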

> Not seen metadata error for quite long time...
> Since all the updates are CRC32 protected it's quite solid.

Great! Are the metadata writes somehow journaled, or are they written in place?

Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2017-04-24 14:48:00 UTC
Permalink
On 24.4.2017 at 15:49, Gionatan Danti wrote:
>
>
> On 22/04/2017 23:22, Zdenek Kabelac wrote:
>> ATM there is even bug for 169 & 170 - dmeventd should generate message
>> at 80,85,90,95,100 - but it does it only once - will be fixed soon...
>
> Mmm... quite a bug, considering how important monitoring is. All things
> considered, what do you feel is the better approach to monitoring? Is it
> possible to register for dmevents?

Not all that big a one - you always get 1 WARNING.
And releases 169 & 170 are clearly marked as developer releases - so they
are meant for testing and discovering these bugs...

>> Not seen metadata error for quite long time...
>> Since all the updates are CRC32 protected it's quite solid.
>
> Great! Are the metadata writes somehow journaled, or are they written in place?

Surely there is a journal.


Zdenek