Post by Zdenek Kabelac
I don't know much about Grub, but I do know its lvm.c by heart now almost :p.
lvm.c by grub is mostly useless...
Then I feel we should take it out and not have grub capable of booting
LVM volumes anymore at all, right.
Post by Zdenek Kabelac
One of the things I don't think people would disagree with would be having one
- autoextend and waiting with writes so nothing fails
- no autoextend and making stuff read-only.
ATM user needs to write his own monitoring plugin tool to switch to
read-only volumes - it's really as easy as running bash script in loop.....
So you are saying every user of thin LVM must do this individually. That
means if there are 10,000 users, you now have 10,000 people needing to
write the same thing, each first having to acquire the knowledge of how
to do it.
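(For reference, the autoextend half of that choice already lives in lvm.conf; the values below are only examples, not defaults:)

```
# /etc/lvm/lvm.conf, activation section -- example values
thin_pool_autoextend_threshold = 80   # start extending when the pool hits 80%
thin_pool_autoextend_percent = 20     # grow the pool by 20% each time
```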
I take it by that loop you mean a sleep loop. It might also be that
logtail thing and then check for the dmeventd error messages in syslog.
Right? And then when you find this message, you remount ro. You have to
test a bit to make sure it works and then you are up and running. But
this does imply that this thing is only available to die-hard users. You
first have to be aware of what is going to happen. I tell you, there is
really not a lot of good documentation on LVM okay. I know there is that
LVM book. Let me get it....
First hit is CentOS. Second link is reddit. Third link is Redhat. Okay
it should be "lvm guide" not "lvm book". It hasn't been updated since 2006
and has no advanced information other than how to compile and install....
I mean: http://tldp.org/HOWTO/LVM-HOWTO/. So what people are really
going to know this stuff except the ones that are on this list?
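To make that concrete: I assume the kind of loop meant here is something like the sketch below. The names (vg/pool, /mnt/thin) and the 95% threshold are my guesses, not anything documented.

```shell
#!/bin/bash
# Poll the thin pool fill level and remount the filesystem read-only
# above a threshold. vg/pool and /mnt/thin are placeholder names.
THRESHOLD=95
while true; do
    # data_percent comes back like " 97.42"; strip spaces, keep integer part
    used=$(lvs --noheadings -o data_percent vg/pool | tr -d ' ')
    if [ "${used%%.*}" -ge "$THRESHOLD" ]; then
        mount -o remount,ro /mnt/thin
        break
    fi
    sleep 10
done
```

And that is exactly the kind of thing nobody should have to discover by reading the source.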
Unless you experiment, you won't know what will happen to begin with.
For instance, different topic, but it was impossible to find any real
information on LVM cache.
So now you want every single admin to have the knowledge (that you
obviously do have, but you are its writers and maintainers, its gods and
cohorts) to create a manual script, no matter how simple, that will
check the syslog, that you can only really know about by checking the
fucking source or running tests and then see what happens (and be smart
enough to check syslog) -- and then of course to write either a service
file for this script or put it in some form of rc.local.
Well that latter is easy enough even on my system (I was not even sure
whether that existed here :p).
But knowing about this stuff doesn't come by itself. You know. This
doesn't just fall from the sky.
I would probably be more than happy to write documentation at some point
(because I guess I did go through all of that to learn, and maybe others
shouldn't or won't have to?) but without this documentation, or this
person leading the way, this is not easy stuff.
Also "info" still sucks on Linux, the only really available resource
that is easy to use are man pages. It took me quite some time to learn
about all the available lvm commands to begin with (without reading an
encompassing manual), and imagine my horror when I was used to
Debian/Ubuntu systems automatically activating the vg upon opening a
LUKS container, but then the OpenSUSE rescue environment not doing that.
How to find out about vgchange -ay without having internet
access.........
It was impossible.
So for me it has been a hard road to begin with and I am still learning.
In fact I *had* read about vgchange -ay but that was months prior and I
had forgotten. Yes, bad sysadmin.
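For the record, the incantation I was missing in that rescue environment was roughly this (the device and mapper names are made up for the example; yours will differ):

```
cryptsetup luksOpen /dev/sda2 cryptvg   # open the LUKS container
vgchange -ay                            # activate all discovered volume groups
mount /dev/mapper/vg-root /mnt          # mapper name depends on your setup
```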
Every piece of effort a user has to expend on his own is a piece of
effort that could have been prevented by a developer, or possibly by a
(documentation) writer if such a thing could exist. And I know I can't
do it yet, if that is what you are asking or thinking.
Post by Zdenek Kabelac
We call them 'Request For Enhancements' BZ....
You mean you have a non-special non-category that only distinguishes
itself by having a [RFE] tag in the bug name, and that is your special
feature? (laughs a bit).
I mean I'm not saying it has to be anything special and if you have a
small system maybe that is enough.
But Bugzilla is just not an agreeable space to really inspire or invite
positive feedback like that.... I mean I too have been using bugzillas
for maybe a decade or longer. Not as a developer mostly, as a user. And
the thing is just a cynical place. I mean, LOOK at Jira:
https://issues.apache.org/jira/browse/log4j2/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel
Just an example. A "bug" is just one out of many categories. They have
issue types for Improvements, Brainstorming, New Feature, Question,
Story, and Wish. It is so entirely inviting to do whatever you want to
do. In BugZilla, a feature request is still just a bug. And in your
RedHat system, you just have added some field called "doc type" that
you've set to "enhancement" but that's it.
And a bug is a failure, it is a fault. The system is not meant for
positive feedback, only negative feedback in that sense. The user
experience of it is just vastly detrimental compared to that other
thing....
Well I didn't really want to go into this, but since you invited it
:pp....
But it is also meant for the coming thing. And I apologize.
Post by Zdenek Kabelac
First what I proposed would be for every thin volume to have a spare chunk.
But maybe that's irrelevant here.
Well the question was not asking for your 'technical' proposal, as you
have no real idea how it works and your visions/estimations/guesses
have no use at all (trust me - far deeper thinking was considered so
don't even waste your time to write those sentences...)
Well you can drop the attitude you know. If you were doing so great, you
would not be having a total lack of all useful documentation to begin
with. You would not have a system that can freeze the entire machine by
default, because "policy" is apparently not well done.
You would not be having to debate how to make the system even a little
bit safer, and excuse yourself every three lines by saying that it's the
admin's job to monitor his system, not your job to make sure he doesn't
need to do all that much, or your job to make sure the system is
fail-safe to begin with.
I mean I understand that it is a work in progress. But then don't act
like it is finished, or that it is perfect provided the administrator is
perfect too.
If I'm trying to do anything here, it is to point out that the system is
quite lacking by default. You say "policy, policy, policy" as though you
are very tired. And maybe I'm a bit less so, I don't know. And I know it
can be tiresome to have to make these... call them fine-tunings to
make sure they work well by default on every system. Especially, I don't
know. If it is a work in progress and not meant to be used by people not
willing to invest as much as you have (so to speak).
And I'm not saying you are doing a bad job in developing this. I think
LVM is one of the more sane systems existing in the Linux world today. I
mean, I wouldn't be here if I didn't like it, or if I wasn't grateful
for your work.
I think the commands themselves and their way of being used, is
outstanding, they are intuitive, they are much better than many other
systems out there (think mdadm). It takes hardly any pain to remember how
to use e.g. lvcreate, or vgcreate, or whatever. It is intuitive, it is
nice, sometimes you need a little lookup, and that is fast too. It is
bliss to use compared to other systems, certainly. Many of the
rudimentary things are possible, and the system is so nicely modular and
layered that it is always obvious what you need to do at whatever point.
Post by Zdenek Kabelac
Also forget you write a new FS - thinLV is block device so there is no
such think like 'fs allocates' space on device - this space is meant
to be there....
In this case, provided indeed none of that would happen (that we talked
about earlier), the filesystem doesn't NEED to allocate anything. But it
DOES know which parts of the block space it already has in use and which
parts it doesn't. If it is aware of this, and aware of the "real block
size" of the underlying device (which did do a form of allocation, as
LVM thin does), then suddenly it doesn't NEED to know about this
allocation other than to know that it is happening; it only needs to
know the alignment of the real blocks.
Of course that means some knowledge of the underlying device, but as
has been said earlier (by that other guy that supported it) this
knowledge is already there at some level and it would not be that weird.
Yes it is that "integration" you so despise.
You are *already* integrating e.g. extfs to more closely honour the
extent boundaries so that it is more efficient. What I am saying is not
at all out of the ordinary with that. You could not optimize if the
filesystem did not know about alignment, and if it could not "direct"
'allocation' into those aligned areas. So the filesystem already knows
what is going to happen down beneath, and it has the knowledge to choose
not to write to new areas unless it has to. You *told* me so.
That means it can also choose not to write to any NEW "aligned" blocks.
So you are just standing on principle here. You attack the idea based on the
fact that "there is no real allocation taking place of the block device
by the filesystem". But if you drop the word, there is no reason to
disagree with what I said.
The filesystem KNOWS allocation is getting done (or it could know) and
if it knows about the block alignment of those extents, then it does not
NEED to have intimate knowledge of the ACTUAL allocation getting done by
the thin volume in the thin pool.
So what are you really disagreeing with here? You are just being
pedantic right? You could tell the filesystem to enter
no-allocation-mode or no-write-to-new-areas-mode (same thing here) or
"no-cause-allocation-mode" (same thing here).
And it would work.
Even if you disagree with the term, it would still work. At least, as
far as we go here.
You never said it wouldn't work. You just disagreed with my use of
wording.
Post by Zdenek Kabelac
You have 2 thinLVs.
Origin + snapshot.
You write to origin - and you miss to write a block.
Such block may be located in 'fs' journal, it might be a 'data' block,
or fs metadata block.
Each case may have different consequences.
But that is for the filesystem to decide. The thin volume will not know
about the filesystem. In that sense. Layers, remember?
Post by Zdenek Kabelac
When you fail to write an ordinary (non-thin) block device - this
block is then usually 'unreadable/error' - but in thinLV case - upon
read you get previous 100% valid' content - so you may start to
imagine where it's all heading.
So you mean that "unreadable/error" signifies some form of "bad sector"
error. But if you fail to write to thinLV, doesn't that mean (in our
case there) that the block was not allocated by thinLV? That means you
cannot read from it either. Maybe bad example, I don't know.
Post by Zdenek Kabelac
Basically solving these troubles when pool is 'full' is 'too late'.
If user wants something 'reliable' - he needs to use different thresholds -
i.e. stopping at 90%....
Well I will try to look into it more when I have time. But I don't
believe you. I don't see a reason from the outset why it should or would
need to be so. There should be no reason a write fails unless an
allocate fails. So how could you ever read from it (unless you read
random or white data). And, provided the filesystem does try to read
from it; why would it do so if its write failed before that?
Maybe that is what you alluded to before, but a filesystem should be
able to solve that on its own without knowing those details I think. I
believe quite usually inodes are written in advance? They are not
growth-scenarios. So this metadata cannot fail to write due to a failed
block level allocate. But even that should be irrelevant for thin LVM
itself.....
Post by Zdenek Kabelac
But other users might be 'happy' with missing block (failing write
area) and rather continue to use 'fs'....
But now you are talking about human users. You are now talking about an
individual that tries to write to a thin LV, it doesn't work because the
thing is full, and he/she wants to continue to use the 'fs'. But that is
what I proposed right. If you have a fail-safe system, if you have a
system that keeps functioning even though it blocks growth writes, then
you have the best of both worlds. You have both.
It is not either/or. What I was talking about is both. You have
reliability and you can keep using the filesystem. The filesystem just
needs to be able to cope with the condition that it cannot use any new
blocks from the existing pool that it knows about. That is not very
different from having exhausted its block pool to begin with. It is
really the same condition, except right now it is rather artificial.
You artificially tell the FS: you are out of space. Or, you may not use
new (alignment) blocks. It is no different from having no free blocks at
all. The FS could deal with it in the same way.
Post by Zdenek Kabelac
You have many things to consider - but if you make policies too complex,
users will not be able to use it.
Users are already confused with 'simple' lvm.conf options like
'issue_discards'....
I understand. But that is why you create reasonable defaults that work
well together. I mean, I am not telling you you can't, or have done a
bad job in the past, or are doing a bad job now.
But I'm talking mostly about defaults. And right now I was really only
proposing this idea of a filesystem state that says "Me, the filesystem,
will not allocate any new blocks for data that are in alignment with the
underlying block device. I will not use any new (extents) from my block
device even though normally they would be available to me. I have just
been told there might be an issue, and even though I don't know why, I
will just accept that and try not to write there anymore".
It is really the simplest idea there can be here. If you didn't have
thin, and the filesystem was full, you'd have the same condition.
It is just a "stop expanding" flag.
Post by Zdenek Kabelac
Personally, I feel the condition of a filesystem getting into a "cannot
allocate" state, is superior.
As said - there is no thin-volume filesystem.
Can you just cut that, you know. I know the filesystem does not
allocate. But it does know, or can know, allocation will happen. It
might be aware of the "thin" nature, and even if it didn't, it could
still honour such a flag even if it wouldn't make sense for it.
Post by Zdenek Kabelac
However in this case it needs no other information. It is just a state. It
knows: my block devices has 4M blocks (for instance), I cannot get new ones
Your thinking is from 'msdos' era - single process, single user.
You have multiple thin volumes active, with multiple different users
all running their jobs in parallel and you do not want to stop every
user when you are recomputing space in pool.
There is really no much point in explaining further details unless you are
willing to spend your time understanding deeply surrounding details.
You are using details to escape the necessity that the overlying or
encompassing framework dictates that things do currently not work.
That is like using the trees to say that there is no forest.
Or not seeing the forest for the trees. That is exactly what it means. I
know I am a child here. But do not ignore the wisdom of a child. The
child knows more than you do. Even if it has much less data than you do.
The whole reason a child *can* know more is because it has less data.
Because of that, it can still see the outline, while you may no longer
be able to, because you are deep within the forest.
That's exactly what that saying means.
If you see planet earth from space and you see that it is turning or
maybe you can see its ice caps are melting. And then someone on earth
says "No that is not happening because such and such is so". Who is
right? The one with the overview, or the one with the details?
An outsider can often perceive directly what is the nature of something.
Only at the outside, of course. But he/she can clearly see whether it is
left or right, big or small, cold or hot. It may not know why it is
being hot or cold, but it does know that it is being cold or hot. And
the outsider may see there should be no reason why something cannot be
so.
If details are in the way, change the details.
By the above, with "user" you seem to mean a real human user. But a
filesystem queues requests, it does not have multiple users. It needs to
schedule whatever it is doing, but it all has to go through the same
channel, ending up on the same disk. So from this perspective, the only
relevant users are the various filesystems. This must be so, because if
two operating systems mount the same block device twice, you get mayhem.
So the filesystem driver is the channel. Whether it is one multitasking
process or multiple users doing the same thing, is irrelevant. Jobs, in
this sense, are also irrelevant. What is relevant is writes to different
parts, or reads from different parts.
But supposing those multiple users are multiple filesystems using the
same thin pool. Okay you have a point, perhaps. And indeed I do not know
about any delays in space calculations. I am just approaching this from
the perspective of a designer. I would not design it such that the data
on the amount of free extents, would at any one time be unavailable. It
should be available to all at any one time. It is just a number. It does
not or should not need recomputation. I am sorry if that is incorrect
here. If it does need recomputation, then of course what you say makes
sense (even to me) and that you need a time window to prepare for
disaster; to anticipate.
I don't see why a value like the number of free extents in a pool would
need recomputation though, but that is just me. Even if you had
concurrent writes (allocations/expansions) you should be able to deal
with that, people do that all the time.
The number of free extents is simply a given at any one time right?
Unless freeing them is a more involved operation. I'm just trying to
show you that there shouldn't need to be any problems here with this
idea.
Allocations should be atomic and even if they are concurrent, the
updating of this information shouldn't be concurrent. It is a single
number, only one person can change it at a time. It's a single number,
even if you wrote 10 million blocks concurrently, your system should be
able to change/increment that number 10 million times in the same time.
Right? I know you will say wrong. But this seems extraordinarily
strange to me.
I mean I am still wholly unaware of how concurrency works in the kernel
(except that I know the terms) (because I've been reading some code)
(such as RCU, refcount, spinlock, mutex, what else) but I doubt this
would be a real issue if you did it right, but that's just me.
If you can concurrently traverse data structures and keep everything
working in pristine order, you know, why shouldn't you be able to
'concurrently' update a number.
Maybe that's stupid of me, but it just doesn't make sense to me.
Post by Zdenek Kabelac
That seems pretty trivial. The mechanic for it may not. It is
preferable in my
view if the filesystem was notified about it and would not even *try* to write
There is no 'try' operation.
You have seen Star Wars too much. That statement is misunderstood, Yoda
tells a falsehood there.
There is a write operation that can fail or not fail.
Post by Zdenek Kabelac
It would probably O^2 complicate everything - and the performance would
drop by major factor - as you would need to handle cancellation....
Can you only think in troubles and worries? :P. I see you mean (I think)
that some writes would succeed and some would fail and that that would
complicate things? Other than that there is not much difference with a
read-only filesystem right?
A filesystem that cannot even write to any new blocks is dead anyway.
Why worry about performance in any case. It's a form of read-only mode
or space-full mode that is not very different from existing modes. It's
a single flag. Some writes succeed, some writes fail. System is almost
dead to begin with, space is gone. Applications start to crash left and
right. But at least the system survives.
Not sure what cancellation you are talking about or if you understood
what I said before.....
Post by Zdenek Kabelac
For simplicity here - just think about failing 'thin' write as a disk
with 'write' errors, however upon read you get last written
content....
So? And I still cannot see how that would happen. If the filesystem had
not actually written to a certain area, it would also not try to read,
right? Otherwise, the whole idea of "lazy allocation" of extents is
impossible. I don't actually know what happens if you "read" the entire
thin LV, and you could, but blocks that have never been allocated (by
thin LV) should just return zero. I don't think anything else would
happen?
I mean, there we go again: And of course the file contains nothing but
zeroes, duh. Reading from a "nonwritten" extent just returns zero space.
Obvious.
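That claim is easy enough to check on a freshly created thin LV, something like this (names are placeholders, and it needs a real thin pool, so consider it untested):

```
lvcreate -V 1G -T vg/pool -n scratch    # new, never-written thin volume
dd if=/dev/vg/scratch bs=1M count=1 2>/dev/null | hexdump -C
# expectation: nothing but 00 bytes, i.e. unprovisioned space reads as zeros
```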
There is no reason why a thin write should fail if it has succeeded
before to the same area. I mean, what's the issue here, you don't really
explain. Anyway I am grateful for your time explaining this, but it just
does not make much sense.
Then you can say "Oh I give up", but still, it does not make much sense.
Post by Zdenek Kabelac
'extX' will switch to 'ro' upon write failure (when configured this way).
Ah, you mean errors=remount-ro. Let me see what my default is :p. (The
man page does not mention the default, very nice....).
Oh, it is continue by default. Obvious....
In any case, that means if it had a 3rd mount option type (like rw,
ro, ..... rp for "read/partial" ;-)), it could also remount rp on
errors ;-).
Thanks for the pointers all.
Post by Zdenek Kabelac
'XFS' in 'most' cases now will shutdown itself as well (being improved)
extX is better since user may still continue to use it at least in
read-only mode...
Thanks. That is very welcome. But I need to be a complete expert to be
able to use this thing. I will write a manual later :p. (If I'm still
alive).
Post by Zdenek Kabelac
It seems completely obvious to me at this point, if anything from LVM (or
e.g. dmeventd) could signal every filesystem on every affected thin volume, to
enter a do-not-allocate state, and filesystems would be able to fail writes
based on that, you would already have a solution right?
'bash' loop...
I guess your --errorwhenfull y, combined with tune2fs -e remount-ro,
would also do the trick, but that works on ALL filesystem errors.
Like I said, I haven't tested it yet. Maybe we are covering nonsensical
ground here.
But a bash loop is no solution for a real system.....
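Spelled out, the untested combination I mean would be something like this (the pool and LV names are placeholders):

```
lvchange --errorwhenfull y vg/pool      # fail writes immediately when full
tune2fs -e remount-ro /dev/vg/thinlv    # extX: go read-only on first error
```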
Yes thanks for pointing it out to me. But this email is getting way too
long for me.
Anyway, we are also converging on the solution I'd like, so thank you
for your time here regardless.
Post by Zdenek Kabelac
Remember - not writing 'new' fs....
Never said I was. New state for existing fs.
Post by Zdenek Kabelac
You are preparing for lost battle.
Full pool is simply not a full fs.
And thin-pool may get out-of-data or out-of-metadata....
Does not have to be any different when the filesystem thinks and says it
is full.
You are not going from full pool to full filesystem. The filesystem is
not even full.
You are going from full pool, to a message to filesystems to enter
no-expand-mode (no-allocate-mode), which will then simply cease growing
into new "aligned" blocks.
What does it even MEAN to say that the two are not identical? I never
talked about the two being identical. It is just an expansion freeze.
Post by Zdenek Kabelac
That would normally mean that filesystem operations such as DELETE would still
You really need to sit and think for a while what the snapshot and COW
does really mean, and what is all written into a filesystem (included
with journal) when you delete a file.
Too tired now. I don't think deleting files requires growth of
filesystem. I can delete files on a full fs just fine.
You mean a deletion on origin can cause allocation on snapshot.
Still that is not a filesystem thing, that is a thin-pool thing.
That is something for LVM to handle. I don't think this delete would
fail, would it? If the snapshot is a block thing, it could write the
changed inodes of the file and its directory.... it would only overwrite
the actual data if that block was overwritten on origin.
So you run the risk of extent allocation for inodes.
But you have this problem today as well. It means clearing space could
possibly need a work buffer. Some workspace.
You would need to pre-allocate space for the snapshot, as a practical
measure. But that's not really a real solution.
The real solution is to buffer it in memory. If the deletes free space,
you get free extents that you can use to write the memory buffered data
(metadata). That's the only way to deal with that. You are just talking
inodes (and possibly journal).
(But then how is the snapshot going to know these are deletes? In any
case, you'd have the same problems with regular writes to origin. So I
guess with snapshots you run into more troubles.)
I guess with snapshots you either drop the snapshots or freeze the
entire filesystem/volume? Then how will you delete anything?
You would either have to drop a snapshot, drop a thin volume, or copy
the data first and then do that.
Right?
Too tired.
Post by Zdenek Kabelac
But one of our 'policies' visions are to also use 'fstrim' when some
threshold is reached or before thin snapshot is taken...
A filesystem mounted with the discard option will automatically do
that, right, with a slight delay, so to speak.
I guess it would be good to do that, or warn the user to mount with
"discard" option.
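I.e., either of these (device and mount point are placeholders), with the usual trade-off that online discard adds a little overhead per delete while batched fstrim returns space only periodically:

```
mount -o discard /dev/vg/thinlv /mnt/thin   # online discard on every delete
fstrim /mnt/thin                            # or batch trim via cron/systemd timer
```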