Discussion:
[linux-lvm] Reserve space for specific thin logical volumes
Gionatan Danti
7 years ago
Hi list,
as per the subject: is it possible to reserve space for specific thin
logical volumes?

This can be useful to "protect" critical volumes from having their space
"eaten" by other, potentially misconfigured, thin volumes.

Another, somewhat more convoluted, use case is to prevent snapshot
creation when thin pool space is too low, which would cause the pool to
fill up completely (with all the associated dramas for the other thin
volumes).
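
What I am thinking of is a wrapper along these lines (only a sketch; the
pool name vg/thinpool, the 90% threshold and the arguments are made up):

  #!/bin/bash
  # Hypothetical guard: refuse to create a snapshot when the pool is too full.
  POOL="vg/thinpool"
  LIMIT=90
  # data_percent is a standard lvs reporting field
  USED=$(lvs --noheadings -o data_percent "$POOL" | tr -d ' ' | cut -d. -f1)
  if [ "${USED:-0}" -ge "$LIMIT" ]; then
      echo "Pool $POOL is ${USED}% full, refusing to snapshot" >&2
      exit 1
  fi
  lvcreate -s -n "$2" "$1"    # e.g. ./snap-guard.sh vg/data data_snap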

Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Xen
7 years ago
Post by Gionatan Danti
Hi list,
as by the subject: is it possible to reserve space for specific thin
logical volumes?
This can be useful to "protect" critical volumes from having their
space "eaten" by other, potentially misconfigured, thin volumes.
Another, somewhat more convoluted, use case is to prevent snapshot
creation when thin pool space is too low, causing the pool to fill up
completely (with all the associated dramas for the other thin
volumes).
For my 'ideals', thin space reservation (which would work like allocation
in advance) would definitely be a welcome thing.

You can also think of it in terms of a default pre-allocation setting.
I.e. every volume keeps a bit of space over-allocated, but only does
so if there is actually room in the thin pool (some kind of lazy
allocation?).

Of course I am not trying to steal your question here, and I do not know if
any such thing is possible, but it might be, and I wouldn't mind hearing
the answer as well.

No offense intended. Regards.
Gionatan Danti
7 years ago
Post by Gionatan Danti
Hi list,
as by the subject: is it possible to reserve space for specific thin
logical volumes?
This can be useful to "protect" critical volumes from having their
space "eaten" by other, potentially misconfigured, thin volumes.
Another, somewhat more convoluted, use case is to prevent snapshot
creation when thin pool space is too low, causing the pool to fill up
completely (with all the associated dramas for the other thin
volumes).
Thanks.
Hi all,
anyone with some information?

Any comment would be much appreciated :)
Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
7 years ago
...
Hi


Not sure which information you are looking for ??

Having 'reserved' space for a thinLV means you have to add more space
to this thin-pool - there is not much point in keeping space in the VG
which could only be used for the extension of one particular LV ??

What we do have, though, is the 'shared' '_pmspare' extra space for metadata
recovery, but there is nothing like that for data space (and not even planned).

There is support in-plan for so-called fully-provisioned thinLVs within a
thin-pool, but that probably doesn't suit your needs.


The first question here is - why do you want to use thin-provisioning ?

As thin-provisioning is about 'promising the space you can deliver later when
needed' - it's not about hidden magic to make space out of nowhere.
The idea of planning to operate a thin-pool at the 100% fullness boundary is
simply not going to work well - it's not been designed for that use-case - so
if that has been your plan, you will need to look for another solution.
(Unless you are after those 100% provisioned devices.)

Regards


Zdenek
Xen
7 years ago
Post by Zdenek Kabelac
As thin-provisioning is about 'promising the space you can deliver
later when needed' - it's not about hidden magic to make the space
out-of-nowhere.
The idea of planning to operate thin-pool on 100% fullness boundary is
simply not going to work well - it's not been designed for that
use-case
I am going to rear my head again and say that a great many people would
probably want a thin-provisioning that does exactly that ;-).

I mean you have it designed for auto-extension but there are also many
people that do not want to auto-extend and just share available
resources more flexibly.

For those people safety around 100% fullness boundary becomes more
important.

I don't really think there is another solution for that.

I don't think BTRFS is really a good solution for that.

So what alternatives are there, Zdenek? LVM is really the only thing
that feels "good" to us.

Are there structural design inhibitions that would really prevent this
thing from ever arising?
Zdenek Kabelac
7 years ago
Post by Xen
Post by Zdenek Kabelac
As thin-provisioning is about 'promising the space you can deliver
later when needed'  - it's not about hidden magic to make the space
out-of-nowhere.
The idea of planning to operate thin-pool on 100% fullness boundary is
simply not going to work well - it's  not been designed for that
use-case
I am going to rear my head again and say that a great many people would
probably want a thin-provisioning that does exactly that ;-).
Wondering from where they could get this idea...
We always communicate this clearly - do not plan to use a 100% full,
unresizable thin-pool as part of a regular work-flow - it's always a critical
situation, often even leading to a system reboot and a full check of all volumes.
Post by Xen
I mean you have it designed for auto-extension but there are also many people
that do not want to auto-extend and just share available resources more flexibly.
For those people safety around 100% fullness boundary becomes more important.
I don't really think there is another solution for that.
I don't think BTRFS is really a good solution for that.
So what alternatives are there, Zdenek? LVM is really the only thing that
feels "good" to us.
A thin-pool needs to be ACTIVELY monitored, and you proactively either add
more PV free space to the VG or eliminate unneeded 'existing' provisioned
blocks (fstrim, dropping snapshots, removal of unneeded thinLVs... - whatever
comes to your mind to make more free space in the thin-pool). lvm2 now fully
supports calling 'smart' scripts directly out of dmeventd for such actions.


It's an illusion to hope anyone will be able to operate an lvm2 thin-pool at
100% fullness reliably - there should always be enough room to give the
'scripts' reaction time to gain some more space in time - so the thin-pool can
serve free chunks for provisioning - that's been the design - to deliver
blocks when needed, not to break the system.
Post by Xen
Are there structural design inhibitions that would really prevent this thing
from ever arising?
Yes, performance and resources consumption.... :)

And there is a fundamental difference between a full 'block device' sharing
space with other devices, compared with a single full filesystem - you can't
compare these 2 things at all.....


Regards


Zdenek
Xen
7 years ago
Post by Zdenek Kabelac
Wondering from where they could get this idea...
We always communicate clearly - do not plan to use 100% full
unresizable thin-pool as a part of regular work-flow
No one really PLANS for that.

They probably plan for some 80% usage or less.

But they *do* use thin provisioning for over-provisioning.

So the issue is runaway processes.

Typically the issue won't be "as planned" behaviour.

I still intend to write better monitoring support for myself if I
ever get the chance to code again.
Post by Zdenek Kabelac
- it's always
critical situation often even leading to system's reboot and full
check of all volumes.
I know that but the issue is to prevent the critical situation (if the
design should allow for that).

TWO levels of failure:

- Filesystem level failure
- Block layer level failure

File system level failure can also be non-critical, because it hits a
non-critical volume; LVM might fail even though the filesystem or the
applications do not.

Block layer failure is much more serious, and can prevent the system
from recovering when it otherwise could.
Post by Zdenek Kabelac
Thin-pool needs to be ACTIVELY monitored
But monitoring is a labour-intensive task unless monitoring systems are in
place with email reporting and so on.

Do those systems exist? Do we have them available?

I know I wrote one the other day and it is still working so I am not so
much in a problem right now.

But in general it is still a poor solution for me, because I didn't
develop it further and it is just a Bash script that reads LVM's older
reporting output from the syslog (systemd-journald).
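
In essence it is not much more than something like this (a simplified
sketch; the journal identifiers, the message pattern and the mail address
are assumptions and differ between versions):

  #!/bin/bash
  # Follow the journal and mail a warning when dmeventd reports pool fullness.
  journalctl -f -t lvm -t dmeventd | \
  while read -r line; do
      case "$line" in
          *"is now"*"% full"*)
              echo "$line" | mail -s "thin pool warning on $(hostname)" root
              ;;
      esac
  done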
Post by Zdenek Kabelac
and proactively either added
more PV free space to the VG
That is not possible in the use case described. Not all systems have more
space instantly available, or are even able to expand, and they may still
want to use LVM thin provisioning because of the flexibility it provides.
Post by Zdenek Kabelac
or eliminating unneeded 'existing'
provisioned blocks (fstrim
Yes that is very good to do that, but also needs setup.
Post by Zdenek Kabelac
, dropping snapshots
Might also be good in more fully-fledged system.
Post by Zdenek Kabelac
, removal of unneeded
thinLVs....
Only manual intervention this one... and last resort only to prevent
crash so not really useful in general situation?
Post by Zdenek Kabelac
- whatever comes on your mind to make a more free space
in thin-pool
I guess but that is lot of manual intervention. We like to also be safe
in case we're sleeping ;-).
Post by Zdenek Kabelac
- lvm2 fully supports now to call 'smart' scripts
directly out of dmeventd for such action.
Yes that is very good, thank you for that. I am still on older LVM
making use of existing logging feature, which also works for me for now.
Post by Zdenek Kabelac
It's illusion to hope anyone will be able to operate lvm2 thin-pool at
100% fullness reliable
That's not what we want.

100% is not the goal. It is an exceptional situation to begin with.
Post by Zdenek Kabelac
- there should be always enough room to give
'scripts' reaction time
Sure but some level of "room reservation" is only to buy time -- or
really perhaps to make sure main system volume doesn't crash when data
volume fills up by accident.

But system volumes already have reserved space at the filesystem level.

But do they also have this space reserved in actuality? I doubt it. Not
on the LVM level.

So it is only to mirror that filesystem feature.

Now you could do something on the filesystem level to ensure that those
blocks are already allocated on LVM level, that would be good too.
Post by Zdenek Kabelac
to gain some more space in-time
Yes email monitoring would be most important I think for most people.
Post by Zdenek Kabelac
- so thin-pool can
serve free chunks for provisioning - that's been design
Aye, but does the design have to be a complete failure when space runs out?

I am just asking whether or not there is a clear design limitation that
would ever prevent safety in operation when 100% full (by accident).

You said before that there was design limitation, that concurrent
process cannot know whether the last block has been allocated.
Post by Zdenek Kabelac
- to deliver
blocks when needed,
not to brake system
But it's exceptional situation to begin with.
Post by Zdenek Kabelac
Post by Xen
Are there structural design inhibitions that would really prevent this
thing from ever arising?
Yes, performance and resources consumption.... :)
Right, that was my question I guess.

So you said before it was a concurrent thread issue.

Concurrent allocation issue using search algorithm to find empty blocks.
Post by Zdenek Kabelac
And there is fundamental difference between full 'block device' sharing
space with other device - compared with single full filesystem - you
can't compare these 2 things at all.....
You mean BTRFS being full filesystem.

I still think theoretically solution would be easy if you wanted it.

I mean I have been programmer for many years too ;-).

But it seems to me desire is not there.
Xen
7 years ago
Post by Xen
But system volumes already have reserved space filesystem level.
But do they also have this space reserved in actuality? I doubt it.
Not on the LVM level.
So it is only to mirror that filesystem feature.
Now you could do something on the filesystem level to ensure that
those blocks are already allocated on LVM level, that would be good
too.
This made no sense, sorry.

No system should really run its main system volume on LVM thin (or at least
there is no great need for it), so the typical failure case would be:

- data volume fills up
- entire system crashes

THAT is the only problem LVM has today.

It's not just that the thin pool is going to be unreliable,

but that it also causes a kernel panic in due time. Usually within 10-20
seconds.
Zdenek Kabelac
7 years ago
Post by Xen
Post by Zdenek Kabelac
Wondering from where they could get this idea...
We always communicate clearly - do not plan to use 100% full
unresizable thin-pool as a part of regular work-flow
No one really PLANS for that.
They probably plan for some 80% usage or less.
Thin-provisioning is - about 'postponing' available space to be delivered in
time - let's have an example:

You order some work which cost $100.
You have just $30, but you know, you will have $90 next week -
so the work can start....

But it seems some users know it will cost $100, but they still think the work
could be done with $10 and it will 'just' work the same....

Sorry it won't....
Post by Xen
But they *do* use thin provisioning for over-provisioning.
No one is blaming anyone for over-provisioning - but using over-provisioning
without a plan for adding the space in case it is really needed -
that's the main issue and problem here.

thin-provisioning is giving you extra TIME - not SPACE :)
Post by Xen
File system level failure can also not be critical because of using
non-critical volume because LVM might fail even though filesystem does not
fail or applications.
So my laptop machine has 32G RAM - you can have 60% of that as dirty pages,
and those may raise a pretty major 'provisioning' storm....
Post by Xen
Block level layer failure is much more serious, and can prevent system from
recovering when it otherwise could.
Yep - the idea is - when the thin-pool gets full - it will stop working,
but you can't rely on a 'usable' system when this happens....

Of course - it differs case by case - if you run your /root volume
out of such an overfilled thin-pool, you have a much bigger set of problems
compared with a user who just has some mounted data volume there, while
the rest of the system is sitting on some 'fully provisioned' volume....

But we are talking about the generic case here, not about some individual
sub-cases where some limitation might give you a better chance of rescue...
Post by Xen
That is not possible in the use case described. Not all systems have instantly
more space available, or even able to expand, and may still want to use LVM
thin provisioning because of the flexibility it provides.
Again - it's the admin gambling here - if he lets the system over-provision
and doesn't have a 'backup' plan - you can't blame lvm2 for that.....
Post by Xen
Only manual intervention this one... and last resort only to prevent crash so
not really useful in general situation?
Let's simplify it for the case:

You have 1G thin-pool
You use 10G of thinLV on top of 1G thin-pool
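
In lvm2 commands that is simply something like (a sketch, names made up):

  lvcreate -L 1G  -T vg/pool              # thin-pool with 1G of real data space
  lvcreate -V 10G -T vg/pool -n thinlv    # thinLV promising 10G on top of it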

And you ask for 'sane' behavior ??

You would probably have to write your own whole Linux kernel - to continue
working reasonably well when 'write-failures' start to appear.

It's completely out of the hands of dm/lvm2.....

The most 'sane' is to stop and reboot and fix missing space....

Any idea of having 'reserved' space for 'prioritized' applications and other
crazy ideas leads to nowhere.

Actually there is very good link to read about:

https://lwn.net/Articles/104185/

Hopefully this will bring your mind further ;)
Post by Xen
Post by Zdenek Kabelac
- lvm2 fully supports now to call 'smart' scripts
directly out of dmeventd for such action.
Yes that is very good, thank you for that. I am still on older LVM making use
of existing logging feature, which also works for me for now.
Well yeah - it's not much use to discuss solutions for old releases of lvm2...

Lvm2 should be compilable and usable on older distros as well - so upgrade and
do not torture yourself with older lvm2....
Post by Xen
Post by Zdenek Kabelac
It's illusion to hope anyone will be able to operate lvm2 thin-pool at
100% fullness reliable
That's not what we want.
100% is not the goal. Is exceptional situation to begin with.
And we believe it's fine to solve the exceptional case by a reboot.
The effort you would need to put into solving all the kernel corner cases is
absurdly high compared with the fact that it's an exception for a normally
used, configured and monitored thin-pool....

So don't expect the lvm2 team will be solving this - there is higher-priority work....
Post by Xen
Post by Zdenek Kabelac
- there should be always enough room to give
'scripts' reaction time
Sure but some level of "room reservation" is only to buy time -- or really
perhaps to make sure main system volume doesn't crash when data volume fills
up by accident.
If the system volume IS that important - don't use it with over-provisioning!

The answer is that simple.

You can use a different thin-pool for your system LV, where you can maintain
snapshots without over-provisioning.

It's a way more practical solution than trying to fix an OOM problem :)
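
I.e. something along these lines (a sketch, sizes and names made up) - a fully
covered pool for the system LV and its snapshot, and a separate
over-provisioned pool for the data volumes:

  # pool sized so that the 10G system LV plus one snapshot always fits
  lvcreate -L 20G -T vg/syspool
  lvcreate -V 10G -T vg/syspool -n root

  # separate, over-provisioned pool for the non-critical data volumes
  lvcreate -L 100G -T vg/datapool
  lvcreate -V 200G -T vg/datapool -n data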
Post by Xen
Post by Zdenek Kabelac
to gain some more space in-time
Yes email monitoring would be most important I think for most people.
Put mail messaging into plugin script then.
Or use any monitoring software for messages in syslog - this worked pretty
well 20 years back - and hopefully still works well :)
Post by Xen
Post by Zdenek Kabelac
serve free chunks for provisioning - that's been design
Aye but does design have to be complete failure when condition runs out?
YES
Post by Xen
I am just asking whether or not there is a clear design limitation that would
ever prevent safety in operation when 100% full (by accident).
Don't use over-provisioning in case you don't want to see failure.

It's the same as you should not overcommit your RAM in case you do not want to
see OOM....
Post by Xen
I still think theoretically solution would be easy if you wanted it.
My best advice - please try to write it yourself - so you would see more in
depth how your 'theoretical solution' meets reality....


Regards

Zdenek
Xen
7 years ago
Post by Zdenek Kabelac
Thin-provisioning is - about 'postponing' available space to be
delivered in time
That is just one use case.

Many more people probably use it for another use case.

Which is fixed storage space and thin provisioning of the available storage.
Post by Zdenek Kabelac
You order some work which cost $100.
You have just $30, but you know, you will have $90 next week -
so the work can start....
I know the typical use case that you advocate yes.
Post by Zdenek Kabelac
But it seems some users know it will cost $100, but they still think
the work could be done with $10 and it's will 'just' work the same....
No that's not what people want.

People want efficient usage of data without BTRFS, that's all.
Post by Zdenek Kabelac
Post by Xen
File system level failure can also not be critical because of using
non-critical volume because LVM might fail even though filesystem does
not fail or applications.
So my Laptop machine has 32G RAM - so you can have 60% of dirty-pages
those may raise pretty major 'provisioning' storm....
Yes but still system does not need to crash, right.
...
Yes.
Post by Zdenek Kabelac
But we are talking about generic case here no on some individual sub-cases
where some limitation might give you the chance to rescue better...
But no one in his right mind currently runs the /root volume out of a thin
pool, and in pretty much all cases it is probably only used for data, or for
example for hosting virtual hosts/containers/virtualized
environments/guests.

So Data use for thin volume is pretty much intended/common/standard use
case.

Now maybe the number of people that will be able to have a running system
after data volumes overprovision/fill up/crash is limited.

However, from both a theoretical and practical standpoint being able to
just shut down whatever services use those data volumes -- which is only
possible if base system is still running -- makes for far easier
recovery than anything else, because how are you going to boot system
reliably without using any of those data volumes? You need rescue mode
etc.

So I would say it is the general use case where LVM thin is used for
data, or otherwise it is the "special" use case used by 90% of people...


In any case it wouldn't hurt anyone who didn't fall into that "special
use case" scenario, it would benefit everyone.

Unless you are speaking perhaps about unmitigatable performance
considerations.

Then it becomes indeed a tradeoff but you are the better judge of that.
Post by Zdenek Kabelac
Again - it's admin's gambling here - if he let the system
overprovisiong
and doesn't have 'backup' plan - you can't blame here lvm2.....
He might have system backups.

He might be able to recover his system if his system is still allowed to
be logged into.

That should be enough backup plan for most people who do not have
expandable storage.

So maybe this is not main use case for LVM2, but it is still common use
case that people keep asking about. So there is a demand for this.

Normal data volumes filling up is pretty much same situation.

Same user will not have backup plan in case volumes fill up.

Thin provisioning does not make that worse, normally.

That's where we start out from.

Thin provisioning with over-provisioning and expandable storage does
improve that for those people that want to have larger
filesystems to cater to growth.

But people using slightly larger filesystems only for data space sharing
between volumes...

Are trying to get a bit more flexibility (for example for moving data
from partition to partition).

So for example I have 50GB VPS with Thin for data volumes.

If I want to reorganize my data across volumes I only have to ensure
enough space in thin pool, or move in smaller parts so there is enough
space for that.

Then I run fstrim and then everything is alright again.

This is benefit of me for thin pool.

It just makes moving data around a bit (a lot) easier.

So I first check thin space and then do operation.

So the only time when I near the "full" mark is when I do these
operations.

My system is not data intensive (with just 50GB) and does not run quick
risk of filling up -- but it could happen.

So that's all.

Regards.
Zdenek Kabelac
7 years ago
...
What's wrong with BTRFS....

Either you want the fs & block layer tied together - that's the btrfs/zfs approach

or you want

layered approach with separate 'fs' and block layer (dm approach)

If you are advocating here to start mixing the 'dm' and 'fs' layers, just
because you do not want to use 'btrfs', you'll probably not gain much traction
here...
Post by Xen
Post by Xen
File system level failure can also not be critical because of using
non-critical volume because LVM might fail even though filesystem does not
fail or applications.
So my Laptop machine has 32G RAM - so you can have 60% of dirty-pages
those may raise pretty major 'provisioning' storm....
Yes but still system does not need to crash, right.
We need to see EXACTLY which kind of crash you mean.

If you are using some older kernel - then please upgrade first and
provide proper BZ case with reproducer.

BTW you can imagine an out-of-space thin-pool with a thin volume and a
filesystem on it as an FS where some writes end with a 'write-error'.


If you think there is an OS which keeps running uninterrupted while a
number of writes end with 'error' - show it to us :) - maybe we should stop
working on Linux and switch to that (supposedly much better) different OS....
Post by Xen
But we are talking about generic case here no on some individual sub-cases
where some limitation might give you the chance to rescue better...
But no one in his right mind currently runs /rootvolume out of thin pool and
in pretty much all cases probably it is only used for data or for example of
hosting virtual hosts/containers/virtualized environments/guests.
You can have different pools and you can use rootfs with thins to easily test
i.e. system upgrades....
Post by Xen
So Data use for thin volume is pretty much intended/common/standard use case.
Now maybe amount of people that will be able to have running system after data
volumes overprovision/fill up/crash is limited.
Most thin-pool users are AWARE of how to properly use it ;) lvm2 tries to
minimize the (data-loss) impact of misused thin-pools - but we can't spend too
much effort there....

So what is important:
- 'committed' data (i.e. a transaction database) are never lost
- fsck after reboot should work

If either of these 2 conditions does not hold - that's a serious bug.

But if you advocate for continued system use of an out-of-space thin-pool -
then I'd probably recommend you start sending patches... as an lvm2 developer
I'm not seeing this as the best time investment, but anyway...
Post by Xen
However, from both a theoretical and practical standpoint being able to just
shut down whatever services use those data volumes -- which is only possible
Are you aware there is just one single page cache shared for all devices
in your system ?
Post by Xen
if base system is still running -- makes for far easier recovery than anything
else, because how are you going to boot system reliably without using any of
those data volumes? You need rescue mode etc.
Again, do you have a use-case where you see a crash from a mounted data volume
on an overfilled thin-pool ?

On my system - I can easily umount such a volume after all 'write' requests
have timed out (or use a thin-pool with --errorwhenfull y for instant
error reaction).
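
(For illustration - a sketch, with a made-up pool name; the lv_when_full
report field may differ on older lvm2:)

  lvchange --errorwhenfull y vg/thinpool   # fail writes immediately instead of queueing
  lvs -o name,lv_when_full vg/thinpool     # verify the setting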

So please, can you stop repeating that an overfilled thin-pool with a thin LV
data volume kills/crashes the machine - unless you open a BZ and prove
otherwise. You will surely get 'fs' corruption, but nothing like a crashing OS
can be observed on my boxes....

We are really interested here in upstream issues - not in missing bug-fix
backports into every distribution and every released version of it....
Post by Xen
He might be able to recover his system if his system is still allowed to be
logged into.
There is no problem with that as long as /rootfs has consistently working fs!

Regards

Zdenek
Xen
7 years ago
Post by Zdenek Kabelac
What's wrong with BTRFS....
I don't think you are a fan of it yourself.
Post by Zdenek Kabelac
Either you want fs & block layer tied together - that the btrfs/zfs approach
Gionatan's responses used only Block layer mechanics.
Post by Zdenek Kabelac
or you want
layered approach with separate 'fs' and block layer (dm approach)
Of course that's what I want or I wouldn't be here.
Post by Zdenek Kabelac
If you are advocating here to start mixing 'dm' with 'fs' layer, just
because you do not want to use 'btrfs' you'll probably not gain main
traction here...
You know Zdenek, it often appears to me your job here is to dissuade
people from having any wishes or wanting anything new.

But if you look a little bit further, you will see that there is a lot
more possible within the space that you define, than you think in a
black & white vision.

"There are more things in Heaven and Earth, Horatio, than is dreamt of
in your philosophy" ;-).

I am pretty sure many of the impossibilities you cite spring from a
misunderstanding of what people want, you think they want something
extreme, but it is often much more modest than that.

Although personally I would not mind communication between layers in
which providing layer (DM) communicates some stuff to using layer (FS)
but 90% of the time that is not even needed to implement what people
would like.

Also we see ext4 being optimized around 4MB block sizes right? To create
better allocation.

So that's example of "interoperation" without mixing layers.

I think Gionatan has demonstrated that with pure block layer functionality it
is possible to have a more advanced protection ability that does not need any
knowledge about filesystems.
Post by Zdenek Kabelac
We need to see EXACTLY which kind of crash do you mean.
If you are using some older kernel - then please upgrade first and
provide proper BZ case with reproducer.
Yes, apologies here, I responded to this earlier (perhaps a year
ago) and the system I was testing on had a 4.4 kernel. So I cannot
currently confirm, and it is probably already solved (you could be right).

Back then the crash was kernel messages on the TTY and then, after some 20-30
seconds, a total freeze - after I copied too much data to a (test) thin pool.

Probably irrelevant now if already fixed.
Post by Zdenek Kabelac
BTW you can imagine an out-of-space thin-pool with thin volume and
filesystem as a FS, where some writes ends with 'write-error'.
If you think there is OS system which keeps running uninterrupted,
while number of writes ends with 'error' - show them :) - maybe we
should stop working on Linux and switch to that (supposedly much
better) different OS....
I don't see why you seem to think that devices cannot be logically
separated from each other in terms of their error behaviour.

If I had a system crashing because I wrote to some USB device that was
malfunctioning, that would not be a good thing either.

I have said repeatedly that the thin volumes are data volumes. Entire
system should not come crashing down.

I am sorry if I was basing myself on older kernels in those messages,
but my experience dates from a year ago ;-).

The Linux kernel has had more issues with USB, for example, that are
unacceptable, and even Linus Torvalds himself complained about it:
queues filling up because of pending writes to a USB device and the entire
system grinding to a halt.

Unacceptable.
Post by Zdenek Kabelac
You can have different pools and you can use rootfs with thins to
easily test i.e. system upgrades....
Sure but in the past GRUB2 would not work well with thin, I was basing
myself on that...

I do not see real issue with using thin rootfs myself but grub-probe
didn't work back then and OpenSUSE/GRUB guy attested to Grub not having
thin support for that.
Post by Zdenek Kabelac
Most thin-pool users are AWARE how to properly use it ;) lvm2 tries
to minimize (data-lost) impact for misused thin-pools - but we can't
spend too much effort there....
Everyone would benefit from more effort being spent there, because it
reduces the problem space and hence the burden on all those maintainers
to provide all types of safety all the time.

EVERYONE would benefit.
Post by Zdenek Kabelac
But if you advocate for continuing system use of out-of-space
thin-pool - that I'd probably recommend start sending patches... as
an lvm2 developer I'm not seeing this as best time investment but
anyway...
Not necessarily that the system continues in full operation,
applications are allowed to crash or whatever. Just that system does not
lock up.

But you say these are old problems and now fixed...

I am fine if filesystem is told "write error".

Then filesystem tells application "write error". That's fine.

But it might be helpful if "critical volumes" can reserve space in
advance.

That is what Gionatan was saying...?

Filesystem can also do this itself but not knowing about thin layer it
has to write random blocks to achieve this.

I.e. filesystem may guess about thin layout underneath and just write 1
byte to each block it wants to allocate.
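
Crudely, something like this is what I mean (purely an illustration of the
idea, not a recommendation - it assumes a 64 KiB chunk size and a fresh
reserve file, and only works if the filesystem lays that file out
contiguously):

  # touch one byte per assumed 64 KiB chunk to force ~1 GiB of allocation
  RESERVE=/mnt/critical/.reserve
  for ((i = 0; i < 16384; i++)); do
      printf 'x' | dd of="$RESERVE" bs=1 count=1 seek=$((i * 65536)) \
                      conv=notrunc status=none
  done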

But feature could more easily be implemented by LVM -- no mixing of
layers.

So number of (unallocated) blocks are reserved for critical volume.

When the number of free blocks drops below what those volumes "need", the
system starts returning errors for the other volumes, but not for that
critical volume.

I don't see why that would be such a disturbing feature.

You just cause allocator to error earlier for non-critical volumes, and
allocator to proceed as long as possible for critical volumes.

The only thing you need is runtime awareness of the available free blocks.

You said before this is not efficiently possible.

Such awareness would be required, even if approximately, to implement
any such feature.

But Gionatan was only talking about volume creation in latest messages.
Post by Zdenek Kabelac
Post by Xen
However, from both a theoretical and practical standpoint being able
to just shut down whatever services use those data volumes -- which is
only possible
Are you aware there is just one single page cache shared for all devices
in your system ?
Well I know the kernel is badly designed in that area. I mean this was
the source of the USB problems. Torvalds advocated lowering the size of
the write buffer.

Which distributions then didn't do and his patch didn't even make it
through :p.

He said "50 MB write cache should be enough for everyone" and not 10% of
total memory ;-).
Post by Zdenek Kabelac
Again do you have use-case where you see a crash of data mounted volume
on overfilled thin-pool ?
Yes, again, old experiences.
Post by Zdenek Kabelac
On my system - I could easily umount such volume after all 'write' requests
are timeouted (eventually use thin-pool with --errorwhenfull y for
instant error reaction.
That's good, I didn't have that back then (and still don't).

These are Debian 8 / Kubuntu 16.04 systems.
Post by Zdenek Kabelac
So please can you stop repeating overfilled thin-pool with thin LV
data volume kills/crashes machine - unless you open BZ and prove
otherwise - you will surely get 'fs' corruption but nothing like
crashing OS can be observed on my boxes....
But when I talked about this a year ago you didn't seem to comprehend that I
was talking about an older system (back then not so old) or acknowledge
that these problems had (once) existed, so I also didn't know they would
already be solved by now.

Sometimes just acknowledging that problems were there before but are not
anymore makes it a lot easier.

We spoke about this topic a year ago as well, and perhaps you didn't
understand me because for you the problems were already fixed (in your
LVM).
Post by Zdenek Kabelac
We are here really interested in upstream issues - not about missing
bug fixes backports into every distribution and its every released
version....
I understand. But it's hard for me to know which is which.

These versions are in widespread use.

Compiling your own packages is also system maintenance burden etc.

So maybe our disagreement back then came from me experiencing something
that was already solved upstream (or in later kernels).
Post by Zdenek Kabelac
Post by Xen
He might be able to recover his system if his system is still allowed
to be logged into.
There is no problem with that as long as /rootfs has consistently working fs!
Well I guess it was my Debian 8 / kernel 4.4 problem then...
Zdenek Kabelac
7 years ago
You know Zdenek, it often appears to me your job here is to dissuade people
from having any wishes or wanting anything new.
But if you look a little bit further, you will see that there is a lot more
possible within the space that you define, than you think in a black & white
vision.
On block layer - there are many things black & white....

If you don't know which process 'created' a written page, nor whether you are
writing i.e. filesystem data or metadata or any other sort of 'metadata'
information, you can hardly do any 'smart' logic on the thin block level side.
Although personally I would not mind communication between layers in which
providing layer (DM) communicates some stuff to using layer (FS) but 90% of
the time that is not even needed to implement what people would like.
The philosophy with DM devices is - you can replace them online with something
else - i.e. you could have a linear LV which is turned into a 'RAID', and then
it could be turned into a 'cached RAID' and then even into a thinLV - all in
one row on a live running system.

So what filesystem should be doing in this case ?

Should it be asking complex questions of the block layer underneath - checking
current device properties - and waiting till the IO operation is processed,
before the next IO comes into the process - and repeating the same in very
synchronous, slow logic ?? Can you imagine how slow this would become ?

The main problem here is - the user typically only sees one single localized
problem - without putting it into a global context.

So of course - if you 'restrict' a device stack to some predefined fixed
state which holds 'forever' you may get far more chances to get couple things
running in some more optimal way - but that's not what lvm2 aims to support.

We are targeting 'generic' usage not a specialized case - which fits 1 user
out of 1000000 - and every other user needs something 'slightly' different....
Also we see ext4 being optimized around 4MB block sizes right? To create
better allocation
I don't think there is anything related...
Thin chunk-size ranges from 64KiB to 1GiB....
So that's example of "interoperation" without mixing layers.
The only inter-operation is that the main filesystems (like extX & XFS) are
getting fixed for better reactions to ENOSPC...
and WAY better behavior when there are 'write-errors' - surprisingly there
was a lot of faulty logic and expectations encoded in them...
I think Gionatan has demonstrated that pure block layer functionality, is
possible to have more advanced protection ability that does not need any
knowledge about filesystems.
thin-pool provides the same level of protection in terms of not letting you
create a new thinLV when the thin-pool is above a configured threshold...


And to compare apples with apples - you need to compare the performance of

ZFS with zpools against thins with thin-pools running directly on top of the device.

If zpools are 'equally' as fast as thins - and give you better protection
and more sane logic - then why is anyone still using thins???

I'd really love to see some benchmarks....

Of course if you slow down the speed of the thin-pool, add way more
synchronization points and consume 10x more memory :) you can get better
behavior in those exceptional cases which are only hit by inexperienced users
who tend to intentionally use thin-pools in an incorrect way.....
Yes apologies here, I responded to this thing earlier (perhaps a year ago) and
the systems I was testing on was 4.4 kernel. So I cannot currently confirm and
probably is already solved (could be right).
Back then the crash was kernel messages on TTY and then after some 20-30
There is by default a 60sec freeze before an unresized thin-pool starts to
reject all writes to unprovisioned space as 'error' and switches to the
out-of-space state. There is though a difference if you are out-of-space in
data or metadata - the latter one is more complex...
Post by Zdenek Kabelac
If you think there is OS system which keeps running uninterrupted,
while number of writes ends with 'error'  - show them :)  - maybe we
should stop working on Linux and switch to that (supposedly much
better) different OS....
I don't see why you seem to think that devices cannot be logically separated
from each other in terms of their error behaviour.
In the page cache there is nothing logically separated - you have 'dirty'
pages you need to write somewhere - and if your writes lead to errors,
and the system reads errors back instead of real data - and your executing
code starts to run on a completely unpredictable data-set - well, a 'clean'
reboot is still a very nice outcome IMHO....
If I had a system crashing because I wrote to some USB device that was
malfunctioning, that would not be a good thing either.
Well, try to BOOT from USB :) and detach it and then compare...
Mounting user data and running user-space tools off USB is not comparable...
Linux kernel has had more issues with USB for example that are unacceptable,
and even Linus Torvalds himself complained about it. Queues filling up because
of pending writes to USB device and entire system grinds to a halt.
Unacceptable.
AFAIK - this is still not resolved issue...
Post by Zdenek Kabelac
You can have different pools and you can use rootfs  with thins to
easily test i.e. system upgrades....
Sure but in the past GRUB2 would not work well with thin, I was basing myself
on that...
/boot cannot be on thin

/rootfs is not a problem - there will be even some great enhancement for Grub
to support this more easily and switching between various snapshots...
Post by Zdenek Kabelac
Most thin-pool users are AWARE how to properly use it ;)  lvm2 tries
to minimize (data-lost) impact for misused thin-pools - but we can't
spend too much effort there....
Everyone would benefit from more effort being spent there, because it reduces
the problem space and hence the burden on all those maintainers to provide all
types of safety all the time.
EVERYONE would benefit.
Fortunately most users NEVER need it ;)
Since they properly operate the thin-pool and understand its weak points....
Not necessarily that the system continues in full operation, applications are
allowed to crash or whatever. Just that system does not lock up.
When you get bad data from your block device - your system's reaction is
unpredictable - if your /rootfs cannot store its metadata - the most sane
behavior is to stop - all other solutions are so complex and complicated that
the resources spent avoiding this state are the better-spent effort...

Lvm2 ensures block layer behavior is sane - but cannot be held responsible
that all layers above are 'sane' as well...

If you hit 'fs' bug - report the issue to fs maintainer.
If you experience user-space faulty app - solve the issue there.
Then filesystem tells application "write error". That's fine.
But it might be helpful if "critical volumes" can reserve space in advance.
Once again - USE different pool - solve problems at proper level....
Do not over-provision critical volumes...
I.e. filesystem may guess about thin layout underneath and just write 1 byte
to each block it wants to allocate.
:) so how do you resolve error paths - i.e. how do you restore space
you have not actually used....
There are so many problems with this you can't even imagine...
Yeah - we've spent quite some time in past analyzing those paths....
So number of (unallocated) blocks are reserved for critical volume.
Please finally stop thinking about some 'reserved' storage for critical
volume. It leads to nowhere....
When number drops below "needed" free blocks for those volumes, system starts
returning errors for volumes not that critical volume.
Do the right action at right place.

For critical volumes use non-overprovisioned pools - there is nothing better
you can do - seriously!

For other cases - resolve the issue at userspace when dmeventd calls you...
I don't see why that would be such a disturbing feature.
Maybe start to understand how kernel works in practice ;)

Otherwise you spend your life boring developers with ideas which simply cannot
work...
You just cause allocator to error earlier for non-critical volumes, and
allocator to proceed as long as possible for critical volumes.
So use 2 different POOLS, problem solved....

You need to focus on simple solution for a problem instead of exponentially
over-complicating 'bad' solution....
We spoke about this topic a year ago as well, and perhaps you didn't
understand me because for you the problems were already fixed (in your LVM).
As said - if you see a problem/bug - open a BZ case - so it'd be analyzed -
instead of spreading FUD on the mailing list, where no one tells us which
version of lvm2 and which kernel version is used - we are just informed it's
crashing and unusable...
Post by Zdenek Kabelac
We are here really interested in upstream issues - not about missing
bug fixes  backports into every distribution  and its every released
version....
I understand. But it's hard for me to know which is which.
These versions are in widespread use.
Compiling your own packages is also system maintenance burden etc.
Well it's always about checking 'upstream' first and then bothering your
upstream maintainer...

Eventually switching to distribution with better support in case your existing
one has 'nearly' zero reaction....
So maybe our disagreement back then came from me experiencing something that
was already solved upstream (or in later kernels).
Yes - we are always interested in upstream problem.

We really cannot be solving problems of every possible deployed combination of
software.


Regards

Zdenek
Xen
7 years ago
Post by Zdenek Kabelac
On block layer - there are many things black & white....
If you don't know which process 'create' written page, nor if you write
i.e. filesystem data or metadata or any other sort of 'metadata' information,
you can hardly do any 'smartness' logic on thin block level side.
You can give any example to say that something is black and white
somewhere, but I made a general point there, nothing specific.
Post by Zdenek Kabelac
The philosophy with DM device is - you can replace then online with
something else - i.e. you could have a linear LV which is turned to
'RAID" and than it could be turned to 'Cache RAID' and then even to
thinLV - all in one raw
on life running system.
I know.
Post by Zdenek Kabelac
So what filesystem should be doing in this case ?
I believe in most of these systems you cite the default extent size is
still 4MB, or am I mistaken?
Post by Zdenek Kabelac
Should be doing complex question of block-layer underneath - checking
current device properties - and waiting till the IO operation is
processed - before next IO comes in the process - and repeat the
some in very synchronous
slow logic ?? Can you imagine how slow this would become ?
You mean a synchronous way of checking available space in thin volume by
thin pool manager?
Post by Zdenek Kabelac
We are targeting 'generic' usage not a specialized case - which fits 1
user out of 1000000 - and every other user needs something 'slightly'
different....
That is a complete exaggeration.

I think you will find this issue comes up often enough to think that it
is not one out of 1000000 and besides unless performance considerations
are at the heart of your ...reluctance ;-) no one stands to lose
anything.

So only question is design limitations or architectural considerations
(performance), not whether it is a wanted feature or not (it is).
Post by Zdenek Kabelac
I don't think there is anything related...
Thin chunk-size ranges from 64KiB to 1GiB....
Isn't thin allocation done in extent-sized chunks by default?
Post by Zdenek Kabelac
The only inter-operation is the main filesystem (like extX & XFS) are
getting fixed for better reactions for ENOSPC...
and WAY better behavior when there are 'write-errors' - surprisingly
there were numerous faulty logic and expectation encoded in them...
Well that's good right. But I did read here earlier about work between
ExtFS team and LVM team to improve allocation characteristics to better
align with underlying block boundaries.
Post by Zdenek Kabelac
If zpools - are 'equally' fast as thins - and gives you better protection,
and more sane logic the why is still anyone using thins???
I don't know. I don't like ZFS. Precisely because it is a 'monolith'
system that aims to be everything. Makes it more complex and harder to
understand, harder to get into, etc.
Post by Zdenek Kabelac
Of course if you slow down speed of thin-pool and add way more
synchronization points and consume 10x more memory :) you can get
better behavior in those exceptional cases which are only hit by
unexperienced users who tends to intentionally use thin-pools in
incorrect way.....
I'm glad you like us ;-).
...
I can't say whether it was that or not. I am pretty sure the entire
system froze for longer than 60 seconds.
Post by Zdenek Kabelac
In page cache there are no thing logically separated - you have 'dirty' pages
you need to write somewhere - and if you writes leads to errors,
and system reads errors back instead of real-data - and your execution
code start to run on completely unpredictable data-set - well 'clean'
reboot is still very nice outcome IMHO....
Well, even if that means some dirty pages are lost before the application
discovers it, any read or write errors should at some point lead the
application to shut down, right?

I think for most applications the most sane behaviour would simply be to
shut down.

Unless there is more sophisticated error handling.

I am not sure what we are arguing about at this point.

Application needs to go anyway.
Post by Zdenek Kabelac
Post by Xen
If I had a system crashing because I wrote to some USB device that was
malfunctioning, that would not be a good thing either.
Well try to BOOT from USB :) and detach and then compare...
Mounting user data and running user-space tools out of USB is
uncomparable...
Systems would also grind to a halt from user-data and not system files.

I know booting from USB can be 1000x slower than user data.

But shared page cache for all devices is bad design, period.
Post by Zdenek Kabelac
AFAIK - this is still not resolved issue...
That's a shame.
Post by Zdenek Kabelac
Post by Xen
Post by Zdenek Kabelac
You can have different pools and you can use rootfs  with thins to
easily test i.e. system upgrades....
Sure but in the past GRUB2 would not work well with thin, I was basing
myself on that...
/boot cannot be on thin
/rootfs is not a problem - there will be even some great enhancement for Grub
to support this more easily and switching between various snapshots...
That's great, like with BTRFS I guess that this is possible?

But /rootfs was a problem. Grub-probe reported that it could not find
the rootfs.

When I ran with custom grub config it worked fine. It was only
grub-probe that failed, nothing else (Kubuntu 16.04).
Post by Zdenek Kabelac
Post by Xen
EVERYONE would benefit.
Fortunately most users NEVER need it ;)
You're wrong. The assurance of a system not crashing (for instance) or
some sane behaviour in case of fill-up, will put many minds at ease.
Post by Zdenek Kabelac
Since they properly operate thin-pool and understand it's weak
points....
Yes they are all superhumans right.

I am sorry for being so inferior ;-).
Post by Zdenek Kabelac
Post by Xen
Not necessarily that the system continues in full operation,
applications are allowed to crash or whatever. Just that system does
not lock up.
When you get bad data from your block device - your system's reaction
is unpredictable - if your /rootfs cannot store its metadata - the
most sane behavior is to stop - all other solutions are so complex and
complicated, that spending resources to avoid hitting this state are
way better spent effort...
About rootfs, I agree.

But the nominal distinction was between thin-as-system and thin-as-data.

If you say that thin-as-data is specific use case that cannot be
tailored for, that is a bit odd. It is still 90% of use.
Post by Zdenek Kabelac
Once again - USE different pool - solve problems at proper level....
Do not over-provision critical volumes...
Again what we want is a valid use case and a valid request.

If the system is designed so badly (or designed in such a way) that it
cannot be achieved, that does not immediately make it a bad wish.

For example if a problem is caused by the page-cache of the kernel being
for all block devices at once, then anyone wanting something that is
impossible because of that system...

...does not make that person bad for wanting it.

It makes the kernel bad for not achieving it.


I am sure your programmers are good enough to achieve asynchronous
state-updating for a thin-pool that does not interfere with allocation,
to the extent that it would lazily update stats; at that point allocation
constraints might be basing themselves on older data (maybe seconds old),
but that still doesn't mean it is useless.

It doesn't have to be perfect.

If my "critical volume" wants 1000 free extents, but it only has 988,
that is not so great a problem.

Of course, I know, I hear you say "Use a different pool".

The whole idea for thin is resource efficiency.

There is no real reason that this "space reservation" can't happen.

Even if there are current design limitations, they might be there for a good
reason - you are the arbiter on that.

Maybe it cannot be perfect, or it has to happen asynchronously.

It is better if non-critical volume starts failing than critical volume.

Failure is imminent, but we can choose which fails first.




I mean your argument is no different from.

"We need better man pages."

"REAL system administrators can use current man pages just fine."

"But any improvement would also benefit them, no need for them to do
hard stuff when it can be easier."

"Since REAL system administrators can do their job as it is, our
priorities lie elsewhere."

It's a stupid argument.

Any investment in user friendliness pays off for everyone.

Linux is often so impossible to use because no one makes that
investment, even though it would have immeasurable benefits for
everyone.

And then when someone does make the effort (e.g. makefile that displays
help screen when run with no arguments) someone complains that it breaks
the contract that "make" should start compiling instantly, thus using
"status quo" as a way to never improve anything.

In this case, a make "help screen" can save people literally hours of
time, multiplied by at least 1000 people.
Post by Zdenek Kabelac
Post by Xen
I.e. filesystem may guess about thin layout underneath and just write
1 byte to each block it wants to allocate.
:) so how do you resolve error paths - i.e. how do you restore space
you have not actually used....
There are so many problems with this you can't even imagine...
Yeah - we've spent quite some time in past analyzing those paths....
In this case it seems that if this is possible for regular files (and
directories in that sense) it should also be possible for "magic" files
and directories that only exist to allocate some space somewhere. In any
case it is FS issue, not LVM.

Besides, you only strengthen my argument that it isn't FS that should be
doing it.
Post by Zdenek Kabelac
Please finally stop thinking about some 'reserved' storage for
critical volume. It leads to nowhere....
It leads to you trying to convince me it isn't possible.

But no matter how much you try to dissuade, it is still an acceptable
use case and desire.
Post by Zdenek Kabelac
Do the right action at right place.
For critical volume use non-overprovisiong pools - there is nothing
better you can do - seriously!
For Gionatan's use case the problem was the poor performance of a
non-overprovisioning setup.
Post by Zdenek Kabelac
Maybe start to understand how kernel works in practice ;)
Or how it doesn't work ;-).

Like,

I will give stupid example.

Suppose using a pen is illegal.

Now lots of people want to use pen, but they end up in jail.

Now you say "Wanting to use pen is bad desire, because of consequences".

But it's pretty clear the desire won't go away.

And the real solution needs to be had at changing the law.


In this case, people really want something and for good reasons. If
there are structural reasons that it cannot be achieved, that is just
that.

That doesn't mean the desires are bad.



You can forever keep saying "Do this instead" but that still doesn't
ever make the prime desires bad.

"Don't use a pen, use a pencil. Problem solved."

Doesn't make wanting to use a pen a bad desire, nor does it make wanting
some safe space in provisioning a bad desire ;-).
Post by Zdenek Kabelac
Otherwise you spend you live boring developers with ideas which simply
cannot work...
Or maybe changing their mind, who knows ;-).
Post by Zdenek Kabelac
So use 2 different POOLS, problem solved....
Was not possible for Gionatan's use case.

Myself I do not use critical volume, but I can imagine still wanting
some space efficiency even when "criticalness" from one volume to the
next differs.




It is proper desire Zdenek. Even if LVM can't do it.
Post by Zdenek Kabelac
Well it's always about checking 'upstream' first and then bothering
your upstream maintainer...
If you knew about the pre-existing problems, you could have informed me.

In fact it has happened that you said something cannot be done, and then
someone else said "Yes, this has been a problem, we have been working on
it and problems should be resolved now in this version".



You spend most of your time denying that something is wrong.

And then someone else says "Yes, this has been an issue, it is resolved
now".

If you communicate more clearly then you also have less people bugging
you.
Post by Zdenek Kabelac
We really cannot be solving problems of every possible deployed
combination of software.
The issue is more that at some point this was the main released version.

Main released kernel and main released LVM, in a certain sense.


Some of your colleagues are a little more forthcoming with
acknowledgements that something has been failing.

This would considerably cut down the amount of time you spend being
"bored" because you try to fight people who are trying to tell you
something.

If you say "Oh yes, I think you mean this and that, yes that's a problem
and we are working on it" or "Yes, that was the case before, this
version fixes that" then


these long discussions also do not need to happen.

But you almost never say "Yes it's a problem", Zdenek.

That's why we always have these debates ;-).
Gionatan Danti
7 years ago
Permalink
...
Having benchmarked them, I can reply :)

ZFS/ZVOLs surely are slower than thinp, full stop.
However, they are not *massively* slower.

To tell the truth, what somewhat slows me down on ZFS adoption is its limited
integration with the Linux kernel. For example:
- cache duplication (ARC + pagecache)
- slow reclaim of memory used for caching
- SPL (Solaris Porting Layer)
- dependency on a 3rd-party module
...

Thinp is great tech - and I am already using it. I did not start this
thread as ZFS vs Thinp, really. Rather, I would like to understand how
to better use thinp, and I drew a parallel with ZVOLs, nothing more.

Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
7 years ago
Permalink
...
Users interested in thin-provisioning are really mostly interested in
performance - especially on multicore machines with lots of fast storage with
high IOPS throughput (some of them even expect it should be at least as good
as linear....)

So ATM it's preferred to have a more complex 'corner case' - which mostly
never happens when the thin-pool is operated properly - and in the remaining
cases you don't pay a higher price for having all data always in sync, and you
also get a much lower memory footprint
(I think ZFS especially is well known for its nontrivial memory resource consumption)

As has been pointed out already a few times in this thread - lots of those
'reserved space' ideas can already be handled easily by somewhat more advanced
scripting around the notifications from dmeventd - if you keep thinking about it
for a while, you will at some point see the reasoning.

There is no difference whether you start to solve the problem at around 70%
fullness or at 100% - the main difference is that with some free space left in
the thin-pool you can resolve the problem far more easily and correctly.

Repeated again - whoever targets 100% full thin-pool usage has
misunderstood the purpose of thin-provisioning.....

Regards

Zdenek
Xen
7 years ago
Permalink
Post by Zdenek Kabelac
Users interested in thin-provisioning are really mostly interested in
performance - especially on multicore machines with lots of fast
storage with high IOPS throughput (some of them even expect it should
be at least as good as linear....)
Why don't you hold a survey?

And not phrase it in terms of "Would you like to sacrifice performance
for more safety?"

But please.

Ask people:

1) What area does the LVM team need to focus on for thin provisioning:

a) Performance and keeping performance intact
b) Safety and providing good safeguards against human and program error
c) User interface and command line tools
d) Monitoring and reporting software and systems
e) Graphical user interfaces
f) Integration into default distributions and support for booting/grub

And then allow people to score these things with a percentage or to
distribute some 20 points across these 6 points.

Invent more points as needed.

Give people 20 points to distribute across some 8 areas of interest.

Then ask people what areas are most interesting to them.

So topics could be:
(a) Performance (b) Robustness (c) Command line user interface (d)
Monitoring systems (e) Graphical user interface (f) Distribution support

So ask people. Don't assume.

(NetworkManager team did this pretty well by the way. They were really
interested in user perception some time ago).
Post by Zdenek Kabelac
if you will keep thinking for a while you will at some point see the
reasoning.
Only if your reasoning is correct. Not if your reasoning is wrong.

I could also say to you, we could also say to you "If you think longer
on this you will see we are right". That would probably be more accurate
even.
Post by Zdenek Kabelac
Repeated again - whoever targets for 100% full thin-pool usage has
misunderstood purpose of thin-provisioning.....
Again, no one "targets" for 100% full. It is just an eventuality we need
to take care of.

You design for failure.

A nuclear plant that did not take account of operator drunkenness, and had
no safety measures in place to ensure that it would not lead to
catastrophe, would be a very bad nuclear plant.

Human error can be calculated into the design. In fact, it must.

DESIGN FOR HUMAN WEAKNESS.

NOT EVERYONE IS PERFECT and human faults happen.

If I was a customer and I was paying your bills, you would never respond
like this.

We like some assurance that things do not descend into immediate mayhem the
moment someone somewhere slacks off and falls asleep.

We like to design in advance so we do not have to keep a constant eye
out.

We build "structure" so that the structure works for us, and not
constant vigilance.

Constant vigilance can fail. Structure cannot.

Focus on "being" not "doing".
Zdenek Kabelac
7 years ago
Permalink
...
Thin-pool IS designed for failure - who said it isn't ?

It has very mature protection against data corruption.

It's just not getting overcomplicated in-kernel - the solution is left to
user-space - that has been the clear design of 'dm' for decades...
Post by Xen
If I was a customer and I was paying your bills, you would never respond like
this.
We are very nice to customers who pay our bills....
Post by Xen
We like to design in advance so we do not have to keep a constant eye out.
Please if you can show the case where the current upstream thinLV fails and
you lose your data - we can finally start to fix something.

I'm still unsure what problem you want to get resolved by the pretty small group
of people around dm/lvm2 - do you want us to rework the kernel page-cache ?

I'm simply still confused about what kind of action you expect...

Be specific, with a real-world example.


Regards


Zdenek
Xen
7 years ago
Permalink
Post by Zdenek Kabelac
Please if you can show the case where the current upstream thinLV
fails and you lose your data - we can finally start to fix something.
Hum, I can only say "I owe you one" on this.

I mean to say it will have to wait, but I hope to get to this at some
point.
Post by Zdenek Kabelac
I'm still unsure what problem you want to get resolved from pretty
small group of people around dm/lvm2 - do you want from us to rework
kernel page-cache ?
I'm simply still confused what kind action you expect...
Be specific with real world example.
I think Brassow Jonathan's idea is very good to begin with (thank you
sir ;-)).

I get that you say a kernel-space solution is impossible to implement
(apart from not crashing the system, and I get that you say that this is
no longer the case) because checking several things would prolong
execution paths considerably.

And I realize that any such thing would need asynchronous checking and
updating some values and then execution paths that need to check for
such things, which I guess could indeed be rather expensive to actually
execute.

I mean the only real kernel experience I have was trying to dabble with
filename_lookup and path_lookupat or whatever it was called. I mean
inode path lookups, which is a bit of the same thing. And indeed even a
single extra check would have incurred a performance overhead.

I mean the code to begin with differentiated between fast lookup and
slow lookup and all of that.

And particularly the fast lookup was not something you'd want to mess
with, etc.

But I want to say that I absolutely have no issue with asynchronous
'intervention', even if it is not byte-accurate, as you say in the other
email.

And I get that you prefer user-space tools doing the thing...

And you say there that this information is hard to mine.

And that the "thin_ls" tool does that.

It's just that I don't want it to be 'random', depending on each particular
sysadmin doing the right thing in isolation from all the other sysadmins,
each of them having to write the same code on their own.

At the very least if you recognise your responsibility, which you are
doing now, we can have a bit of a framework that is delivered by
upstream LVM so the thing comes out more "fully fleshed" and sysadmins
have less work to do, even if they still have to customize the scripts
or anything.

Most ideal thing would definitely be something you "set up" and then the
thing takes care of itself, ie. you only have to input some values and
constraints.

But intervention in forms of "fsfreeze" or whatever is very personal, I
get that.

And I get that previously auto-unmounting also did not really solve
issues for everyone.

So a general interventionist policy that is going to work for everyone
is hard to get.

So the only thing that could work for everyone is if there is actually a
block on new allocations. If that is not possible, then indeed I agree
that a "one size fits all" approach is hardly possible.

Intervention is system-specific.

Regardless at least it should be easy to ensure that some constraints
are enforced, that's all I'm asking.

Regards, (I'll respond further in the other email).
Gionatan Danti
7 years ago
Permalink
Post by Zdenek Kabelac
What's wrong with BTRFS....
Either you want fs & block layer tied together - that the btrfs/zfs approach
BTRFS really has a ton of performance problems - please, don't recommend
it for anything IO-intensive (such as virtual machines and databases).

Moreover, Red Hat has now officially deprecated it...
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Brassow Jonathan
7 years ago
Permalink
Hi,

I’m the manager of the LVM/DM team here at Red Hat. Let me thank those of you who have taken the time to share how we might improve LVM thin-provisioning. We really do appreciate it and your ideas are welcome.

I see merit in the ideas you’ve presented and if I’ve got it right, there are two main ones:
1) don’t allow creation of new thinLVs or snapshots in a pool that is beyond a certain threshold
2) allow users to reserve some space for critical volumes when a threshold is reached

I believe that #1 is already handled, are you looking for anything else?

#2 doesn’t seem crazy hard to implement - even in script form. In RHEL7.4 (upstream = "Version 2.02.169 - 28th March 2017”), we introduced the lvm.conf:dmeventd/thin_command setting. You can run anything you want through a script. Right now, it is set to do lvextend in an attempt to add more space to a filling thin-pool. However, you don’t need to be so limited. I imagine the following:
- Add a “critical” tag to all thinLVs that are very important:
# lvchange --addtag critical vg/thinLV
- Create script that is called by thin_command, it should:
- check if a threshold is reached (i.e. your reserved space) and if so,
- report all lvs associated with the thin-pool that are NOT critical:
# lvs -o name --noheadings --select 'lv_tags!=critical && pool_lv=thin-pool' vg
- run <command> on those non-critical volumes, where <command> could be:
# fsfreeze <mnt_point>

The above should have the result you want - essentially locking out all non-critical file systems. The admin can easily turn them back on via fsfreeze one-by-one as they resolve the critical lack of space. If you find this too heavy-handed, perhaps try something else for <command> instead first.
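For illustration only, a rough first cut of such a script could look like this (untested sketch; the script path, the VG/pool names and the mount-point lookup are placeholders you would adapt to your own setup):

#!/bin/bash
# hypothetical handler, wired up via lvm.conf: dmeventd/thin_command = "/usr/local/sbin/thin_protect.sh"
VG=vg              # placeholder volume group name
POOL=thin-pool     # placeholder thin-pool name
# freeze every thinLV in the pool that is NOT tagged 'critical'
lvs -o name --noheadings --select "lv_tags!=critical && pool_lv=$POOL" "$VG" |
while read -r lv; do
    mnt=$(findmnt -n -o TARGET "/dev/$VG/$lv")   # mount point of the LV, if mounted
    [ -n "$mnt" ] && fsfreeze --freeze "$mnt"
done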

If the above is sufficient, then great. If you’d like to see something like this added to the LVM repo, then you can simply reply here with ‘yes’ and maybe provide a sentence of what the scenario is that it would solve. (I know there are already some listed in this thread, but I’m wondering about those folks that think the script is insufficient and believe this should be more standard.)


Thanks,
brassow
Gionatan Danti
7 years ago
Permalink
Hi Jonathan,
...
Yeah, this is covered by the appropriate use of
snapshot_autoextend_percent. I did not realize that; thanks to Zdenek
for pointing me in the right direction.
...
Very good suggestion. Actually, fsfreeze should work without too much
drama.
Post by Brassow Jonathan
If the above is sufficient, then great. If you’d like to see
something like this added to the LVM repo, then you can simply reply
here with ‘yes’ and maybe provide a sentence of what the scenario is
that it would solve. (I know there are already some listed in this
thread, but I’m wondering about those folks that think the script is
insufficient and believe this should be more standard.)
Yes, surely.

The combination of #1 and #2 should give the desired outcome (I quickly
tested it and I found no evident problems).

Jonathan, Zdeneck, thanks you very much.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
7 years ago
Permalink
# fsfreeze <mnt_point>
Post by Brassow Jonathan
The above should have the result you want - essentially locking out
all non-critical file systems.  The admin can easily turn them back on
via fsfreeze one-by-one as they resolve the critical lack of space.
If you find this too heavy-handed, perhaps try something else for
<command> instead first.
Very good suggestion. Actually, fsfreeze should work without too much drama.
Think about this case:

An original volume with a number of regularly taken snapshots.

If you ONLY use 'read-only' snaps - there is not much to do - writing to
the origin gives you a quite 'precise' estimation of how much data is in progress
(seeing the amount of dirty pages....)

However, when all the other snapshots (i.e. VM machines) are in use and also
have writable data in progress - invoking the 'fsfreeze' operation puts an
unpredictable amount of provisioning in front of you (all your dirty pages
need to be committed to your disk first)...

So you can easily 'freeze' yourself in 'fsfreeze'.

lvm2 has got much smarter over the last year - and avoids e.g. flushing when
it's querying the used 'data space', with 2 consequences:

a) it prevents a 'deadlock' when suspending with flushing (while holding the lvm2 VG
lock - which was a really bad problem.... as you could not run
'lvextend' for the thin-pool in such a case to rescue the situation (i.e. you still have
free space in the VG - or could even extend your VG...))

b) it gives you some 'historical/imprecise/async' runtime data of thin-pool fullness

So you can start to see that making a 'perfect' decision with historical
data is not an easy task...


Regards


Zdenek
Xen
7 years ago
Permalink
Post by Brassow Jonathan
I’m the manager of the LVM/DM team here at Red Hat.
Thank you for responding.
Post by Brassow Jonathan
2) allow users to reserve some space for critical volumes when a threshold is reached
#2 doesn’t seem crazy hard to implement - even in script form.
# lvchange --addtag critical vg/thinLV
- check if a threshold is reached (i.e. your reserved space) and if so,
# lvs -o name --noheadings --select 'lv_tags!=critical &&
pool_lv=thin-pool' vg
# fsfreeze <mnt_point>
I think the above is exactly (or almost exactly) in agreement with the
general idea yes.

It uses filesystem tool to achieve it instead of allocation blocking (so
filesystem level, not DM level).

But if it does the same thing that is more important than having
'perfect' solution.

The issue with scripts is that they feel rather vulnerable to
corruption, not being there etc.

So in that case I suppose that you would want some default, shipped
scripts that come with LVM as example for default behaviour and that are
also activated by default?

So a fixed location in the FHS for those scripts, where the user can find them
and can install new ones.

Something similar to /etc/initramfs-tools/ (on Debian), so maybe
/etc/lvm/scripts/ and /usr/share/lvm/scripts/ or similar.

Also easy to adjust by each distribution if they wanted to.

If no one uses the critical tag -- nothing happens; but if they do use it,
check the unallocated space on the critical volumes and sum it up to arrive at
a threshold value?

Then not even a threshold value needs to be configured.
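A rough sketch of that summing (untested; it assumes the 'critical' tag convention from above and a VG called vg, and that only thin volumes carry the tag):

# sum the unallocated space (virtual size minus mapped space) of all 'critical' LVs
lvs --noheadings --units b --nosuffix -o lv_size,data_percent \
    --select 'lv_tags=critical' vg |
awk '{ reserve += $1 * (100 - $2) / 100 }
     END { printf "keep at least %.0f bytes free in the pool\n", reserve }'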
Post by Brassow Jonathan
If the above is sufficient, then great. If you’d like to see
something like this added to the LVM repo, then you can simply reply
here with ‘yes’ and maybe provide a sentence of what the scenario is
that it would solve.
Yes. One obvious scenario is root on thin.

It's pretty mandatory for root on thin.


There is something else though.

You cannot set max size for thin snapshots?

This is part of the problem: you cannot calculate in advance what can
happen, because by design, mayhem should not ensue, but what if your
predictions are off?

Being able to set a maximum snapshot size before it gets dropped could
be very nice.

This behaviour is very safe on non-thin.

It is inherently risky on thin.
Post by Brassow Jonathan
(I know there are already some listed in this
thread, but I’m wondering about those folks that think the script is
insufficient and believe this should be more standard.)
You really want to be able to set some minimum free space you want per
volume.

Suppose I have three volumes of 10GB, 20GB and 3GB.

I may want the 20GB volume to be least important. The 3GB volume most
important. The 10GB volume in between.

I want at least 100MB free on 3GB volume.

When free space on the thin pool drops below ~120MB, I want the 20GB volume
and the 10GB volume to be frozen: no new extents for that combined 30GB of volumes.

I want at least 500MB free on 10GB volume.

When free space on thin pool drops below ~520MB, I want the 20GB volume
to be frozen, no new extents for 20GB volume.


So I would get 2 thresholds and actions:

- threshold for 3GB volume causing all others to be frozen
- threshold for 10GB volume causing 20GB volume to be frozen

This is easily scriptable and custom thing.

But it would be nice if you could set this threshold in LVM per volume?

So the script can read it out?

100MB of 3GB = 3.3%
500MB of 10GB = 5%

3-5% of mandatory free space could be a good default value.

So the default script could also provide a 'skeleton' for reading the
'critical' tag and then calculating a default % of space that needs to
be free.
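Something like this could store and read such a per-volume percentage via tags (untested sketch; the reserve:5 naming is just a convention I am making up here, not an existing LVM feature - if your lvm version does not allow ':' in tags, something like reserve_5 works the same way):

# store a reserve percentage on a volume
lvchange --addtag reserve:5 vg/smallvol
# read it back for every volume in the VG
lvs --noheadings -o lv_name,lv_tags vg | while read -r name tags; do
    pct=$(printf '%s\n' "$tags" | tr ',' '\n' | sed -n 's/^reserve://p')
    [ -n "$pct" ] && echo "$name: keep ${pct}% of its size free in the pool"
done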

In this case there is a hierarchy:

3GB > 10GB > 20GB.

Any 'critical volume' could cause all others 'beneath it' to be frozen.


But the most important thing is to freeze or drop snapshots I think.

And to ensure that this is default behaviour?

Or at least provide skeletons for responding to thin threshold values
being reached so that the burden on the administrator is very minimal.
Zdenek Kabelac
7 years ago
Permalink
Post by Xen
There is something else though.
You cannot set max size for thin snapshots?
We are moving here in right direction.

Yes - current thin-provisioning does not let you limit the maximum number of blocks
an individual thinLV can address (and a snapshot is an ordinary thinLV)

Every thinLV can address exactly LVsize/ChunkSize blocks at most.
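(Just as arithmetic: a 10GiB thinLV with the default 64KiB chunk size can address
at most 10GiB / 64KiB = 163840 chunks, no matter what happens elsewhere in the pool.)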
Post by Xen
This is part of the problem: you cannot calculate in advance what can happen,
because by design, mayhem should not ensue, but what if your predictions are off?
Great - 'prediction' - we are getting on the same page - prediction is a big
problem....
Post by Xen
Being able to set a maximum snapshot size before it gets dropped could be very
nice.
You can't do that IN KERNEL.

The only tool which is able to calculate real occupancy is the user-space
thin_ls tool.

So all you need to do is to use the tool in user-space for this task.
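From memory, something along these lines (untested sketch - check the man page for
the exact options; the vg/pool device names are placeholders depending on your naming):

# reserve a metadata snapshot so thin_ls can safely read the metadata of a live pool
dmsetup message /dev/mapper/vg-pool-tpool 0 reserve_metadata_snap
thin_ls --metadata-snap /dev/mapper/vg-pool_tmeta
dmsetup message /dev/mapper/vg-pool-tpool 0 release_metadata_snap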
...
This is the main issue - these 'data' are pretty expensive to 'mine' out of
data structures.

That's the reason why thin-pool is so fast and memory-efficient inside the
kernel - because it does not need to know all those details about how much data
a thinLV eats from the thin-pool - the kernel target simply does not care - it only cares
about referenced chunks

It's the user-space utility which is able to 'parse' all the structure
and take a 'global' picture. But of course it takes CPU and TIME and it's not
'byte accurate' - that's why you need to start acting early, at some threshold.
Post by Xen
But the most important thing is to freeze or drop snapshots I think.
And to ensure that this is default behaviour?
Why do you think this should be the default ?

The default is to auto-extend thin-data & thin-metadata when needed, if you set
the threshold below 100%.

We can discuss whether it's a good idea to enable auto-extending by default - as we
don't know if the free space in the VG is meant to be used for the thin-pool or whether
the admin has some other plan for it...
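For reference, the relevant lvm.conf knobs live in the activation section (the values
shown are just the common documented example, not a recommendation):

activation {
    # when pool usage crosses 70%, autoextend the pool by 20% of its size
    thin_pool_autoextend_threshold = 70
    thin_pool_autoextend_percent = 20
}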


Regards

Zdenek
Xen
7 years ago
Permalink
Post by Zdenek Kabelac
We are moving here in right direction.
Yes - current thin-provisioning does not let you limit the maximum number of
blocks an individual thinLV can address (and a snapshot is an ordinary thinLV)
Every thinLV can address exactly LVsize/ChunkSize blocks at most.
So basically the only options are an allocation check with asynchronously
derived intel that might be a few seconds late, as a way to execute some
standard and general "prioritizing" policy, and an interventionist
policy that will (fs)freeze certain volumes depending on admin knowledge
about what needs to happen in his/her particular instance.
Post by Zdenek Kabelac
Post by Xen
This is part of the problem: you cannot calculate in advance what can
happen, because by design, mayhem should not ensue, but what if your
predictions are off?
Great - 'prediction' - we getting on the same page - prediction is
big problem....
Yes I mean my own 'system' I generally of course know how much data is
on it and there is no automatic data generation.

Matthew Patton referenced quotas in some email, I didn't know how to do
it as quickly when I needed it so I created a loopback mount from a
fixed sized container to 'solve' that issue when I did have an
unpredictable data source... :p.

But I do create snapshots (which I do every day); when the root and
boot snapshots fill up (they are on regular lvm) they get dropped, which
is nice. But particularly for the big data volume, if I really were to move a
lot of data around I might need to first get rid of the snapshots or
else I don't know what will happen or when.

Also my system (yes I am an "outdated moron") does not have the thin_ls tool
yet, so when I was last active here and you mentioned that tool (thank
you for that, again) I created this little script that would also give me
info:

$ sudo ./thin_size_report.sh
[sudo] password for xen:
Executing self on linux/thin
Individual invocation for linux/thin

name pct size
---------------------------------
data 54.34% 21.69g
sites 4.60% 1.83g
home 6.05% 2.41g
--------------------------------- +
volumes 64.99% 25.95g
snapshots 0.09% 24.00m
--------------------------------- +
used 65.08% 25.97g
available 34.92% 13.94g
--------------------------------- +
pool size 100.00% 39.91g

The above "sizes" are not volume sizes but usage amounts.

And the % are % of total pool size.

So you can see I have 1/3 available on this 'overprovisioned' thin pool
;-).
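If anyone wants roughly the same numbers without a full script, something like
this gets close (untested; it just multiplies each thinLV's virtual size by its
data_percent, for the linux/thin pool from above):

lvs --noheadings --units g --nosuffix -o lv_name,lv_size,data_percent \
    --select 'pool_lv=thin' linux |
awk '{ used = $2 * $3 / 100; total += used; printf "%-10s %8.2fg\n", $1, used }
     END { printf "%-10s %8.2fg\n", "volumes", total }'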


But anyway.
Post by Zdenek Kabelac
Post by Xen
Being able to set a maximum snapshot size before it gets dropped could
be very nice.
You can't do that IN KERNEL.
The only tool which is able to calculate real occupancy - is
user-space thin_ls tool.
Yes my tool just aggregated data from "lvs" invocations to calculate the
numbers.

If you say that any additional allocation checks would be infeasible
because it would take too much time per request (which still seems odd
because the checks wouldn't be that computation intensive and even for
100 gigabyte you'd only have 25.000 checks at default extent size) -- of
course you asynchronously collect the data.

So I don't know if it would be *that* slow provided you collect the data
in the background and not while allocating.

I am also pretty confident that if you did make a policy it would turn
out pretty good.

I mean I generally like the designs of the LVM team.

I think they are some of the most pleasant command line tools anyway...

But anyway.

On the other hand if all you can do is intervene in userland, then all
LVM team can do is provide basic skeleton for execution of some standard
scripts.
Post by Zdenek Kabelac
So all you need to do is to use the tool in user-space for this task.
So maybe we can have an assortment of some 5 interventionist policies
like:

a) Govern max snapshot size and drop snapshots when they exceed this
b) Freeze non-critical volumes when thin space drops below aggregate
values appropriate for the critical volumes
c) Drop snapshots when thin space <5% starting with the biggest one
d) Also freeze relevant snapshots in case (b)
e) Drop snapshots when exceeding max configured size in case of
threshold reach.

So for example you configure a max size for a snapshot. When a snapshot exceeds
that size it gets flagged for removal. But removal only happens when the other
condition is met (the threshold is reached).

So you would have 5 different interventions you could use that could be
considered somewhat standard, and the admin can just pick and choose or
customize.
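As an illustration of (c), the selection part could be as small as this (untested;
picking out snapshots via origin!="" may need adjusting for your lvm version, and
vg/thin-pool are placeholder names):

# estimate per-snapshot usage (virtual size * data_percent) and print the biggest first
lvs --noheadings --units b --nosuffix -o lv_name,lv_size,data_percent \
    --select 'pool_lv=thin-pool && origin!=""' vg |
awk '{ printf "%.0f %s\n", $2 * $3 / 100, $1 }' | sort -rn | head -n 1
# the name in the second column is the candidate for lvremove once pool free space is below 5%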
Post by Zdenek Kabelac
This is the main issue - these 'data' are pretty expensive to 'mine'
out of data structures.
But how expensive is it to do it say every 5 seconds?
Post by Zdenek Kabelac
It's the user space utility which is able to 'parse' all the structure
and take a 'global' picture. But of course it takes CPU and TIME and
it's not 'byte accurate' - that's why you need to start act early on
some threshold.
I get that but I wonder how expensive it would be to do that
automatically all the time in the background.

It seems to already happen?

Otherwise you wouldn't be reporting threshold messages.

In any case the only policy you could have in-kernel would be either what
Gionatan proposed (fixed reserved space for certain volumes - an easy
calculation) or potentially an allocation freeze at a threshold for
non-critical volumes.


I say you only implement per-volume space reservation, but anyway.

I just still don't see how one check per 4MB would be that expensive
provided you do data collection in background.

You say size can be as low as 64kB... well.... in that case...

You might have issues.



But in any case,

a) For intervention, choice is between customization by code and
customization by values.
b) Ready made scripts could take values but could also be easy to
customize
c) Scripts could take values from the LVM config or volume config but must
be easy to find/change/know about.

d) Scripts could document where to set the values.

e) Personally I would do the following:

a) Stop snapshots from working when a threshold is reached (95%) in a
rapid fashion

or

a) Just let everything fill up as long as system doesn't crash

b) Intervene to drop/freeze using scripts, where

1) I would drop snapshots starting with the biggest one in case of
threshold reach (general)

2) I would freeze non-critical volumes ( I do not write to
snapshots so that is no issue ) when critical volumes reached safety
threshold in free space ( I would do this in-kernel if I could ) ( But
Freezing In User-Space is almost the same ).

3) I would shrink existing volumes to better align with this
"critical" behaviour because now they are all large size to make moving
data easier

4) I would probably immediately implement these strategies if the
scripts were already provided

5) Currently I already have reporting in place (by email) so I
have no urgent need myself apart from still having an LVM version that
crashes

f) For a critical volume script, it is worth considering that small
volumes are more likely to be critical than big ones, so this could also
prompt people to organize their volumes in that way, and have a standard
mechanism to first protect the free space of smaller volumes against all
of the bigger ones, then the next up is only protected against ITS
bigger ones, and so on.

Basically when you have Big, Medium and Small, Medium is protected
against Big, and Small is protected against both others.

So the Medium protection is triggered sooner because it has a higher
space need compared to the Small volume, so Big is frozen before Medium
is frozen.

So when space then runs out, first Big is frozen, and when that doesn't
help, in time Medium is also frozen.

Seems pretty legit I must say.

And this could be completely unconfigured, just a standard recipe using
for configuration only the percentage you want to use.

Ie. you can say I want 5% free on all volumes from the top down, and
only the biggest one isn't protected, but all the smaller ones are.

If several are the same size you lump them together.

Now you have a cascading system in which if you choose this script, you
will have "Small ones protected against Big ones" protection in which
you really don't have to set anything up yourself.

You don't even have to flag them as critical...

Sounds like fun to make in any case.


g) There is a little program called "pam_shield" that uses
"shield_triggers" to select which kind of behaviour the user wants to
use in blocking external IPs. It provides several alternatives such as
IP routing block (blackhole) and iptables block.

You can choose which intervention you want. The scripts are already
provided. You just have to select the one you want.
Post by Zdenek Kabelac
Post by Xen
And to ensure that this is default behaviour?
Why you think this should be default ?
Default is to auto-extend thin-data & thin-metadata when needed if you
set threshold bellow 100%.
Q: In a 100% filled up pool, are snapshots still going to be valid?

Could it be useful to have a default policy of dropping snapshots at
high consumption? (ie. 99%). But it doesn't have to be default if you
can easily configure it and the scripts are available.

So no, if the scripts are available and the system doesn't crash as you
say it doesn't anymore, there does not need to be a default.

Just documented.

I've been condensing this email.

You could have a script like:

#!/bin/bash

# Assuming $1 is the thin pool (vg/pool) I am getting executed on, that $2 is the
# threshold that has been reached, and $3 is the free space available in the pool
# in bytes (these arguments are my assumption here, not a fixed interface)

MIN_FREE_SPACE_CRITICAL_VOLUMES_PCT=5
VG=${1%/*}; POOL=${1#*/}; FREE=$3

# 1. iterate critical volumes
# 2. calculate needed free space for those volumes based on above value
NEEDED=0
while read -r SIZE; do
    NEEDED=$(( NEEDED + SIZE * MIN_FREE_SPACE_CRITICAL_VOLUMES_PCT / 100 ))
done < <(lvs --noheadings --units b --nosuffix -o lv_size \
             --select "lv_tags=critical && pool_lv=$POOL" "$VG")

# 3. check against the free space in $3
if [ "$FREE" -lt "$NEEDED" ]; then
    # 4. perform action on every non-critical volume in the pool
    lvs --noheadings -o lv_name --select "lv_tags!=critical && pool_lv=$POOL" "$VG" |
    while read -r LV; do
        echo "freeze or otherwise restrict $VG/$LV here"
    done
fi

Well I am not saying anything new here compared to Brassow Jonathan.

But it could be that simple to have a script you don't even need to
configure.

More sophisticated, then, would be a big vs. small script in which you
don't even need to configure the critical volumes.

So to sum up my position is still:

a) Personally I would still prefer in-kernel protection based on quotas
b) Personally I would not want anything else from in-kernel protection
c) No other policies than that in the kernel
d) Just allocation block based on quotas based on lazy data collection

e) If people really use 64kB chunksizes and want max performance then
it's not for them
f) The analogy of the aeroplane that runs out of fuel and you have to
choose which passengers to eject does not apply if you use quotas.

g) I would want more advanced policy or protection mechanisms
(intervention) in userland using above ideas.

h) I would want inclusion of those basic default scripts in LVM upstream

i) The model of "shield_trigger" of "pam_shield" is a choice between
several default interventions
Post by Zdenek Kabelac
We can discuss if it's good idea to enable auto-extending by default -
as we don't know if the free space in VG is meant to be used for
thin-pool or there is some other plan admin might have...
I don't think you should. Any admin that uses thin and that intends to
auto-extend, will be able to configure so anyway.

When I said I wanted default, it is more like "available by default"
than "configured by default".

Using thin is a pretty conscious choice.

As long as it is easy to activate protection measures, that is not an
issue and does not need to be default imo.

Priorities for me:

1) Monitoring and reporting
2) System could block allocation for critical volumes
3) I can drop snapshots starting with the biggest one in case of <5%
pool free
4) I can freeze volumes when space for critical volumes runs out

Okay sending this now. I tried to summarize.

See ya.
Zdenek Kabelac
7 years ago
Permalink
...
Basically the user-land tool takes a runtime snapshot of the kernel metadata
(so it gets you information from some frozen point in time), then it processes the
input data (up to 16GiB!) and outputs some number - like what the
real number of unique blocks allocated in a thinLV is. Typically a snapshot may share
some blocks - or could already have all its blocks provisioned in case the shared
blocks were already modified.
Post by Zdenek Kabelac
Great - 'prediction' - we getting on the same page -  prediction is
big problem....
Yes I mean my own 'system' I generally of course know how much data is on it
and there is no automatic data generation.
However, lvm2 is not a 'Xen-oriented' tool only.
We need to provide a universal tool - everyone can adapt it to their needs.

Since your needs are different from others' needs.
But if I do create snapshots (which I do every day) when the root and boot
snapshots fill up (they are on regular lvm) they get dropped which is nice,
old snapshots are a different technology for a different purpose.
...
With 'plain' lvs output - it's just an orientational number.
Basically the highest referenced chunk for a given thin volume.
This is a great approximation of the size for a single thinLV.
But it is somewhat 'misleading' for thin devices created as snapshots...
(having shared blocks)

So you have no precise idea how many blocks are shared or uniquely owned by a
device.

Removal of a snapshot might mean you release NOTHING from your thin-pool, if all the
snapshot's blocks were shared with some other thin volumes....
If you say that any additional allocation checks would be infeasible because
it would take too much time per request (which still seems odd because the
checks wouldn't be that computation intensive and even for 100 gigabyte you'd
only have 25.000 checks at default extent size) -- of course you
asynchronously collect the data.
Processing the mapping of up to 16GiB of metadata will not happen in
milliseconds.... and it consumes memory and CPU...
I mean I generally like the designs of the LVM team.
I think they are some of the most pleasant command line tools anyway...
We try really hard....
On the other hand if all you can do is intervene in userland, then all LVM
team can do is provide basic skeleton for execution of some standard scripts.
Yes - we give the user all the power to suit thin-p to their individual needs.
Post by Zdenek Kabelac
So all you need to do is to use the tool in user-space for this task.
a) Govern max snapshot size and drop snapshots when they exceed this
b) Freeze non-critical volumes when thin space drops below aggregate values
appropriate for the critical volumes
c) Drop snapshots when thin space <5% starting with the biggest one
d) Also freeze relevant snapshots in case (b)
e) Drop snapshots when exceeding max configured size in case of threshold reach.
But you are aware you can run such a task even with a cronjob.
So for example you configure max size for snapshot. When snapshots exceeds
size gets flagged for removal. But removal only happens when other condition
is met (threshold reach).
We are blamed already for having way too many configurable knobs....
So you would have 5 different interventions you could use that could be
considered somewhat standard and the admin can just pick and choose or customize.
And we have a way longer list of actions we want to do ;) We have not yet come
to any single conclusion on how to make such a thing manageable for a user...
But how expensive is it to do it say every 5 seconds?
If you have big metadata - you would keep your Intel Core busy all the time ;)

That's why we have those thresholds.

The script is called at 50% fullness, then when it crosses 55%, 60%, ... 95%,
100%. When it drops below a threshold - you are called again once the boundary
is crossed...

So you can do a different action at each fullness level...
I get that but I wonder how expensive it would be to do that automatically all
the time in the background.
If you are a proud sponsor of your electricity provider and you like the extra
heating in your house - you can run this in a loop of course...
It seems to already happen?
Otherwise you wouldn't be reporting threshold messages.
Thresholds are based on the mapped size of the whole thin-pool.

Thin-pool surely knows all the time how many blocks are allocated and free for
its data and metadata devices.

(Though the numbers 'lvs' presents are not 'synchronized' - there could be up to
a 1-second delay between the reported & real number)
In any case the only policy you could have in-kernel would be either what
Gionatan proposed (fixed reserved space for certain volumes) (easy calculation
right) or potentially allocation freeze at threshold for non-critical volumes,
Within a single thin-pool all thins ARE equal.

Even a low number of written 'data' blocks may cause a tremendous amount of provisioning.

With a specifically crafted data pattern you can (in 1 second!) cause
provisioning of a large portion of your thin-pool (if not the whole one, in case
you have a small one in the range of gigabytes....)

And that's the main issue - what we solve in lvm2/dm - we want to be sure
that when the thin-pool is FULL - written & committed data are secure and safe.
A reboot is mostly unavoidable if you RUN from a device which is out of space -
we cannot continue to use such a device - unless you add MORE space to it within
the 60-second window.
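If I recall correctly, that 60-second window is the dm-thin-pool module's
no_space_timeout parameter - after it expires, queued IO to an out-of-space pool
starts to be errored:

# current value in seconds (0 means queue forever)
cat /sys/module/dm_thin_pool/parameters/no_space_timeout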


All the other proposals are only very localized solutions to problems which are
different for every user.

I.e. you could have a misbehaving daemon filling your system device very fast
with logs...

In practice - you would need some system analysis to detect which application
causes the highest pressure on provisioning - but that's well beyond the range of what
the lvm2 team can provide ATM with the number of developers it has....
I just still don't see how one check per 4MB would be that expensive provided
you do data collection in background.
You say size can be as low as 64kB... well.... in that case...
The default chunk size is 64k for the best 'snapshot' sharing - the bigger the
pool chunk is, the less likely you can 'share' it between snapshots...

(As pointed out in the other thread - the ideal chunk for best snapshot sharing would be
4K - but that's not affordable for other reasons....)
      2) I would freeze non-critical volumes ( I do not write to snapshots so
that is no issue ) when critical volumes reached safety threshold in free
space ( I would do this in-kernel if I could ) ( But Freezing In User-Space is
almost the same ).
There are lots of troubles when you have frozen filesystems present in your
machine's fs tree... - if you know all the connections and restrictions - it can be
'possibly' useful - but I can't imagine this being useful in the generic case...

And some more food for thought -

If you have pressure on provisioning caused by disk load on one of your
'critical' volumes, this FS 'freezing' scripting will 'buy' you only a couple of
seconds (depending on how fast your drives are and how big the thresholds you use)
and you are in the 'exact' same situation - except now you have the system in
bigger trouble - and you might already have frozen other system apps by
having them access your 'low-prio' volumes....

And how you will solve 'unfreezing' in case the thin-pool usage drops down again
is also a pretty interesting topic on its own...

I need to wish you good luck when you get to testing and developing all this
machinery.
Post by Zdenek Kabelac
Default is to auto-extend thin-data & thin-metadata when needed if you
set threshold bellow 100%.
Q: In a 100% filled up pool, are snapshots still going to be valid?
Could it be useful to have a default policy of dropping snapshots at high
consumption? (ie. 99%). But it doesn't have to be default if you can easily
configure it and the scripts are available.
All snapshots/thins with 'fsynced' data are always secure.
Thin-pool is protecting all user-data on disk.

The only lost data are those flying in your memory (not yet written to disk).
And it depends on your 'page-cache' setup how much that can be...
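For reference, the usual knobs that bound how much unwritten data can sit in the
page-cache:

# how much dirty data may accumulate before background / forced writeback kicks in
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs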


Regards


Zdenek
Brassow Jonathan
7 years ago
Permalink
...
Our general philosophy is, don’t do anything that will corrupt user data. After that, the LVM team wants to put in place the best possible solutions for a generic user set. When it comes to thin-provisioning, the best possible thing we can do that we are certain will not corrupt/lose data and is least likely to cause unintended consequences, is to try to grow the thin-pool. If we are unable to grow and the thin-pool is filling up, it is really hard to “do the right thing”.

There are many solutions that could work - unique to every workload and different user. It is really hard for us to advocate for one of these unique solutions that may work for a particular user, because it may work very badly for the next well-intentioned googler.

We’ve tried to strike a balance of doing the things that are knowably correct and getting 99% of the problems solved, and making the user aware of the remaining problems (like 100% full thin-provisioning) while providing them the tools (like the ‘thin_command’ setting) so they can solve the remaining case in the way that is best for them.

We probably won’t be able to provide any highly refined scripts that users can just plug in for the behavior they want, since they are often so highly specific to each customer. However, I think it will be useful to try to create better tools so that users can more easily get the behavior they want. We want to travel as much distance toward the user as possible and make things as usable as we can for them. From this discussion, we have uncovered a handful of useful ideas (e.g. this bug that Zdenek filed: https://bugzilla.redhat.com/show_bug.cgi?id=1491609) that will make more robust scripts possible. We are also enhancing our reporting tools so users can better sort through LVM information and take action. Again, this is in direct response to the feedback we’ve gotten here.

Thanks,
brassow
Gionatan Danti
7 years ago
Permalink
...
Excellent, thank you all very much.

From the two proposed solutions (lvremove vs lverror), I think I would
prefer the second one. Obviously with some warning like "are you sure you
want to error this active volume?"

Regards.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Xen
7 years ago
Permalink
Post by Brassow Jonathan
There are many solutions that could work - unique to every workload
and different user. It is really hard for us to advocate for one of
these unique solutions that may work for a particular user, because it
may work very badly for the next well-intentioned googler.
Well, thank you.

Of course there is a split between saying "it is the administrator's job
to make sure everything works well" and at the same time saying that those
administrators can be "googlers".

There's a big gap there. I think that many who do employ thinp
will be at least a bit more serious about it, but perhaps not so serious
that they can devote all the resources to developing all of the
mitigating measures that anyone could want.

So I think the common truth lies more in the middle: they are not
googlers who implement the first random article they find without
thinking about it, and they are not professional people in full time
employment doing this thing.


So, because most administrators interested in thin, like myself, will
already have read the LVM manpages a great deal on their own
systems...

And any common default targets for "thin_command" could also be well
documented and explained, with pros and cons laid out.

The only thing we are talking about today is reserving space due to some
threshold.

And performing an action when that reservation is threatened.

So this is the common need here.

This need is going to be the same for everyone that uses any scheme that
could be offered.

Then the question becomes: are interventions also as common?

Well there are really only a few available:

a) turning into error volume as per the bug
b) fsfreezing
c) merely reporting
d) (I am not sure if "lvremove" should really be seriously considered).

At this point you have basically exhausted any default options you may
have that are "general". No one actually needs more than that.

What becomes interesting now is the logic underpinning these decisions.

This logic needs some time to write and this is the thing that
administrators will put off.

So they will live with not having any intelligence in automatic response
and will just live with the risk of a volume filling up without having
written the logic that could activate the above measures.

That's the problem.

So what I am advocating for -- I am not disregarding Mr. Zdenek's bug
;-), [1]. In fact I think this "lverror" would be very welcome
(paraphrasing here), even though personally I would want to employ a
filesystem mechanism if I am doing this using a userland tool anyway!!!

But sure, why not.

I think that is complementary to and orthogonal to the issue of where
the logic is coming from, and that the logic also requires a lot of
resources to write.

So even though you could probably hack it together in some 15 minutes,
and then you need testing etc...

I think it would just be a lot more pleasant if this logic framework
already existed, was tried and tested, did the job correctly, and can
easily be employed by anyone else.

So I mean to say that currently we are only talking about space
reservation.


You can only do this in a number of ways:

- % of total volume size.

- fixed amount configured per volume

And that's basically it.

The former merely requires each volume to be 'flagged' as 'critical' as
suggested.
The latter requires some number to be defined and then flagging is
unnecessary.

The script would ensure that:

- not ALL thin volumes are 'critical'.
- as long as a single volume is non-critical, the operation can continue
- all critical volumes are aggregated in required free space
- the check is done against currently available free space
- the action on the non-critical-volumes is performed if necessary.

That's it. Anyone could use this.



The "Big vs. Small" model is a little bit more involved and requires a
little bit more logic, and I would not mind writing it, but it follows
along the same lines.

*I* say that in this department, *only* these two things are needed.

+ potentially the lverror thing.

So I don't really see this wildgrowth of different ideas.


So personally I would like the "set manual size" more than the "use
percentage" in the above. I would not want to flag volumes as critical,
I would just want to set their reserved space.

I would prefer if I could set this in the LVM volumes themselves, rather
than in the script.

If the script used a percentage, I would want to be able to configure
the percentage outside the script as well.

I would want the script to do the heavy lifting of knowing how to
extract these values from the LVM volumes, and some information on how
to put them there.

(Using tags and all of that is not all that common knowledge I think).

Basically, I want the script to know how to set and retrieve properties
from the LVM volumes.

Then I want it to be easy to see the reserved space (potentially)
(although this can conflict with not being a really integrated feature)
and perhaps to set and change it...

So I think that what is required is really only minimal...

But that doesn't mean it is unnecessary.
Post by Brassow Jonathan
We’ve tried to strike a balance of doing the things that are knowably
correct and getting 99% of the problems solved, and making the user
aware of the remaining problems (like 100% full thin-provisioning)
while providing them the tools (like the ‘thin_command’ setting) so
they can solve the remaining case in the way that is best for them.
I am really happy to learn about these considerations.

I hope that we can see as the result of this today the inclusion of the
script you mentioned in the previous email.

Something that hopefully would use values tagged into volumes, and a
script that would need no modification by the user.

Something that would e.g. be called with the name of the thin pool as
first parameter (pardon my ignorance) and would take care of all the
rest by retrieving values tagged onto volumes.


( I mean that's what I would write, but if I were to write it probably
no one else would ever use it, so .... (He says with a small voice) ).

And personally I would prefer this script to use "fsfreeze" as you
mentioned (I was even not all that aware of this command...) rather than
changing to an error target.

But who knows.

I am not saying it's a bad thing.

Seems risky though.

So honestly I just completely second the script you proposed, mr.
Jonathan.

;-).


While I still don't know why any in-kernel thing is impossible, seeing
that Zdenek-san mentioned overall block availability to be known, and
that you only need overall block availability + some configured values
to impose any sanctions on non-critical volumes.....

I would hardly feel a need for such a measure if the script mentioned
and perhaps the other idea that I like so much of "big vs small" would
be readily available.

I really have no other wishes than that personally.

It's that simple.

Space reservation and big to small protection.

Those are the only things I want.

Now all that's left to do is upgrade my LVM version ;-).

(Hate messing with a Debian install ;-)).

And I feel almost like writing it myself after having talked about it
for so long anyway...

(But that's what happens when you develop ideas).
...
Well that's very good. It took me a long time to sort through the
information without the thin_ls command.

Indeed, if such data is readily available it makes the burden of writing
the script yourself much less as well.

I would still vouch for the inclusion of the 2 scripts mentioned:

- space reservation
- big vs. small

And I don't mind writing the second one myself, or at least an example
of it.

Regards.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1491609
Xen
7 years ago
Permalink
Post by Zdenek Kabelac
Basically user-land tool takes a runtime snapshot of kernel metadata
(so gets you information from some frozen point in time) then it
processes the input data (up to 16GiB!) and outputs some number - like
what is the
real unique blocks allocated in thinLV.
That is immensely expensive indeed.
Post by Zdenek Kabelac
Typically snapshot may share
some blocks - or could have already be provisioning all blocks in
case shared blocks were already modified.
I understand and it's good technology.
Post by Zdenek Kabelac
Post by Xen
Yes I mean my own 'system' I generally of course know how much data is
on it and there is no automatic data generation.
However lvm2 is not 'Xen oriented' tool only.
We need to provide universal tool - everyone can adapt to their needs.
I said that to indicate that prediction problems are currently not that
important for me, but they definitely would be important in other
scenarios or for other people.

You twist my words around to imply that I am trying to make myself
special, while I was making myself unspecial: I was just being modest
there.
Post by Zdenek Kabelac
Since your needs are different from others needs.
Yes and we were talking about the problems of prediction, thank you.
Post by Zdenek Kabelac
Post by Xen
But if I do create snapshots (which I do every day) when the root and
boot snapshots fill up (they are on regular lvm) they get dropped
which is nice,
old snapshot are different technology for different purpose.
Again, what I was saying was to support the notion that having snapshots
that may grow a lot can be a problem.

I am not sure the purpose of non-thin vs. thin snapshots is all that
different though.

They are both copy-on-write in a certain sense.

I think it is the same tool with different characteristics.
Post by Zdenek Kabelac
With 'plain' lvs output is - it's just an orientational number.
Basically highest referenced chunk for a thin given volume.
This is great approximation of size for a single thinLV.
But somewhat 'misleading' for thin devices being created as
snapshots...
(having shared blocks)
I understand. The above number for "snapshots" was just the missing
amount left over from summing up the volumes.

So I had no way to know snapshot usage.

I just calculated all used extents per volume.

The missing extents I put in snapshots.

So I think it is a very good approximation.
Post by Zdenek Kabelac
So you have no precise idea how many blocks are shared or uniquely
owned by a device.
Okay. But all the numbers were attributed to the correct volume
probably.

I did not count the usage of the snapshot volumes.

Whether they are shared or unique is irrelevant from the point of view
of wanting to know the total consumption of the "base" volume.

In the above 6 extents were not accounted for (24 MB) so I just assumed
that would be sitting in snapshots ;-).
Post by Zdenek Kabelac
Removal of snapshot might mean you release NOTHING from your
thin-pool if all snapshot blocks where shared with some other thin
volumes....
Yes, but that was not indicated in above figure either. It was just 24
MB that would be freed ;-).

Snapshots can only become a culprit if you start overwriting a lot of
data, I guess.
Post by Zdenek Kabelac
Post by Xen
If you say that any additional allocation checks would be infeasible
because it would take too much time per request (which still seems odd
because the checks wouldn't be that computation intensive and even for
100 gigabyte you'd only have 25.000 checks at default extent size) --
of course you asynchronously collect the data.
Processing of mapping of upto 16GiB of metadata will not happen in
miliseconds.... and consumes memory and CPU...
I get that. If that is the case.

That's just the sort of thing that in the past I have been keeping track
of continuously (in unrelated stuff) such that every mutation also
updated the metadata without having to recalculate it...

I am meaning to say that if indeed this is the case and indeed it is
this expensive, then clearly what I want is not possible with that
scheme.

I mean to say that I cannot argue about this design. You are the
experts.

I would have to go in learning first to be able to say anything about it
;-).

So I can only defer to your expertise. Of course.

But the upshot of what you're saying is that the number of blocks uniquely
owned by any snapshot is not known at any one point in time.

And needs to be derived from the entire map. Okay.

Thus reducing allocation would hardly be possible, you say.

Because the information is not known anyway.


Well pardon me for digging this deeply. It just seemed so alien that
this thing wouldn't be possible.

I mean it seems so alien that you cannot keep track of those numbers
at runtime without having to calculate them using aggregate measures.

It seems information you want the system to have at all times.

I am just still incredulous that this isn't being done...

But I am not well versed in kernel concurrency measures so I am hardly
qualified to comment on any of that.

In any case, thank you for your time in explaining. Of course this is
what you said in the beginning as well, I am just still flabbergasted
that there is no accounting being done...

Regards.
Post by Zdenek Kabelac
Post by Xen
I think they are some of the most pleasant command line tools
anyway...
We try really hard....
You're welcome.
Post by Zdenek Kabelac
Post by Xen
On the other hand if all you can do is intervene in userland, then all
LVM team can do is provide basic skeleton for execution of some
standard scripts.
Yes - we give all the power to suit thin-p for individual needs to the user.
Which is of course pleasant.
...
Sure the point is not that it can't be done, but that it seems an unfair
burden on the system maintainer to do this in isolation of all other
system maintainers who might be doing the exact same thing.

There is some power in numbers, and it just helps a lot if a
common scenario is provided by a central party.

I understand that every professional outlet dealing in terabytes upon
terabytes of data will have the manpower to do all of this and do it
well.

But for everyone else, it is a landscape you cannot navigate because you
first have to deploy that manpower before you can start using the
system!!!

It becomes a rather big enterprise to install thinp for anyone!!!

Because to get it running takes no time at all!!! But to get it running
well then implies huge investment.

I just wouldn't mind if this gap was smaller.

Many of the things you'd need to do are pretty standard. Running more
and more cronjobs... well I am already doing that. But it is not just
the maintenance of the cron job (installation etc.) but also the script
itself that you have to first write.

That means, for me and for others who may not be doing it professionally
or in a larger organisation, that the benefit of spending all that time may
not outweigh its cost, and the result is that you stay stuck with a deeply
suboptimal situation in which there is little or no reporting or fixing, all
because the initial investment is too high.

Commonly provided scripts just hugely reduce that initial investment.

For example the bigger vs. smaller system I imagined. Yes I am eager to
make it. But I got other stuff to do as well :p.

And then, when I've made it, chances are high no one will ever use it
for years to come.

No one else I mean.
Post by Zdenek Kabelac
Post by Xen
So for example you configure a max size for a snapshot. When a snapshot
exceeds that size it gets flagged for removal. But removal only happens when
another condition is met (a threshold is reached).
We are blamed already for having way too many configurable knobs....
Yes but I think it is better to script these things anyway.

Any official mechanism is only going to be inflexible when it goes that
far.

Like I personally don't like systemd services compared to cronjobs.
Systemd services take longer to set up, you have to work within a
descriptive language, and so on.

Then you need to find out exactly what the possibilities of that
descriptive language are; maybe there is a feature you do not know about
yet, but you can probably also code the same thing using knowledge you
already have, without needing to read any man pages.

So I do create those services.... for the boot sequence... but anything
I want to run regularly I still do with a cron job...

It's a bit archaic to install but... it's simple, clean, and you have
everything in one screen.
Post by Zdenek Kabelac
Post by Xen
So you would have 5 different interventions you could use that could
be considered somewhat standard and the admin can just pick and choose
or customize.
And we have a way longer list of actions we want to do ;) We have not
yet come to any single conclusion on how to make such a thing manageable
for a user...
Hmm.. Well I cannot ... claim to have the superior idea here.

But Idk... I think you can focus on the model right.

Maintaining max snapshot consumption is one model.

Freezing bigger volumes to protect space for smaller volumes is another
model.

Doing so based on a "critical" flag is another model... (not myself such
a fan of that)... (more to configure).

Reserving max, set or configured space for a specific volume is another
model.

(That would be actually equivalent to a 'critical' flag since only those
volumes that have reserved space would become 'critical' and their space
reservation is going to be the threshold to decide when to deny other
volumes more space).

So you can simply call the 'critical flag' idea the same as the 'space
reservation' idea.

The basic idea is that all space reservations get added together and
become a threshold.

So that's just one model and I think it is the most important one.

"Reserve space for certain volumes" (but not all of them or it won't
work). ;-).

This is what Gionatan referred to with the ZFS ehm... shit :p.

And the topic of this email thread.




So you might as well focus on that one alone as per Mr. Jonathan's
reply.

(Pardon for my language there).




While personally I also like the bigger versus smaller idea because you
don't have to configure it.


The only configuration you need to do is to ensure that the more
important volumes are a bit smaller.

Which I like.

Then there is automatic space reservation using fsfreezing, because the
free space required by bigger volumes is always going to be bigger than
that needed by the smaller ones.
Post by Zdenek Kabelac
Post by Xen
But how expensive is it to do it say every 5 seconds?
If you have big metadata - you would keep your Intel Core busy all the time ;)
That's why we have those thresholds.
The script is called at 50% fullness, then when it crosses 55%, 60%, ...
95%, 100%. When it drops below a threshold - you are called again once
the boundary is crossed...
How do you know when it is at 50% fullness?
Post by Zdenek Kabelac
If you are a proud sponsor of your electricity provider and you like the
extra heating in your house - you can run this in a loop of course...
Thresholds are based on the mapped size for the whole thin-pool.
Thin-pool surely knows all the time how many blocks are allocated and free for
its data and metadata devices.
But didn't you just say you needed to process up to 16GiB to know this
information?

I am confused?

This means the in-kernel policy can easily be implemented.

You may not know the size and attribution of each device but you do know
the overall size and availability?
Post by Zdenek Kabelac
Post by Xen
In any case the only policy you could have in-kernel would be either
what Gionatan proposed (fixed reserved space for certain volumes)
(easy calculation right) or potentially allocation freeze at threshold
for non-critical volumes,
In the single thin-pool all thins ARE equal.
But you could make them unequal ;-).
Post by Zdenek Kabelac
A low number of 'data' blocks may cause a tremendous amount of provisioning.
With a specifically written data pattern you can (in 1 second!) cause
provisioning of a large portion of your thin-pool (if not the whole one
in case you have a small one in the range of gigabytes....)
Because you only have to write a byte to every extent, yes.
Post by Zdenek Kabelac
And that's the main issue - what we solve in lvm2/dm - we want to be
sure that when thin-pool is FULL - written & committed data are
secure and safe.
Reboot is mostly unavoidable if you RUN from a device which is
out-of-space -
we cannot continue to use such a device - unless you add MORE space to
it within a 60-second window.
That last part is utterly acceptable.
Post by Zdenek Kabelac
All other proposals only solve very localized problems,
which are different for every user.
I.e. you could have a misbehaving daemon filling your system device
very fast with logs...
In practice - you would need some system analysis to detect which
application causes the highest pressure on provisioning - but that's well
beyond the range of what the lvm2 team can provide ATM with the number of
developers it has....
And any space reservation would probably not do much; if it is not
filled 100% now, it will be so in a few seconds, in that sense.

The goal was more to protect the other volumes, supposing that log
writing happened on another one, for that other log volume not to impact
the other main volumes.

So you have thin global reservation of say 10GB.

Your log volume is overprovisioned and starts eating up the 20GB you
have available and then runs into the condition that only 10GB remains.

The 10GB is a reservation maybe for your root volume. The system
(scripts) (or whatever) recognises that less than 10GB remains, that you
have claimed it for the root volume, and that the log volume is
intruding upon that.

It then decides to freeze the log volume.

But it is hard to decide what volume to freeze because it would need
that run-time analysis of what's going on. So instead you just freeze
all non-reserved volumes.

So all non-critical volumes in Gionatan and Brassow's parlance.
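
To make that concrete, the kind of script I have in mind would be something
like this (only a sketch; the VG name and the "critical" tag are my own
convention, nothing lvm2 defines, and you would of course exclude / and
anything the script itself needs):

  #!/bin/sh
  # Sketch: freeze every mounted thin LV in the VG that is NOT tagged "critical".
  VG=vg0
  lvs --noheadings --separator ';' -o lv_path,lv_tags,pool_lv "$VG" | sed 's/^ *//' |
  while IFS=';' read -r path tags pool; do
      [ -n "$pool" ] || continue                        # skip LVs that are not thin volumes
      case ",$tags," in *,critical,*) continue ;; esac  # leave the reserved/critical ones alone
      mnt=$(findmnt -n -o TARGET "$path") || continue   # only freeze what is actually mounted
      fsfreeze --freeze "$mnt" && logger "thin-pool pressure: froze $path at $mnt"
  done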
Post by Zdenek Kabelac
Post by Xen
I just still don't see how one check per 4MB would be that expensive
provided you do data collection in background.
You say size can be as low as 64kB... well.... in that case...
Default chunk size is 64k for the best 'snapshot' sharing - the bigger
the pool chunk is, the less likely you can 'share' it between
snapshots...
Okay.. I understand. I guess I was deluded a bit by non-thin snapshot
behaviour (filled up really fast without me understanding why, and
concluding that it was doing 4MB copies).

As well as of course that extents were calculated in whole numbers in
overviews... apologies.

But attribution of an extent to a snapshot will still be done in
extent-sizes right?

So I was just talking about allocation, nothing else.

BUT if allocator operates on 64kB requests, then yes...
Post by Zdenek Kabelac
(As pointed in other thread - ideal chunk for best snapshot sharing
would be 4K - but that's not affordable for other reasons....)
Okay.
...
Well, yeah. Linux.

(I mean, just a single broken NFS or CIFS connection can break so
much....).
Post by Zdenek Kabelac
And more for your thinking -
If you have pressure on provisioning caused by disk-load on one of
your 'critical' volumes this FS 'freezing' scripting will 'buy' you
only a couple of seconds
Oh yeah of course, this is correct.
Post by Zdenek Kabelac
(depends on how fast your drives are and how big the
thresholds you use) and you are in the 'exact' same situation -
except now you have the system in bigger trouble - and you might already
have frozen other system apps by having them accessing your
'low-prio' volumes....
Well I guess you would reduce non-critical volumes to single-purpose
things.

Ie. only used by one application.
Post by Zdenek Kabelac
And how you will solve 'unfreezing' in case thin-pool usage
drops down is also a pretty interesting topic on its own...
I guess that would be manual?
Post by Zdenek Kabelac
I need to wish good luck when you will be testing and developing all
this machinery.
Well as you say it has to be an anomaly in the first place -- an error
or problem situation.

It is not standard operation.

So I don't think the problems of freezing are bigger than the problems
of rebooting.

The whole idea is that you attribute non-critical volumes to single apps
or single purposes so that when they run amok, or in any case, that if
anything runs amok on them...

Yes it won't protect the critical volumes from being written to.

But that's okay.

You don't need to automatically unfreeze.

You need to send an email and say stuff has happened ;-).

"System is still running but some applications may have crashed. You
will need to unfreeze and restart in order to solve it, or reboot if
necessary. But you can still log into SSH, so maybe you can do it
remotely without a console ;-)".

I don't see any issues with this.

One could say: use filesystem quotas.

Then that involves setting up users etc.

Setting up a quota for a specific user on a specific volume...

All more configuration.

And you're talking mostly about services of course.

The benefit (and danger) of LVM is that it is so easy to create more
volumes.

(The danger being that you now also need to back up all these volumes).

(Independently).
...
That seemes pretty secure. Thank you.

So there is no issue with snapshots behaving differently. It's all the
same and all committed data will be safe prior to the fillup and not
change afterward.

I guess.
Zdenek Kabelac
7 years ago
Permalink
Post by Zdenek Kabelac
But if I do create snapshots (which I do every day) when the root and boot
snapshots fill up (they are on regular lvm) they get dropped which is nice,
old snapshots are a different technology for a different purpose.
Again, what I was saying was to support the notion that having snapshots that
may grow a lot can be a problem.
lvm2 makes them look the same - but underneath it's very different (and it's
not just by age - but also for targeting different purpose).

- old-snaps are good for short-lived small snapshots - when the estimate is
that there will be a low number of changes and it's not a big issue if the
snapshot is 'lost'.

- thin-snaps are ideal for long-living objects with the possibility to take
snaps of snaps of snaps, and you are guaranteed the snapshot will not 'just
disappear' while you modify your origin volume...

Both have very different resource requirements and performance...
I am not sure the purpose of non-thin vs. thin snapshots is all that different
though.
They are both copy-on-write in a certain sense.
I think it is the same tool with different characteristics.
There are cases where it's quite a valid option to take an old-snap of a
thinLV and it will pay off...

Even exactly in the case where you use thin and you want to make sure your
temporary snapshot will not 'eat' all your thin-pool space and you want to let
the snapshot die.

Thin-pool still does not support shrinking - so if the thin-pool auto-grows to
a big size - there is no way for lvm2 to reduce the thin-pool size...
That's just the sort of thing that in the past I have been keeping track of
continuously (in unrelated stuff) such that every mutation also updated the
metadata without having to recalculate it...
Would you prefer to spend all your RAM keeping all the mapping information for
all the volumes and put very complex code into the kernel to parse information
which is technically already out-of-date the moment you get the result ??

In 99.9% of runtime you simply don't need this info.
But the upshot of what you're saying is that the number of blocks uniquely
owned by any snapshot is not known at any one point in time.
As long as 'thinLV' (i.e. your snapshot thinLV) is NOT active - there is
nothing in kernel maintaining its dataset. You can have lots of thinLV active
and lots of other inactive.
Well pardon me for digging this deeply. It just seemed so alien that this
thing wouldn't be possible.
I'd say it's very smart ;)

You can use only very small subset of 'metadata' information for individual
volumes.
It becomes a rather big enterprise to install thinp for anyone!!!
It's enterprise level software ;)
Because to get it running takes no time at all!!! But to get it running well
then implies huge investment.
In most common scenarios - the user knows when he runs out-of-space - it will not
be a 'pleasant' experience - but the user's data should be safe.

And then it depends how much energy/time/money user wants to put into
monitoring effort to minimize downtime.

As has been said - disk-space is quite cheap.
So if you monitor and insert your new disk-space in time (enterprise...) you
have a smaller set of problems - than if you try to fight constantly with a
100% full thin-pool...

You still have problems even when you have 'enough' disk-space ;)
i.e. you select a small chunk-size and you want to extend the thin-pool data
volume beyond its addressable capacity - each chunk-size has its final maximum
data size....
That means, for me and for others who may not be doing it professionally or in
a larger organisation, that the benefit of spending all that time may not
outweigh its cost, and the result is that you stay stuck with a deeply
suboptimal situation in which there is little or no reporting or fixing, all
because the initial investment is too high.
You can always use normal device - it's really about the choice and purpose...
While personally I also like the bigger versus smaller idea because you don't
have to configure it.
I'm still proposing to use different pools for different purposes...

Sometimes spreading the solution across existing logic is way easier
than trying to achieve some super-intelligent universal one...
...
Of course thin-pool has to be aware how much free space it has.
And this you can somehow imagine as 'hidden' volume with FREE space...

So to give you this 'info' about free blocks in pool - you maintain very
small metadata subset - you don't need to know about all other volumes...

If another volume is releasing or allocating chunks - your 'FREE space' gets
updated....

It's complex underneath and locking is very performance sensitive - but for
easy understanding you can possibly get the picture out of this...
You may not know the size and attribution of each device but you do know the
overall size and availability?
The kernel supports 1 threshold setting - the user-space (dmeventd) is
woken up when usage has passed it.

This value maps to the lvm.conf autoextend threshold.

As a 'secondary' source - dmeventd checks pool fullness every 10 seconds with a
single ioctl() call, compares how the fullness has changed, and provides you
with callbacks for those 50, 55, ... jumps
(as can be found in 'man dmeventd')

So for autoextend threshold passing you get an instant call.
For all the others there is an up-to-10-second delay for discovery.
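
(And for completeness - the same fullness numbers can be read from user-space
at any moment; the vg/pool names below are just an example:)

  # percent of data & metadata space used, as lvm2 reports it:
  lvs --noheadings -o data_percent,metadata_percent vg0/pool0

  # the raw allocated/total block counts straight from the kernel target
  # (the -tpool device is the real thin-pool mapping behind the lvm2 names):
  dmsetup status vg0-pool0-tpool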
Post by Zdenek Kabelac
In the single thin-pool  all thins ARE equal.
But you could make them unequal ;-).
I cannot ;) - I'm lvm2 coder - dm thin-pool is Joe's/Mike's toy :)

In general - you can come up with many different kernel modules which take
a different approach to the problem.

Worth noting - RH now has Permabit in its portfolio - so there can be more than
one type of thin-provisioning supported in lvm2...

Permabit solution has deduplication, compression, 4K blocks - but no snapshots....
The goal was more to protect the other volumes, supposing that log writing
happened on another one, for that other log volume not to impact the other
main volumes.
IMHO best protection is different pool for different thins...
You can more easily decide which pool can 'grow-up'
and which one should rather be taken offline.

So your 'less' important data volumes may simply hit the wall hard,
while your 'strategically important' one will avoid using overprovisioning as
much as possible to keep it running.

Motto: keep it simple ;)
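
(Just to illustrate - two pools in one VG, so the 'important' one never
overprovisions; the names and sizes are only an example:)

  lvcreate --type thin-pool -L 100G -n crit_pool vg0
  lvcreate --type thin-pool -L 400G -n bulk_pool vg0

  lvcreate -V  80G -T vg0/crit_pool -n rootlv      # total virtual size kept <= pool size
  lvcreate -V 500G -T vg0/bulk_pool -n scratchlv   # may overprovision and hit the wall alone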
So you have thin global reservation of say 10GB.
Your log volume is overprovisioned and starts eating up the 20GB you have
available and then runs into the condition that only 10GB remains.
The 10GB is a reservation maybe for your root volume. The system (scripts) (or
whatever) recognises that less than 10GB remains, that you have claimed it for
the root volume, and that the log volume is intruding upon that.
It then decides to freeze the log volume.
Of course you can play with 'fsfreeze' and other things - but all these things
are very special to individual users with their individual preferences.

Effectively if you freeze your 'data' LV - as a reaction you may paralyze the
rest of your system - unless you know the 'extra' information about the user
use-pattern.

But do not take this as something to discourage you from trying it - you may come
up with a perfect solution for your particular system - and some other user may
find it useful in some similar pattern...

It's just something that lvm2 can't give support globally.

But lvm2 will give you enough bricks for writing 'smart' scripts...
Okay.. I understand. I guess I was deluded a bit by non-thin snapshot
behaviour (filled up really fast without me understanding why, and concluding
that it was doing 4MB copies).
Fast disks are now easily able to write gigabytes in a second... :)
But attribution of an extent to a snapshot will still be done in extent-sizes
right?
The allocation unit in a VG is the 'extent' - it ranges from 1 sector to 4GiB
and the default is 4M - yes....
So I don't think the problems of freezing are bigger than the problems of
rebooting.
With 'reboot' you know where you are - it's IMHO fair condition for this.

With a frozen FS and a paralyzed system, your 'fsfreeze' operation on
unimportant volumes has actually even eaten space from the thin-pool which might
possibly have been better used to store data for the important volumes....
and there is even a big danger you will 'freeze' yourself already during the call
of fsfreeze (unless of course you put BIG margins around)
"System is still running but some applications may have crashed. You will need
to unfreeze and restart in order to solve it, or reboot if necessary. But you
can still log into SSH, so maybe you can do it remotely without a console ;-)".
Compare with email:

Your system has run out-of-space, all actions to gain some more space have
failed - going to reboot into some 'recovery' mode
So there is no issue with snapshots behaving differently. It's all the same
and all committed data will be safe prior to the fillup and not change afterward.
Yes - snapshot is 'user-land' language - in the kernel - all thins map chunks...

If you can't map a new chunk - things are going to stop - and start to error
out shortly...

Regards

Zdenek
Xen
7 years ago
Permalink
...
Point being that short-time small snapshots are also perfectly served by
thin...

So I don't really think there are many instances where "old" trumps
"thin".

Except, of course, if the added constraint is a plus (knowing in advance
how much it is going to cost).

But that's the only thing: predictability.

I use my regular and thin snapshots for the same purpose. Of course you
can do more with Thin.
Post by Zdenek Kabelac
There are cases where it's quite a valid option to take an old-snap of a
thinLV and it will pay off...
Even exactly in the case where you use thin and you want to make sure your
temporary snapshot will not 'eat' all your thin-pool space and you
want to let the snapshot die.
Right.

That sounds pretty sweet actually. But it will be a lot slower right.

I currently just make new snapshots each day. They live for an entire
day. If the system wants to make a backup of the snapshot it has to do
it within the day ;-).

My root volume is not on thin and thus has an "old-snap" snapshot. If
the snapshot is dropped it is because of lots of upgrades but this is no
biggy; next week the backup will succeed. Normally the root volume
barely changes.

So it would be possible to reserve regular LVM space for thin volumes as
well right, for snapshots, as you say below. But will this not slow down
all writes considerably more than a thin snapshot?

So while my snapshots are short-lived, they are always there.

The current snapshot is always of 0:00.
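
(If I have the syntax right, the two variants look like this on the command
line - the sizes and names are just examples:)

  # thin snapshot: no size given, lives inside the pool, shares chunks with the origin
  lvcreate -s -n nightly_thin vg0/thinvol

  # old-style (COW) snapshot of the same thin volume: a fixed -L budget taken
  # from free VG space, so it can fill up and be dropped without growing the pool
  lvcreate -s -L 5G -n nightly_cow vg0/thinvol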
Post by Zdenek Kabelac
Thin-pool still does not support shrinking - so if the thin-pool
auto-grows to a big size - there is no way for lvm2 to reduce the
thin-pool size...
Ah ;-). A detriment of auto-extend :p.
Post by Zdenek Kabelac
Post by Xen
That's just the sort of thing that in the past I have been keeping
track of continuously (in unrelated stuff) such that every mutation
also updated the metadata without having to recalculate it...
Would you prefer to spend all your RAM keeping all the mapping
information for all the volumes and put very complex code into the kernel
to parse information which is technically already out-of-date
the moment you get the result ??
No if you only kept some statistics that would not amount to all the
mapping data but only to a summary of it.

Say if you write a bot that plays a board game. While searching for
moves the bot has to constantly perform moves on the board. It can
either create new board instances out of every move, or just mutate the
existing board and be a lot faster.

In mutating the board it will each time want the same information as
before: how many pieces does the white player have, how many pieces the
black player, and so on.

A lot of this information is easier to update than to recalculate, that
is, the moves themselves can modify this summary information, rather
than derive it again from the board positions.

This is what I mean by "updating the metadata without having to
recalculate it".

You wouldn't have to keep the mapping information in RAM, just the
amount of blocks attributed and so on. A single number. A few single
numbers for each volume and each pool.

No more than maybe 32 bytes, I don't know.

It would probably need to be concurrently updated, but that's what it
is.

You just maintain summary information that you do not recalculate, but
just modify each time an action is performed.
Post by Zdenek Kabelac
Post by Xen
But the purpose of what you're saying is that the number of uniquely
owned blocks by any snapshot is not known at any one point in time.
As long as 'thinLV' (i.e. your snapshot thinLV) is NOT active - there
is nothing in kernel maintaining its dataset. You can have lots of
thinLV active and lots of other inactive.
But if it's not active, can it still 'trace' another volume? Ie. it has
to get updated if it is really a snapshot of something right.

If it doesn't get updated (and not written to) then it also does not
allocate new extents.

So then it never needs to play a role in any mechanism needed to prevent
allocation.

However volumes that see new allocation happening for them, would then
always reside in kernel memory right.

You said somewhere else that overall data (for pool) IS available. But
not for volumes themselves?

Ie. you don't have a figure on uniquely owned vs. shared blocks.

I get that it is not unambiguous to interpret these numbers.

Regardless with one volume as "master" I think a non-ambiguous
interpretation arises?

So is or is not the number of uniquely owned/shared blocks known for
each volume at any one point in time?
Post by Zdenek Kabelac
Post by Xen
Well pardon me for digging this deeply. It just seemed so alien that
this thing wouldn't be possible.
I'd say it's very smart ;)
You mean not keeping everything in memory.
Post by Zdenek Kabelac
You can use only very small subset of 'metadata' information for
individual volumes.
But I'm still talking about only summary information...
Post by Zdenek Kabelac
Post by Xen
It becomes a rather big enterprise to install thinp for anyone!!!
It's enterprise level software ;)
Well I get that you WANT that ;-).

However with the appropriate amount of user friendliness what was first
only for experts can be simply for more ordinary people ;-).

I mean, kuch kuch, if I want some SSD caching in Microsoft Windows, kuch
kuch, I right click on a volume in Windows Explorer, select properties,
select ReadyBoost tab, click "Reserve complete volume for ReadyBoost",
click okay, and I'm done.

It literally takes some 10 seconds to configure SSD caching on such a
machine.

Would probably take me some 2 hours in Linux not just to enter the
commands but also to think about how to do it.

Provided I don't end up with the SSD kernel issues with IO queue
bottlenecking I had before...

Which, I can tell you, took a multitude of those 2 hours with the
conclusion that the small mSata SSD I had was just not suitable, much
like some USB device.


For example, OpenVPN clients on Linux are by default not configured to
automatically reconnect when there is some authentication issue (which
could be anything, including a dead link I guess) and will thus simply
quit at the smallest issue. It then needs the "auth-retry nointeract"
directive to keep automatically reconnecting.

But on any Linux machine the command line version of OpenVPN is going to
be probably used as an unattended client.

So it made no sense to have to "figure this out" on your own. An
enterprise will be able to do so yes.

But why not make it easier...

And even if I were an enterprise, I would still want:

- ease of mind
- sane defaults
- if I make a mistake the earth doesn't explode
- If I forget to configure something it will have a good default
- System is self-contained and doesn't need N amount of monitoring
systems before it starts working
Post by Zdenek Kabelac
In most common scenarios - the user knows when he runs out-of-space - it
will not be a 'pleasant' experience - but the user's data should be safe.
Yes again, apologies, but I was basing myself on Kernel 4.4 in Debian 8
with LVM 2.02.111 which, by now, is three years old hahaha.

Hehe, this is my self-made reporting tool:

Subject: Snapshot linux/root-snap has been umounted

Snapshot linux/root-snap has been unmounted from /srv/root because it
filled up to a 100%.

Log message:

Sep 16 22:37:58 debian lvm[16194]: Unmounting invalid snapshot
linux-root--snap from /srv/root.

Earlier messages:

Sep 16 22:37:52 debian lvm[16194]: Snapshot linux-root--snap is now 97%
full.
Sep 16 22:37:42 debian lvm[16194]: Snapshot linux-root--snap is now 93%
full.
Sep 16 22:37:32 debian lvm[16194]: Snapshot linux-root--snap is now 86%
full.
Sep 16 22:37:22 debian lvm[16194]: Snapshot linux-root--snap is now 82%
full.

Now do we or do we not upgrade to Debian Stretch lol.
Post by Zdenek Kabelac
And then it depends how much energy/time/money user wants to put into
monitoring effort to minimize downtime.
Well yes, but this is exacerbated by, say, this example of OpenVPN having
bad defaults. If you can't figure out why your connection is not
maintained, now you need a monitoring script to automatically restart it.

If something is hard to recover from, now you need a monitoring script to
warn you plenty ahead of time so you can prevent it, etc.

If the monitoring script can fail, now you need a monitoring script to
monitor the monitoring script ;-).

System admins keep busy ;-).
Post by Zdenek Kabelac
As has been said - disk-space is quite cheap.
So if you monitor and insert your new disk-space in time
(enterprise...) you have a smaller set of problems - than if you try to
fight constantly with a 100% full thin-pool...
In that case it's more of a safety measure. But a bit pointless if you
don't intend to keep growing your data collection.

Ie. you could keep an extra disk in your system for this purpose, but
then you can't shrink the thing as you said once it gets used ;-).

That makes it rather pointless to have it as a safety net for a system
that is not meant to expand ;-).
Post by Zdenek Kabelac
You can always use normal device - it's really about the choice and purpose...
Well the point is that I never liked BTRFS.

BTRFS has its own set of complexities and people running around and
tumbling over each other in figuring out how to use the darn thing.
Particularly with regards to the how-to of using subvolumes, of which
there seem to be many different strategies.

And then Red Hat officially deprecates it for the next release. Hmmmmm.

So ZFS has very linux-unlike command set.

Its own universe.

LVM in general is reasonably customer-friendly or user-friendly.
Configuring cache volumes etc. is not that easy but also not that
complicated. Configuring RAID is not very hard compared to mdadm
although it remains a bit annoying to have to remember pretty explicit
commands to manage it.

But rebuilding e.g. RAID 1 sets is pretty easy and automatic.

Sometimes there is annoying stuff like not being able to change a volume
group (name) when a PV is missing, but if you remove the PV how do you
put it back in? And maybe you don't want to... well whatever.

I guess certain things are difficult enough that you would really want a
book about it, and having to figure it out is fun the first time but
after that a chore.

So I am interested in developing "the future" of computing you could
call it.

I believe that using multiple volumes is "more natural" than a single
big partition.

But traditionally the "single big partition" is the only way to get a
flexible arrangement of free space.

So when you move towards multiple (logical) volumes, you lose that
flexibility that you had before.

The only way to solve that is by making those volumes somewhat virtual.

And to have them draw space from the same big pool.

So you end up with thin provisioning. That's all there is to it.
Post by Zdenek Kabelac
Post by Xen
While personally I also like the bigger versus smaller idea because
you don't have to configure it.
I'm still proposing to use different pools for different purposes...
You mean use a different pool for that one critical volume that can't
run out of space.


This goes against the idea of thin in the first place. Now you have to
give up the flexibility that you seek or sought in order to get some
safety because you cannot define any constraints within the existing
system without separating physically.
Post by Zdenek Kabelac
Sometimes spreading the solution across existing logic is way easier
than trying to achieve some super-intelligent universal one...
I get that... building a wall between two houses is easier than having
to learn to live together.

But in the end the walls may also kill you ;-).

Now you can't share washing machine, you can't share vacuum cleaner, you
have to have your own copy of everything, including bath rooms, toilet,
etc.

Even though 90% of the time these things go unused.

So resource sharing is severely limited by walls.

Total cost of services goes up.
Post by Zdenek Kabelac
Post by Xen
But didn't you just say you needed to process up to 16GiB to know this
information?
Of course thin-pool has to be aware how much free space it has.
And this you can somehow imagine as 'hidden' volume with FREE space...
So to give you this 'info' about free blocks in pool - you maintain
very small metadata subset - you don't need to know about all other
volumes...
Right, just a list of blocks that are free.
Post by Zdenek Kabelac
If other volume is releasing or allocation chunks - your 'FREE space'
gets updated....
That's what I meant by mutating the data (summary).
Post by Zdenek Kabelac
It's complex underneath and locking is very performance sensitive -
but for easy understanding you can possibly get the picture out of
this...
I understand, but does this mean that the NUMBER of free blocks is also
always known?

So isn't the NUMBER of used/shared blocks in each DATA volume also
known?
...
But that's about the 'free space'.

What about the 'used space'. Could you, potentially, theoretically, set
a threshold for that? Or poll for that?

I mean the used space of each volume.
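
As far as I can tell a per-volume figure can already be polled from user
space; it is only the exclusive-vs-shared split that needs a walk of the pool
metadata (exactly the expensive part you describe). Roughly, if I remember the
tooling right, and with made-up names:

  # mapped space per thin LV, as a percentage of its virtual size:
  lvs -o lv_name,lv_size,data_percent vg0

  # exclusive vs. shared chunks per thin device, via thin-provisioning-tools,
  # read from a metadata snapshot of the live pool:
  dmsetup message vg0-pool0-tpool 0 reserve_metadata_snap
  thin_ls --metadata-snap /dev/mapper/vg0-pool0_tmeta
  dmsetup message vg0-pool0-tpool 0 release_metadata_snap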
Post by Zdenek Kabelac
Post by Xen
But you could make them unequal ;-).
I cannot ;) - I'm lvm2 coder - dm thin-pool is Joe's/Mike's toy :)
In general - you can come up with many different kernel modules which
take a different approach to the problem.
Worth noting - RH now has Permabit in its portfolio - so there can be
more than one type of thin-provisioning supported in lvm2...
Permabit solution has deduplication, compression, 4K blocks - but no snapshots....
Hmm, sounds too 'enterprise' for me ;-).

In principle it comes down to the same thing... one big pool of storage
and many views onto it.

Deduplication is natural part of that...

Also for backup purposes mostly.

You can have 100 TB worth of backups only using 5 TB.

Without having to primitively hardlink everything.

And maintaining complete trees of every backup open on your
filesystem.... no usage of archive formats...

If the system can hardlink blocks instead of files, that is very
interesting.

Of course snapshots (thin) are also views onto the dataset.

That's the point of sharing.

But sometimes you live in the same house and you want a little room for
yourself ;-).

But in any case...

Of course if you can only change lvm2, maybe nothing of what I said was
ever possible.

But I thought you also spoke of possibilities including the possibility
of changing the device mapper, while saying that what I want is impossible :p.

IF you could change the device mapper, THEN could it be possible to
reserve allocation space for a single volume???

All you have to do is lie to the other volumes when they want to know
how much space is available ;-).

Or something of the kind.

Logically there are only two conditions:

- virtual free space for critical volume is smaller than its reserved
space
- virtual free space for critical volume is bigger than its reserved
space

If bigger, then all the reserved space is necessary to stay free
If smaller, then we don't need as much.

But it probably also doesn't hurt.

So 40GB virtual volume has 5GB free but reserved space is 10GB.

Now real reserved space also becomes 5GB.

So for this system to work you need only very limited data points:

- unallocated extents of virtual 'critical' volumes (1 number for each
'critical' volume)
- total amount of free extents in pool

And you're done.

+ the reserved space for each 'critical volume'.

So say you have 2 critical volumes:

virtual size    reserved space
10GB            500MB
40GB            10GB

Total reserved space is 10.5GB

If the second one has allocated 35GB, it could possibly need only 5GB more, so
the figure changes to

5.5GB reserved space

Now other volumes can't touch that space: when the available free space
in the entire pool becomes <= 5.5GB, allocation fails for non-critical
volumes.

It really requires very limited information.

- free extents for all critical volumes (unallocated as per the virtual
size)
- total amount of free extents in the pool
- max space reservation for each critical volume

And you're done. You now have a working system. This is the only
information the allocator needs to employ this strategy.

No full maps required.

If you have 2 critical volumes, this is a total of 5 numbers.

This is 40 bytes of data at most.
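
Written out as a userspace sketch (this is only my proposal, not anything lvm2
or the dm target implements; the numbers mirror the example above):

  #!/bin/sh
  # effective_reserve = min(configured_reserve, virtual_size - allocated), per critical LV.
  # Deny/alarm for non-critical volumes once pool_free <= sum(effective_reserve).

  pool_free_mb=20480                      # say 20GB still unallocated in the pool

  reserve() { # args: virtual_size_mb allocated_mb configured_reserve_mb
      unalloc=$(( $1 - $2 ))
      [ "$unalloc" -lt "$3" ] && echo "$unalloc" || echo "$3"
  }

  total=0
  total=$(( total + $(reserve 10240  2048   512) ))   # 10GB volume, 500MB reserved
  total=$(( total + $(reserve 40960 35840 10240) ))   # 40GB volume, 35GB used, 10GB reserved
                                                      # -> it can only ever need 5GB more
  echo "reserve threshold: ${total} MiB"              # prints 5632 MiB, i.e. ~5.5GB
  [ "$pool_free_mb" -le "$total" ] && echo "deny new allocations to non-critical volumes"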
Post by Zdenek Kabelac
Post by Xen
The goal was more to protect the other volumes, supposing that log
writing happened on another one, for that other log volume not to
impact the other main volumes.
IMHO best protection is different pool for different thins...
You can more easily decide which pool can 'grow-up'
and which one should rather be taken offline.
Yeah yeah.

But that is like avoiding the problem, so there doesn't need to be a
solution.
Post by Zdenek Kabelac
Motto: keep it simple ;)
The entire idea of thin provisioning is to not keep it simple ;-).

Same goes for LVM.

Otherwise we'd be still using physical partitions.
...
Many things only work if the user follows a certain model of behaviour.

The whole idea of having a "critical" versus a "non-critical" volume is
that you are going to separate the dependencies such that a failure of
the "non-critical" volume will not be "critical" ;-).

So the words themselves predict that anyone employing this strategy will
ensure that the non-critical volumes are not critically depended upon
;-).
Post by Zdenek Kabelac
But do not take this as something to discourage you from trying it - you
may come up with a perfect solution for your particular system - and some
other user may find it useful in some similar pattern...
It's just something that lvm2 can't give support globally.
I think the model is clean enough that you can provide at least a
skeleton script for it...

But that was already suggested you know, so...


If people want different intervention than "fsfreeze" that is perfectly
fine.

Most of the work goes into not deciding the intervention (that is
usually simple) but in writing the logic.

(Where to store the values, etc.).

(Do you use LVM tags, how to use that, do we read some config file
somewhere else, etc.).

Only reason to provide skeleton script with LVM is to lessen the burden
on all those that would like to follow that separation of critical vs.
non-critical.

The big vs. small idea is extension of that.

Of course you don't have to support it in that sense personally.

But logical separation of more critical vs. less critical of course
would require you to also organize your services that way.

If you have e.g. three levels of critical services (A B C) and three
levels of critical volumes (X Y Z) then:

A (most critical)    ->  X
B (intermediate)     ->  X, Y
C (least critical)   ->  X, Y, Z

(volumes: X most critical, Y intermediate, Z least critical)

Service A can only use volume X
Service B can use both X and Y
Service C can use X Y and Z.

This is the logical separation you must make if "critical" is going to
have any value.
Post by Zdenek Kabelac
But lvm2 will give you enough bricks for writing 'smart' scripts...
I hope so.

It is just convenient if certain models are more mainstream or easier
to implement.

Instead of each person having to reinvent the wheel...

But anyway.

I am just saying that the simple thing Sir Jonathan offered would
basically implement the above.

It's not very difficult, just a bit of level-based separation of orders
of importance.

Of course the user (admin) is responsible for ensuring that programs
actually agree with it.
Post by Zdenek Kabelac
Post by Xen
So I don't think the problems of freezing are bigger than the problems
of rebooting.
With 'reboot' you know where you are - it's IMHO fair condition for this.
With a frozen FS and a paralyzed system, your 'fsfreeze' operation on
unimportant volumes has actually even eaten space from the thin-pool
which might possibly have been better used to store data for the important
volumes....
Fsfreeze would not eat more space than was already eaten.

A reboot doesn't change anything about that either.

If you don't freeze it (and neither reboot) the whole idea is that more
space would be eaten than was already.

So not doing anything is not a solution (and without any measures in
place like this, the pool would be full).

So we know why we want reserved space; it was already rapidly being
depleted.
Post by Zdenek Kabelac
and there is even a big danger you will 'freeze' yourself already during the
call of fsfreeze (unless of course you put BIG margins around)
Well I didn't say fsfreeze was the best high level solution anyone could
ever think of.

But I think freezing a less important volume should ... according to the
design principles laid out above... not undermine the rest of the
'critical' system.

That's the whole idea right.

Again not suggesting everyone has to follow that paradigm.

But if you're gonna talk about critical vs. non-critical, the admin has
to pursue that idea throughout the entire system.

If I freeze a volume only used by a webserver... I will only freeze the
webserver... not anything else?
Post by Zdenek Kabelac
Post by Xen
"System is still running but some applications may have crashed. You
will need to unfreeze and restart in order to solve it, or reboot if
necessary. But you can still log into SSH, so maybe you can do it
remotely without a console ;-)".
Your system has run out-of-space, all actions to gain some more space
have failed - going to reboot into some 'recovery' mode
Actions to gain more space in this case only amount to dropping
snapshots, otherwise we are talking about a much more aggressive policy.

So now your system has rebooted and is in a recovery mode. Your system
ran 3 different services. SSH/shell/email/domain etc, webserver and
providing NFS mounts.

Very simple example right.

Your webserver had dedicated 'less critical' volume.

Some web application overflowed, user submitted lots of data, etc.

Web application volume is frozen.

(Or web server has been shut down, same thing here).

- Now you can still SSH, system still receives and sends email
- You can still access filesystems using NFS

Compare to recovery console:

- SSH doesn't work, you need Console
- email isn't received nor sent
- NFS is unavailable
- pings to domain don't work
- other containers go offline too
- entire system is basically offline.

Now for whatever reason you don't have time to solve the problem.

System is offline for a week. Emails are thrown away, not received, you
can't ssh and do other tasks, you may be able to clean the mess but you
can't put the server online (webserver) in case it happens again.

You need time to deal with it but in the meantime entire system was
offline. You have to manually reboot and shut down web application.

But in our proposed solution, the script already did that for you.

So same outcome. Less intervention from you required.

Better to keep the system running partially than not at all?

SSH access is absolute premium in many cases.
Post by Zdenek Kabelac
Post by Xen
So there is no issue with snapshots behaving differently. It's all the
same and all committed data will be safe prior to the fillup and not
change afterward.
Yes - snapshot is 'user-land' language - in the kernel - all thins map chunks...
If you can't map a new chunk - things are going to stop - and start to
error out shortly...
I get it.

We're going to prevent them from mapping new chunks ;-).

Well.

:p.
Xen
7 years ago
Permalink
Post by Xen
But if it's not active, can it still 'trace' another volume? Ie. it
has to get updated if it is really a snapshot of something right.
If it doesn't get updated (and not written to) then it also does not
allocate new extents.
Oh now I get what you mean.

If it's not active it can also in that sense not reserve any extents for
itself.

So the calculations I proposed way below require at least 2 numbers for
each 'critical' volume to be present in the kernel.

Which is the unallocated virtual size and the reserved space.

So even if they're not active they would need to provide this
information somehow.

Of course the information also doesn't change if it's not active, so it
would just be 2 static numbers.

But then what happens if you change the reserved space etc...

In any case that sorta thing would indeed be required...

(For an in-kernel thing...)

(Also any snapshot of a critical volume would not in itself become a
critical volume...)

(But you're saying that "thin snapshot" is not a kernel concept...)
Xen
7 years ago
Permalink
Post by Xen
If it's not active it can also in that sense not reserve any extents
for itself.
But if it's not active I don't see why it should be critical or why you
should reserve space for it to be honest...
Gionatan Danti
7 years ago
Permalink
Post by Xen
But if it's not active I don't see why it should be critical or why
you should reserve space for it to be honest...
Xen, I really think that the combination of hard-threshold obtained by
setting thin_pool_autoextend_threshold and thin_command hook for
user-defined script should be sufficient to prevent and/or react to full
thin pools.

I'm all for the "keep it simple" on the kernel side. After all, thinp
maintains very high performance in spite of its CoW behavior *even when
approaching pool fullness*, a thing which cannot automatically be said
for advanced in-kernel filesystems such as BTRFS (which has very low
random-rewrite performance) and ZFS (I just recently opened a ZoL issue
for ZVOLs with *much* lower than expected write performance, albeit the
workaround/correction was trivial in this case).

That said, I would like to see some pre-defined scripts to easily manage
pool fullness. For example, a script to automatically delete all
inactive snapshots with "deleteme" or "temporary" flag. Sure, writing
such a script is trivial for any sysadmin - but I would really like the
standardisation such predefined scripts imply.
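
Something along these lines is what I have in mind (only a rough sketch,
assuming a user-defined "deleteme"/"temporary" tag and a single VG; hook it
to whatever threshold mechanism you prefer):

  #!/bin/sh
  # Drop every inactive thin snapshot tagged "deleteme" or "temporary" in one VG.
  VG=vg0
  lvs --noheadings --separator ';' -o lv_name,lv_attr,lv_tags,origin "$VG" | sed 's/^ *//' |
  while IFS=';' read -r name attr tags origin; do
      [ -n "$origin" ] || continue                       # only snapshots have an origin
      case "$attr" in ????a*) continue ;; esac           # 5th attr char 'a' = active, skip those
      case ",$tags," in *,deleteme,*|*,temporary,*) ;; *) continue ;; esac
      lvremove -y "$VG/$name" && logger "pool pressure: removed snapshot $VG/$name"
  done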

Regards.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Xen
7 years ago
Permalink
Post by Gionatan Danti
Xen, I really think that the combination of hard-threshold obtained by
setting thin_pool_autoextend_threshold and thin_command hook for
user-defined script should be sufficient to prevent and/or react to
full thin pools.
I will hopefully respond to Zdenek's message later (and the one before
that that I haven't responded to),
Post by Gionatan Danti
I'm all for the "keep it simple" on the kernel side.
But I don't mind if you focus on this,
Post by Gionatan Danti
That said, I would like to see some pre-defined scripts to easily
manage pool fullness. (...) but I would really
like the standardisation such predefined scripts imply.
And only provide scripts instead of kernel features.

Again, the reason I am also focussing on the kernel is because:

a) I am not convinced it cannot be done in the kernel
b) A kernel feature would make space reservation very 'standardized'.

Now I'm not convinced I really do want a kernel feature but saying it
isn't possible I think is false.

The point is that kernel features make it much easier to standardize and
to put some space reservation metric in userland code (it becomes a
default feature) and scripts remain a little bit off to the side.

However if we *can* standardize on some tag or way of _reserving_ this
space, I'm all for it.

I think a 'critical' tag in combination with the standard
autoextend_threshold (or something similar) is too loose and ill-defined
and not very meaningful.

In other words you would be abusing one feature for another purpose.

So I do propose a way to tag volumes with a space reservation (turning
them critical) or alternatively to configure a percentage of reserved
space and then merely tag some volumes as critical volumes.
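
With nothing more than the existing LV tags you could already express that
convention today, e.g. (the tag names are invented for the example):

  lvchange --addtag critical vg0/rootlv        # plain marker tag, nothing lvm2 interprets
  lvchange --addtag reserve_10g vg0/rootlv     # or encode the reservation for a script to parse
  lvs -o lv_name,lv_tags vg0                   # scripts read it back from here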

I just want these scripts to be such that you don't really need to
modify them.

In other words: values configured elsewhere.

If you think that should be the thin_pool_autoextend_threshold, fine,
but I really think it should be configured elsewhere (because you are
not using it for autoextending in this case).

thin_command is run every 5%:

https://www.mankier.com/8/dmeventd

You will need to configure a value to check against.

This is either going to be a single, manually configured, fixed value
(in % or extents)

Or it can be calculated based on reserved space of individual volumes.
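
For completeness, wiring it up is then just one lvm.conf line plus the script
itself; as far as I can tell from that man page it would look roughly like
this (treat the names and the dmeventd details as assumptions and check the
man page for your version):

  # lvm.conf, dmeventd section:
  #   thin_command = "/usr/local/sbin/thin_watch.sh"

  # /usr/local/sbin/thin_watch.sh - skeleton called at each 5% step:
  #!/bin/sh
  pool=vg0/pool0
  used=$(lvs --noheadings -o data_percent "$pool" | tr -d ' ')
  limit=85   # stand-in: this is the value I would derive from per-volume reservations
  if awk -v u="$used" -v l="$limit" 'BEGIN { exit !(u >= l) }'; then
      /usr/local/sbin/freeze_noncritical.sh "$pool"   # hypothetical action script
  fi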

So if you are going to have a kind of "fsfreeze" script based on
critical volumes vs. non-critical volumes I'm just saying it would be
preferable to set the threshold at which to take action in another way
than by using the autoextend_threshold for that.

And I would prefer to set individual space reservation for each volume
even if it can only be compared to 5% threshold values.

So again: if you want to focus on scripts, fine.
Zdenek Kabelac
7 years ago
Permalink
...
Hi

Some more 'light' into the existing state as this is really not about what can
and what cannot be done in kernel - as clearly you can do 'everything' in
kernel - if you have the code for it...

I'm here explaining the position of lvm2 - which is a user-space project (since we
are on the lvm2 list) - and lvm2 is using the 'existing' dm kernel target which
provides thin-provisioning (and has its configurables). So this is the kernel
piece and it differs from its user-space lvm2 counterpart.

Surely there is cooperation between these two - but anyone else can write some
other 'dm' target - and lvm2 can extend support for given target/segment
type if such target is used by users.

In practice your 'proposal' is quite different from the existing target -
essentially major rework if not a whole new re-implementation - as it's not
'a few line' patch extension which you might possibly believe/hope into.

I can explain (and effectively I've already spent a lot of time explaining) the
existing logic and why it is really hardly doable with the current design, but we
cannot work on support for a 'hypothetical' non-existing kernel target from the lvm2
side - so you need to start from the 'ground-zero' level of dm target design....
or you need to 'reevaluate' your vision to be more in touch with the existing
kernel target's output...

However we believe our existing solution in 'user-space' can cover the most common
use-cases, and we might just have 'big holes' in providing better documentation
to explain the reasoning and guide users to use the existing technology in a more
optimal way.
The point is that kernel features make it much easier to standardize and to
put some space reservation metric in userland code (it becomes a default
feature) and scripts remain a little bit off to the side.
Maintenance/devel/support of kernel code is more expensive - it's usually very
easy to upgrade small 'user-space' encapsulated package - compared with major
changes on kernel side.

So that's where the dm/lvm2 design comes from - do the 'minimum necessary' inside
the kernel and maximize usage of user-space.

Of course this decision makes some tasks harder (i.e. there are surely
problems which would not even exist if it would be done in kernel) - but lots
of other things are way easier - you really can't compare those....

Yeah - standards are always a problem :) i.e. Xorg & Wayland....
but it's way better to play with user-space than playing with the kernel....
However if we *can* standardize on some tag or way of _reserving_ this space,
I'm all for it.
Problems of a desktop user with a 0.5TB SSD are often different from those of
servers using 10PB across multiple network-connected nodes.

I see you call for one standard - but it's very very difficult...
I think a 'critical' tag in combination with the standard autoextend_threshold
(or something similar) is too loose and ill-defined and not very meaningful.
We aim to deliver admins rock-solid bricks.

Whether you make a small house or build a Southfork out of them is then the admin's choice.

We have spent a really long time thinking about whether there is some sort of
'one-ring-to-rule-them-all' solution - but we can't see it yet - possibly
because we know a wider range of use-cases compared with an individual
user-focused problem.
And I would prefer to set individual space reservation for each volume even if
it can only be compared to 5% threshold values.
Which needs 'different' kernel target driver (and possibly some way to
kill/split page-cache to work on 'per-device' basis....)

And just as an illustration of problems you need to start solving for this design:

You have origin and 2 snaps.
You set different 'thresholds' for these volumes -
You then overwrite 'origin' and you have to maintain 'data' for OTHER LVs.
So you get into the position - when 'WRITE' to origin will invalidate volume
that is NOT even active (without lvm2 being even aware).
So suddenly rather simple individual thinLV targets will have to maintain
whole 'data set' and cooperate with all other active thins targets in case
they share some data.... - so in effect the WHOLE data tree needs to be
permanently accessible - this could be OK when you focus on using 3 volumes
with at most a couple hundred GiB of addressable space - but it does not 'fit'
well for 1000 LVs and PBs of addressable data.


Regards

Zdenek
Xen
7 years ago
Permalink
Hi,

thank you for your response once more.
Post by Zdenek Kabelac
Hi
Some more 'light' into the existing state as this is really not about
what can and what cannot be done in kernel - as clearly you can do
'everything' in kernel - if you have the code for it...
Well thank you for that ;-).
Post by Zdenek Kabelac
In practice your 'proposal' is quite different from the existing
target - essentially major rework if not a whole new re-implementation
- as it's not 'a few line' patch extension which you might possibly
believe/hope into.
Well I understand that the solution I would be after would require
modification to the DM target. I was not arguing for LVM alone; I
assumed that since DM and LVM are both hosted in the same space there
would be at least the idea of cooperation between the two teams.

And that it would not be too 'radical' to talk about both at the same
time.
Post by Zdenek Kabelac
Of course this decision makes some tasks harder (i.e. there are surely
problems which would not even exist if it would be done in kernel) -
but lots of other things are way easier - you really can't compare
those....
I understand. But many times a lack of integration of the shared goals of
multiple projects is also a big problem in Linux.
Post by Zdenek Kabelac
Post by Xen
However if we *can* standardize on some tag or way of _reserving_ this
space, I'm all for it.
Problems of a desktop user with a 0.5TB SSD are often different from those of
servers using 10PB across multiple network-connected nodes.
I see you call for one standard - but it's very very difficult...
I am pretty sure that if you start out with something simple, it can
extend into the complex.

That's of course why an elementary kernel feature would make sense.

A single number. It does not get simpler than that.

I am not saying you have to.

I was trying to find out if your statement that something was
impossible was actually true.

You said that you need a completely new DM target from the ground up. I
doubt that. But hey, you're the expert, not me.

I like that you say that you could provide an alternative to the regular
DM target and that LVM could work with that too.

Unfortunately I am incapable of doing any development myself at this
time (sounds like fun right) and I also of course could not myself test
20 PB.
...
I think you have to start simple.

You can never come up with a solution if you start out with the complex.

The only thing I ever said was:
- give each volume a number of extents or a percentage of reserved space
if needed
- for all the active volumes in the thin pool, add up these numbers
- when other volumes require allocation, check against free extents in
the pool
- possibly deny allocation for these volumes

I am not saying here you MUST do anything like this.

But as you say, it requires features in the kernel that are not there.

I did not know or did not realize the upgrade paths of the DM module(s)
and LVM2 itself would be so divergent.

So my apologies for that but obviously I was talking about a full-system
solution (not partial).
Post by Zdenek Kabelac
Post by Xen
And I would prefer to set individual space reservation for each volume
even if it can only be compared to 5% threshold values.
Which needs 'different' kernel target driver (and possibly some way to
kill/split page-cache to work on 'per-device' basis....)
No no, here I meant to set it by a script or to read it by a script or
to use it by a script.
Post by Zdenek Kabelac
You have origin and 2 snaps.
You set different 'thresholds' for these volumes -
I would not allow setting threshold for snapshots.

I understand that for dm thin target they are all the same.

But for this model it does not make sense because LVM talks of "origin"
and "snapshots".
Post by Zdenek Kabelac
You then overwrite 'origin' and you have to maintain 'data' for OTHER LVs.
I don't understand. Other LVs == 2 snaps?
Post by Zdenek Kabelac
So you get into the position - when 'WRITE' to origin will invalidate
volume that is NOT even active (without lvm2 being even aware).
I would not allow space reservation for inactive volumes.

Any space reservation is meant for safeguarding the operation of a
machine.

Thus it is meant for active volumes.
Post by Zdenek Kabelac
So suddenly rather simple individual thinLV targets will have to
maintain whole 'data set' and cooperate with all other active thins
targets in case they share some data
I don't know what data sharing has to do with it.

The entire system only works with unallocated extents.
Zdenek Kabelac
7 years ago
Permalink
Post by Xen
Hi,
thank you for your response once more.
Post by Zdenek Kabelac
Hi
Of course this decision makes some tasks harder (i.e. there are surely
problems which would not even exist if it would be done in kernel)  -
but lots of other things are way easier - you really can't compare
those....
I understand. But many times lack of integration of shared goal of multiple
projects is also big problem in Linux.
And you also have projects that do try to integrate shared goals, like btrfs.
Post by Xen
Post by Zdenek Kabelac
Post by Xen
However if we *can* standardize on some tag or way of _reserving_ this
space, I'm all for it.
Problems of a desktop user with 0.5TB SSD are often different with
servers using 10PB across multiple network-connected nodes.
I see you call for one standard - but it's very very difficult...
I am pretty sure that if you start out with something simple, it can extend
into the complex.
We hope the community will provide some individual scripts...
Not a big deal to integrate them into the repo dir...
Post by Xen
Post by Zdenek Kabelac
We have spend really lot of time thinking if there is some sort of
'one-ring-to-rule-them-all' solution - but we can't see it yet -
possibly because we know wider range of use-cases compared with
individual user-focused problem.
I think you have to start simple.
It's mostly about what can be supported 'globally'
and what is rather 'individual' customization.
Post by Xen
You can never come up with a solution if you start out with the complex.
- give each volume a number of extents or a percentage of reserved space if
needed
Which can't be delivered with current thinp technology.
It's simply too computationally invasive for our targeted performance.

The only deliverable we have is - you create a 'cron' job that does the hard
'computing' once in a while - and takes some 'action' when individual
'volumes' go out of their preconfigured boundaries. (Often such logic is
implemented outside of lvm2 - in some DB engine - since lvm2 itself is
really NOT a high-performing DB - the ascii format has its age....)

You can't get this 'percentage' logic online in the kernel (i.e. while you
update an individual volume).
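As a rough illustration of that cron-based approach, here is a minimal
sketch (the vg/pool names, the threshold and the script path are all
hypothetical, chosen only for the example):

# /etc/cron.d/thinpool-check - run a userspace check every 5 minutes (sketch)
*/5 * * * * root /usr/local/sbin/thinpool-check.sh

# /usr/local/sbin/thinpool-check.sh
#!/bin/bash
VG=vg; POOL=pool; LIMIT=80            # hypothetical names and threshold
used=$(lvs --noheadings -o data_percent "$VG/$POOL" | tr -d ' ')
meta=$(lvs --noheadings -o metadata_percent "$VG/$POOL" | tr -d ' ')
if awk -v d="$used" -v m="$meta" -v l="$LIMIT" 'BEGIN { exit !(d >= l || m >= l) }'; then
    logger -t thinpool-check "thin-pool $VG/$POOL at data=${used}% metadata=${meta}%"
    # the 'action' goes here: mail an admin, remove snapshots, lvextend, ...
fi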
Post by Xen
- for all the active volumes in the thin pool, add up these numbers
- when other volumes require allocation, check against free extents in the pool
I assume you possibly missed this logic of thin-p:

When you update the origin - you always allocate a new chunk FOR the origin,
but the old chunk remains claimed by the snapshots (if there are any).

So if a snapshot shared all chunks with the origin at the beginning (so it
basically consumed only some 'metadata' space and 0% real exclusively owned
space) - after a full rewrite of the origin your snapshot suddenly 'holds' all
the old chunks (100% of its size).

So when you 'write' to the ORIGIN - your snapshot becomes bigger in terms of
individually/exclusively owned chunks - so if you have e.g. configured a
snapshot to not consume more than XX% of your pool - you would simply need to
recalc this with every update of shared chunks....

And as has been already said - this is currently unsupportable 'online'

Another aspect here is - the thin-pool has no idea about the 'history' of
volume creation - it does not know that volume X is a snapshot of volume Y -
this all is only 'remembered' by lvm2 metadata - in the kernel it's always
like - volume X owns a set of chunks 1...
That's all the kernel needs to know for a single thin volume to work.

You can do it with 'reasonable' delay in user-space upon 'triggers' of global
threshold (thin-pool fullness).
Post by Xen
- possibly deny allocation for these volumes
Unsupportable in the 'kernel' without a rewrite, and you can e.g. 'work around'
this by placing 'error' targets in place of less important thinLVs...

Imagine you would get pretty random 'denials' of your WRITE request depending
on interaction with other snapshots....


Surely if you use 'read-only' snapshots you may not see all the related
problems, but such a very minor subclass of the whole provisioning solution is
not worth special handling of the whole thin-p target.
Post by Xen
I did not know or did not realize the upgrade paths of the DM module(s) and
LVM2 itself would be so divergent.
lvm2 is a volume manager...

dm is the implementation layer for different 'segtypes' (in lvm2 terminology).

So e.g. anyone can write their own 'volume manager' and use 'dm' - it's fully
supported - dm is not tied to lvm2 and is openly designed (and used by other
projects)....
Post by Xen
So my apologies for that but obviously I was talking about a full-system
solution (not partial).
yep - 2 different worlds....

i.e. crypto, multipath,...
Post by Xen
Post by Zdenek Kabelac
You have origin and 2 snaps.
You set different 'thresholds' for these volumes  -
I would not allow setting threshold for snapshots.
I understand that for dm thin target they are all the same.
But for this model it does not make sense because LVM talks of "origin" and
"snapshots".
Post by Zdenek Kabelac
You then overwrite 'origin'  and you have to maintain 'data' for OTHER LVs.
I don't understand. Other LVs == 2 snaps?
yes - other LVs are snaps in this example...
Post by Xen
Post by Zdenek Kabelac
So you get into the position - when 'WRITE' to origin will invalidate
volume that is NOT even active (without lvm2 being even aware).
I would not allow space reservation for inactive volumes.
You are not 'reserving' any space as the space already IS assigned to those
inactive volumes.

What you would have to implement is to TAKE the space FROM them to satisfy
writing task to your 'active' volume and respect prioritization...

If you do not implement this 'active' chunk 'stealing' - you are really ONLY
shifting the 'hit-the-wall' time-frame.... (worth possibly only a couple of
seconds of your system load)...

In other words - tuning 'thresholds' in a userspace 'bash' script will give
you the very same effect as focusing here on a very complex 'kernel'
solution.


Regards

Zdenek
Xen
7 years ago
Permalink
Zdenek, in the email below you will believe that I am advocating for a max
snapshot size.

I was not.

The only kernel feature I was suggesting was making a judgement about when or
how to refuse allocation of new chunks. Nothing else. Not based on
consumed space, or unique space consumed by volumes or snapshots.

Based only on FREE SPACE metric, not USED SPACE metric (which can be
more complex).

When you say that freezing allocation has same effect as error target,
you could be correct.

I will not respond to individual remarks but will restate the idea below as a
summary:

- call collection of all critical volumes C.

- call C1 and C2 members of C.

- each Ci ∈ C has a number FE(Ci) for the number of free (unallocated)
extents of that volume
- each Ci ∈ C has a fixed number RE(Ci) for the number of reserved
extents.

Observe that FE(Ci) may be smaller than RE(Ci). E.g. a volume may have
1000 reserved extents (RE(Ci)) but 500 free extents (FE(Ci)) at which
point it has more reserved extents than it can use.

Therefore in our calculations we use the smaller of those two numbers
for the effective reserved extents (ERE(Ci)).

ERE(Ci) = min( FE(Ci), RE(Ci) )

Now the total number of effective reserved extents of the pool is the
total effective number of reserved extents of collection C.

ERE(POOL) = ERE(C) = ∑ ERE(Ci)

This number is dependent on the live number of free extents of each
critical volume Ci.

Now the critical inequality that will be evaluated each time a chunk is
requested for allocation is:

ERE(POOL) < FE(POOL)

As long as the Effective Reserved Extents of the entire pool is smaller
than the number of Free Extents in the entire pool, nothing is the
matter.

However, when

ERE(POOL) >= FE(POOL) we enter a critical 'fullness' situation.

This may be likened to a 95% threshold.

At this point you will start 'randomly' denying allocation not only for write
requests to regular volumes and writeable snapshots; regular read-only
snapshots can also see their allocation requests (for CoW) denied.

This would of course immediately invalidate those snapshots, if the
write request was caused by a critical volume (Ci) that is still being
serviced.

If you say this is not much different from replacing the volumes by
error targets, I would agree.

As long as pool fullness persists, this 'denial of service' is not really
random but consistent. However, if something were done to e.g. drop a
snapshot, and space were freed, then writes would continue afterwards.
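To make the proposal concrete, here is a minimal userspace sketch of that
check (it is not a kernel feature, and the "critical" / "reserve_<bytes>" tag
convention is purely hypothetical, invented only for the example):

#!/bin/bash
# Sketch of the "ERE(POOL) < FE(POOL)" check done from userspace with lvs.
VG=vg
POOL=pool

# FE(POOL): free bytes in the thin-pool data device.
pool_size=$(lvs --noheadings --nosuffix --units b -o lv_size "$VG/$POOL" | tr -d ' ')
pool_used=$(lvs --noheadings -o data_percent "$VG/$POOL" | tr -d ' ')
pool_free=$(awk -v s="$pool_size" -v p="$pool_used" 'BEGIN { printf "%.0f", s*(100-p)/100 }')

ere_total=0
while IFS=';' read -r name size used tags; do
    case ",$tags," in *,critical,*) ;; *) continue ;; esac    # only "critical" LVs count
    re=$(grep -o 'reserve_[0-9]*' <<<"$tags" | cut -d_ -f2)   # RE(Ci) from the tag
    [ -z "$re" ] && continue
    fe=$(awk -v s="$size" -v p="$used" 'BEGIN { printf "%.0f", s*(100-p)/100 }')  # FE(Ci)
    ere=$(( fe < re ? fe : re ))                              # ERE(Ci) = min(FE, RE)
    ere_total=$(( ere_total + ere ))
done < <(lvs --noheadings --nosuffix --units b --separator ';' \
             -o lv_name,lv_size,data_percent,lv_tags -S "pool_lv=$POOL" "$VG")

if [ "$ere_total" -ge "$pool_free" ]; then
    echo "WARNING: effective reserves (${ere_total}B) reach pool free space (${pool_free}B)" >&2
    # here one could drop expendable snapshots, extend the pool, alert, ...
fi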
Xen
7 years ago
Permalink
Instead of responding here individually I just sought to clarify in my
other email that I did not intend to mean by "kernel feature" any form
of max snapshot constraint mechanism.

At least nothing that would depend on size of snapshots.
Post by Zdenek Kabelac
And you also have project that do try to integrate shared goals like btrfs.
Without using disjunct components.

So they solve human problem (coordination) with technical solution (no
more component-based design).
Post by Zdenek Kabelac
We hope community will provide some individual scripts...
Not a big deal to integrate them into repo dir...
We were trying to identify common cases so that LVM team can write those
skeletons for us.
Post by Zdenek Kabelac
It's mostly about what can be supported 'globally'
and what is rather 'individual' customization.
There are people who are going to be interested in a common solution even if
it's not everyone all at the same time.
Post by Zdenek Kabelac
Which can't be deliver with current thinp technology.
It's simply too computational invasive for our targeted performance.
You misunderstood my intent.
Post by Zdenek Kabelac
So when you 'write' to ORIGIN - your snapshot which becomes bigger in
terms of individual/exclusively owned chunks - so if you have i.e.
configured snapshot to not consume more then XX% of your pool - you
would simply need to recalc this with every update on shared
chunks....
I knew this. But we do not depend for the calculations on CONSUMED SPACE
(and its character/distribution) but only on FREE SPACE.
Post by Zdenek Kabelac
And as has been already said - this is currently unsupportable 'online'
And unnecessary for the idea I was proposing.

Look, I am just trying to get the idea across correctly.
Post by Zdenek Kabelac
Another aspect here is - thin-pool has no idea about 'history' of
volume creation - it doesn't not know there is volume X being
snapshot of volume Y - this all is only 'remembered' by lvm2 metadata
- in kernel - it's always like - volume X owns set of chunks 1...
That's all kernel needs to know for a single thin volume to work.
I know this.

However you would need LVM2 to make sure that only origin volumes are
marked as critical.
Post by Zdenek Kabelac
Unsupportable in 'kernel' without rewrite and you can i.e.
'workaround' this by placing 'error' targets in place of less
important thinLVs...
I actually think that if I knew how to do multithreading in the kernel,
I could have the solution in place in a day...

If I were in the position to do any such work to begin with... :(.

But you are correct that error target is almost the same thing.
Post by Zdenek Kabelac
Imagine you would get pretty random 'denials' of your WRITE request
depending on interaction with other snapshots....
All non-critical volumes would get write requests denied, including
snapshots (even read-only ones).
Post by Zdenek Kabelac
Surely if use 'read-only' snapshot you may not see all related
problems, but such a very minor subclass of whole provisioning
solution is not worth a special handling of whole thin-p target.
Read-only snapshots would also die en masse ;-).
Post by Zdenek Kabelac
You are not 'reserving' any space as the space already IS assigned to
those inactive volumes.
Space consumed by inactive volumes is calculated into FREE EXTENTS for
the ENTIRE POOL.

We need no other data for the above solution.
Post by Zdenek Kabelac
What you would have to implement is to TAKE the space FROM them to
satisfy writing task to your 'active' volume and respect
prioritization...
Not necessary. Reserved space is a metric, not a real thing.

Reserved space by definition is a part of unallocated space.
Post by Zdenek Kabelac
If you will not implement this 'active' chunk 'stealing' - you are
really ONLY shifting 'hit-the-wall' time-frame.... (worth possibly
couple seconds only of your system load)...
Irrelevant. Of course we are employing a measure at 95% full that will
be like error targets replacing all non-critical volumes.

Of course if total mayhem ensues we will still be in trouble.

The idea is that if this total mayhem originates from non-critical
volumes, the critical ones will be unaffected (apart from their
snapshots).

You could flag snapshots of critical volumes also as critical and then
not reserve any space for them so you would have a combined space
reservation.

Then snapshots for critical volumes would live longer.

Again, no consumption metric required. Only free space metrics.
Post by Zdenek Kabelac
In other words - tuning 'thresholds' in userspace's 'bash' script will
give you very same effect as if you are focusing here on very complex
'kernel' solution.
It's just not very complex.

You thought I wanted a space consumption metric for all volumes including
snapshots and then individual attribution of all consumed space.

Not necessary.

The only thing I proposed used negative space (free space).
Zdenek Kabelac
7 years ago
Permalink
However you would need LVM2 to make sure that only origin volumes are marked
as critical.
The 'dmeventd'-executed binary - which can be a simple bash script called at
the threshold level - can be tuned to various naming logic.

So far there is no plan to enforce 'naming' or 'tagging', since from user-base
feedback we can see numerous ways of dealing with large volume-naming
strategies, often driven by external tools/databases - so enforcing e.g. a
specific tag would require changes in larger systems - which compares poorly
with the rather simple tuning of a bash script...
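For reference, a sketch of how such a script can be hooked in, assuming an
lvm2 build that already has the dmeventd/thin_command setting (the script
path is hypothetical; check lvmthin(7) for the exact call semantics of your
version):

# /etc/lvm/lvm.conf (fragment)
dmeventd {
    # run a site-specific script instead of the default policy when the
    # monitored thin-pool crosses its usage thresholds
    thin_command = "/usr/local/sbin/thin-threshold.sh"
}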
I actually think that if I knew how to do multithreading in the kernel, I
could have the solution in place in a day...
If I were in the position to do any such work to begin with... :(.
But you are correct that error target is almost the same thing.
It's the 'safest' option - it avoids any sort of further possible damage to
the filesystem.

Note - a typical 'fs' may remount 'ro' at a reasonable threshold - the
precise point depends on the workload. If you have 'PB' arrays - surely leaving
5% of free space is rather huge; if you work with GBs on a fast SSD -
taking action at 70% might be better.

If at any time 'during' a write the user hits a 'full pool' - there is
currently no other way than to stop using the FS - there are numerous ways -

You can replace the device with 'error'
You can replace the device with 'delay' that splits reads to thin and writes to error

There is just no way back - the FS should be checked (i.e. a full FS could be
'restored' by deleting some files, but in the full thin-pool case the 'FS'
needs to get consistent first) - so focusing on solving the full-pool case is
like preparing for a lost battle - the focus should go into ensuring you do
not hit a full pool, and on the 'sad' occasion of a 100% full pool - the
worst-case scenario is not all that bad - surely way better than the
4-year-old experience with an old kernel and old lvm2....
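A sketch of the 'replace with error target' option using plain dmsetup (the
device name is an example; this errors all further I/O to that thin LV until
its original table is loaded back):

LV=vg-unimportant                            # dm name of a less important thinLV (example)
SIZE=$(blockdev --getsz /dev/mapper/$LV)     # length in 512-byte sectors
dmsetup suspend "$LV"
dmsetup reload  "$LV" --table "0 $SIZE error"
dmsetup resume  "$LV"                        # every read/write now fails with an I/O error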
Post by Zdenek Kabelac
What you would have to implement is to TAKE the space FROM them to
satisfy writing task to your 'active' volume and respect
prioritization...
Not necessary. Reserved space is a metric, not a real thing.
Reserved space by definition is a part of unallocated space.
How is this different from having a VG with 1TB where you allocate only e.g.
90% for the thin-pool and you keep 10% of free space for 'extension' of the
thin-pool for your 'critical' moment?

I'm still not seeing any difference - except you would need to invest a lot of
energy into handling this 'reserved' space inside the kernel.

With current versions of lvm2 you can handle these tasks in user-space, and
quite early, before you reach a 'real' out-of-space condition.
Post by Zdenek Kabelac
In other words - tuning 'thresholds' in userspace's 'bash' script will
give you very same effect as if you are focusing here on very complex
'kernel' solution.
It's just not very complex.
You thought I wanted space consumption metric for all volumes including
snapshots and then invididual attribution of all consumed space.
Maybe you can try the existing proposed solutions first and show the 'weak'
points which are not solvable by them?

We all agree we cannot store a 10G thin volume in a 1G thin-pool - so there
will always be the case of having a 'full pool'.

Either you handle reserves with an 'early' remount-ro, or you keep some
'spare' LV/space in the VG which you attach to the thin-pool 'when' needed...
Having such a 'great' level of free choice here is IMHO a big advantage, as
it's always the 'admin' who decides how to use the available space in the best
way - instead of keeping 'reserves' somewhere hidden in the kernel....

Regards

Zdenek

Zdenek Kabelac
7 years ago
Permalink
...
if you take into account the other constraints - like the necessity of planning
small chunk sizes for the thin-pool to have reasonably efficient snapshots,
and a not-so-small memory footprint - there are cases where a short-lived
snapshot is simply the better choice.
My root volume is not on thin and thus has an "old-snap" snapshot. If the
snapshot is dropped it is because of lots of upgrades but this is no biggy;
next week the backup will succeed. Normally the root volume barely changes.
And you can really have VERY same behavior WITH thin-snaps.

All you need to do is - 'erase' your inactive thin volume snapshot before
the thin-pool switches to out-of-space mode.

You really have A LOT of time (60 seconds) to do this - even when the
thin-pool hits 100% fullness.

All you need to do is to write your 'specific' maintenance mode that will
'erase' volumes tagged/named with some specific name, so you can easily find
those LVs and 'lvremove' them when the thin-pool is running out of space.

That's the advantage of 'inactive' snapshot.

If you have snapshot 'active' - you need to kill 'holders' (backup software),
umount volume and remove it.

Again - quite reasonably simple task when you know all 'variables'.

Hardly doable at generic level....
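A minimal sketch of such a maintenance script (assuming, purely as an example,
that droppable snapshots were created with --addtag expendable and that the
vg/pool names below exist):

#!/bin/bash
VG=vg; POOL=pool; LIMIT=90      # hypothetical names; act at 90% data usage
used=$(lvs --noheadings -o data_percent "$VG/$POOL" | tr -d ' ')
if awk -v u="$used" -v l="$LIMIT" 'BEGIN { exit !(u >= l) }'; then
    # inactive snapshots need no umount/holder handling - just remove them
    lvs --noheadings --separator ';' -o lv_name,lv_tags -S "pool_lv=$POOL" "$VG" |
    while IFS=';' read -r lv tags; do
        case ",$tags," in *,expendable,*) lvremove -y "$VG/${lv// /}" ;; esac
    done
fi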
So it would be possible to reserve regular LVM space for thin volumes as well
A 'reserve' can't really be 'generic'.
Everyone has a different view on what a 'safe' reserve is.
And you lose a lot of space in unusable reserves...

I.e. think about 2000LV in single thin-pool - and design reserves....
Start to 'think big' instead of focusing on 3 thinLVs...
Post by Zdenek Kabelac
Thin-pool still does not support shrinking - so if the thin-pool
auto-grows to big size - there is not a way for lvm2 to reduce the
thin-pool size...
Ah ;-). A detriment of auto-extend :p.
Yep - that's why we have not enabled 'autoresize' by default.

It's the admin's decision ATM whether the free space in the VG should be used
by the thin-pool or something else.

It would be better if there were shrinking support - but it's not here yet...
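For completeness, the opt-in auto-extension mentioned above is configured
roughly like this (the values are only an example):

# /etc/lvm/lvm.conf (fragment)
activation {
    # when pool data or metadata usage crosses 70%, dmeventd grows the pool
    # by 20% of its current size - as long as the VG still has free extents
    thin_pool_autoextend_threshold = 70
    thin_pool_autoextend_percent = 20
}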
No if you only kept some statistics that would not amount to all the mapping
data but only to a summary of it.
Why should kernel be doing some complex statistic management ?

(Again 'think big' - the kernel is not supposed to be parsing ALL metadata ALL
the time - really - in this case we could 'drop' all the user-space :) and
shift everything to the kernel - and we would end up with a similar complexity
of kernel code to what btrfs has....)
Say if you write a bot that plays a board game. While searching for moves the
bot has to constantly perform moves on the board. It can either create new
board instances out of every move, or just mutate the existing board and be a
lot faster.
Such a bot KNOWS all the combinations.. - you are constantly forgetting that
the thin volume target maps a very small portion of the whole metadata set.
A lot of this information is easier to update than to recalculate, that is,
the moves themselves can modify this summary information, rather than derive
it again from the board positions.
Maybe you should try to write a chess player then - AFAIK it's purely based on
brute CPU power and a massive library of known 'starts' & 'finishes'....

Your simplification proposal 'with summary' seems to be quite innovative here...
This is what I mean by "updating the metadata without having to recalculate it".
What you propose is a very different thin-pool architecture - so you should
try to talk with its authors - I can only provide you with 'lvm2'
abstraction-level details.

I cannot change the kernel level....

The ideal upstreaming mechanism for a new target is to provide at least some
basic implementation proving the concept can work.

And you should also show how this complicated kernel code gives any better
result than the current user-space solution we provide.
You wouldn't have to keep the mapping information in RAM, just the amount of
blocks attributed and so on. A single number. A few single numbers for each
volume and each pool.
It really means - the kernel would need to read ALL the data,
and do ALL the validation in the kernel (which is currently work done in user-space).

Hopefully it's finally clear at this point.
But if it's not active, can it still 'trace' another volume? Ie. it has to get
updated if it is really a snapshot of something right.
Inactive volume CANNOT change - so it doesn't need to be traced.
If it doesn't get updated (and not written to) then it also does not allocate
new extents.
Allocation of new chunks always happens for an active thin LV.
However volumes that see new allocation happening for them, would then always
reside in kernel memory right.
You said somewhere else that overall data (for pool) IS available. But not for
volumes themselves?
Yes - the kernel knows how many 'free' chunks are in the POOL.
The kernel does NOT know how many individual chunks belong to a single thinLV.
Regardless with one volume as "master" I think a non-ambiguous interpretation
arises?
There is no 'master' volume.

All thinLVs are equal - and present only a set of mapped chunks.
Just some of them can be mapped by more than one thinLV...
So is or is not the number of uniquely owned/shared blocks known for each
volume at any one point in time?
Unless you parse all the metadata and create a big data structure for this
info, you do not have this information available.
Post by Zdenek Kabelac
You can use only very small subset of 'metadata' information for
individual volumes.
But I'm still talking about only summary information...
I'm wondering how you would be updating such summary information when all you
have is simple 'fstrim' information.

To update such info - you would need to 'backtrace' ALL the 'released' blocks
for your fstrimmed thin volume - figure out how many OTHER thinLVs (snapshots)
were sharing the same blocks - and update all their summary information.

Effectively you again need pretty complex data processing (which is otherwise
ATM happening at user-space level with the current design) to be shifted into
the kernel.

I'm not saying it cannot be done - surely you can reach the goal (just like
btrfs) - but it's simply a different design, requiring a completely different
kernel target and all the user-land apps to be written.

It's not something we can reach with a few months of coding...
However with the appropriate amount of user friendliness what was first only
for experts can be simply for more ordinary people ;-).
I assume you overestimate how many people work on the project...
We do the best we can...
I mean, kuch kuch, if I want some SSD caching in Microsoft Windows, kuch kuch,
I right click on a volume in Windows Explorer, select properties, select
ReadyBoost tab, click "Reserve complete volume for ReadyBoost", click okay,
and I'm done.
Do you think it's fair to compare us with MS capacity :) ??
It literally takes some 10 seconds to configure SSD caching on such a machine.
Would probably take me some 2 hours in Linux not just to enter the commands
but also to think about how to do it.
It's the open source world...
So it made no sense to have to "figure this out" on your own. An enterprise
will be able to do so yes.
But why not make it easier...
All that needs to happen is - someone sits down and writes the code :)
Nothing else is really needed ;)

Hopefully my time invested into this low-level explanation will motivate
someone to write something for users....
Yes again, apologies, but I was basing myself on Kernel 4.4 in Debian 8 with
LVM 2.02.111 which, by now, is three years old hahaha.
Well, we are at 2.02.174 - so I'm really mainly interested in complaints
against the upstream version of lvm2.

There is not much point in discussing 3-year-old history...
If the monitoring script can fail, now you need a monitoring script to monitor
the monitoring script ;-).
Maybe you start to see why 'reboot' is not such a bad option...
Post by Zdenek Kabelac
You can always use normal device - it's really about the choice and purpose...
Well the point is that I never liked BTRFS.
Do not take this as some 'advocating' for usage of btrfs.

But all you are proposing here is mostly 'btrfs' design.

lvm2/dm is quite different solution with different goals.
BTRFS has its own set of complexities and people running around and tumbling
over each other in figuring out how to use the darn thing. Particularly with
regards to the how-to of using subvolumes, of which there seem to be many
different strategies.
It's been BTRFS 'solution' how to overcome problems...
And then Red Hat officially deprecates it for the next release. Hmmmmm.
Red Hat simply can't do everything for everyone...
Sometimes there is annoying stuff like not being able to change a volume group
(name) when a PV is missing, but if you remove the PV how do you put it back
You may possibly miss the complexity behind those operations.

But we try to keep them at 'reasonable' minimum.

Again please try to 'think' big when you have i.e. hundreds of PVs attached
over network... used in clusters...

There are surely things, which do look over-complicated when you have just 2
disks in your laptop.....

But as it has been said - we address issues on 'generic' level...

You have states - and transitions between states are defined in some way and
apply for system states XYZ....
I guess certain things are difficult enough that you would really want a book
about it, and having to figure it out is fun the first time but after that a
chore.
It would be nice if someone had written a book about it ;)
You mean use a different pool for that one critical volume that can't run out
of space.
This goes against the idea of thin in the first place. Now you have to give up
the flexibility that you seek or sought in order to get some safety because
you cannot define any constraints within the existing system without
separating physically.
Nope - it's still well within.

Imagine you have a VG with 1TB of space.
You create a 0.2TB 'userdata' thin-pool with some thins
and you create a 0.2TB 'criticalsystem' thin-pool with some thins.

Then you orchestrate the growth of those 2 thin-pools according to your rules
and needs - e.g. always keep 0.1TB of free space in the VG to have some space
for the system thin-pool. You may even start to remove the 'userdata'
thin-pool in case you would like to get some space for the 'criticalsystem'
thin-pool.
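A sketch of that layout with stock commands (the names and sizes are just the
ones from the example above):

# one 1TB VG, two independent thin-pools
lvcreate -L 200G -T vg/userdata_pool
lvcreate -L 200G -T vg/criticalsystem_pool
lvcreate -V 400G -T vg/userdata_pool       -n userdata   # over-provisioned
lvcreate -V 150G -T vg/criticalsystem_pool -n system     # fully covered by its pool
# later, grow whichever pool your own policy favours with the remaining VG space
lvextend -L +100G vg/criticalsystem_pool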

There is NO solution to protect you against running out of space when you are
overprovisioning.

It always ends with having a 1TB thin-pool with a 2TB volume on it.

You can't fit 2TB into 1TB, so at some point in time every overprovisioning
setup is going to hit a dead end....
I get that... building a wall between two houses is easier than having to
learn to live together.
But in the end the walls may also kill you ;-).
Now you can't share washing machine, you can't share vacuum cleaner, you have
to have your own copy of everything, including bath rooms, toilet, etc.
Even though 90% of the time these things go unused.
When you share - you need to HEAVILY plan for everything.

There is always some price paid.

In many cases it's better to leave your vacuum cleaner unused for 99% of its
time, just to be sure you can take it ANYTIME you need it....

You may also drop the usage of modern CPUs which are left 99% unused....

So of course it's cheaper to share - but is it comfortable??
Does it pay off??

Your pick....
I understand, but does this mean that the NUMBER of free blocks is also always
known?
Thin-pool knows how many blocks are 'free'.
So isn't the NUMBER of used/shared blocks in each DATA volume also known?
It's not known per volume.

All you know is - the thin-pool has size X and has Y free blocks.
The pool does not know how many thin devices are there - unless you scan the metadata.

All known info is visible with 'dmsetup status'.

The status report exposes all known info for the thin-pool and for thin volumes.

All of it is described in the kernel documentation for these DM targets.
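For example (the pool and volume names are placeholders; the exact status
fields are described in the kernel's thin-provisioning documentation):

# pool-level usage as lvm2 reports it
lvs -o lv_name,data_percent,metadata_percent vg/pool

# raw DM status: the thin-pool line shows used/total metadata blocks and
# used/total data blocks; a thin volume line shows its mapped sectors
dmsetup status vg-pool-tpool
dmsetup status vg-thinvol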
What about the 'used space'. Could you, potentially, theoretically, set a
threshold for that? Or poll for that?
Clearly used_space is 'whole_space - free_space'
IF you could change the device mapper, THEN could it be possible to reserve
allocation space for a single volume???
You probably need to start that discussion on a more kernel-oriented DM list.
- virtual free space for critical volume is smaller than its reserved space
- virtual free space for critical volume is bigger than its reserved space
If bigger, then all the reserved space is necessary to stay free
If smaller, then we don't need as much.
You can implement all this logic with existing lvm2 2.02.174.
Scripting gives you all the power to your hands.
But it probably also doesn't hurt.
So 40GB virtual volume has 5GB free but reserved space is 10GB.
Now real reserved space also becomes 5GB.
Please try to stop thinking within your 'margins' and your 'conditions' -
every user/customer has a different view - sometimes you simply need to
'think big', in TiB or PiB ;)....
Many things only work if the user follows a certain model of behaviour.
The whole idea of having a "critical" versus a "non-critical" volume is that
you are going to separate the dependencies such that a failure of the
"non-critical" volume will not be "critical" ;-).
Already explained few times...
Post by Zdenek Kabelac
With 'reboot' you know where you are -  it's IMHO fair condition for this.
With frozen FS and paralyzed system and your 'fsfreeze' operation of
unimportant volumes actually has even eaten the space from thin-pool
which may possibly been used better to store data for important
volumes....
Fsfreeze would not eat more space than was already eaten.
If you 'fsfreeze' - the filesystem has to be put into a consistent state -
so all unwritten 'data' & 'metadata' in your page-cache have to be pushed to
your disk.

This will cause a very hard-to-'predict' amount of provisioning on your
thin-pool. You can possibly estimate a 'maximum' number....
If I freeze a volume only used by a webserver... I will only freeze the
webserver... not anything else?
A number of system apps are doing scans over the entire system....
Apps are talking to each other and waiting for answers...
Of course, lots of apps would get 'transiently' frozen, because other apps are
not well written for a parallel world...

Again - if you have a set of constraints - like a 'special' volume for the web
server which is ONLY used by the web server - you can make a better decision.

In this case it would likely be better to kill the 'web server' and umount the volume....
We're going to prevent them from mapping new chunks ;-).
You can't prevent the kernel from mapping new chunks....

But you can do it ALL in userspace - though ATM you possibly need to use
'dmsetup' commands....
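One such userspace lever, sketched with dmsetup (the device name is an
example; note that a suspend flushes outstanding I/O first, which may itself
still provision a few chunks, and anything touching the device will block
until the resume):

dmsetup suspend vg-noncritical   # queue all new I/O to this thin LV
# ...free space, extend the pool, or load an 'error' table instead...
dmsetup resume vg-noncritical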

Regards

Zdenek
Xen
7 years ago
Permalink
Just responding to second part of your email.
Post by Zdenek Kabelac
Post by Xen
Only manual intervention this one... and last resort only to prevent
crash so not really useful in general situation?
You have 1G thin-pool
You use 10G of thinLV on top of 1G thin-pool
And you ask for 'sane' behavior ??
Why not? Really.
Post by Zdenek Kabelac
Any idea of having 'reserved' space for 'prioritized' applications and
other crazy ideas leads to nowhere.
It has already existed in Linux filesystems for a long time (the root-user reserve).
Post by Zdenek Kabelac
https://lwn.net/Articles/104185/
That was cute.

But we're not asking aeroplane to keep flying.

We are asking aeroplane to not take down fuelling plane hovering nearby
too.
Post by Zdenek Kabelac
Well yeah - it's not useless to discuss solutions for old releases of lvm2...
This is an insignificant difference. There is no point here for you to
argue this.
Post by Zdenek Kabelac
Lvm2 should be compilable and usable on older distros as well - so
upgrade and do not torture yourself with older lvm2....
I'm not doing anything with my system right now so... upgrading LVM2
would be more torture.
Post by Zdenek Kabelac
And we believe it's fine to solve exceptional case by reboot.
Well it's hard to disagree with that but for me it might take weeks
before I discover the system is offline.

Otherwise most services would probably continue.

So now I need to install remote monitoring that checks the system is
still up and running etc.

If all solutions require more and more and more and more monitoring,
that's not good.
Post by Zdenek Kabelac
Since the effort you would need to put into solve all kernel corner
case is absurdly high compared with the fact 'it's exception' for
normally used and configured and monitored thin-pool....
Well, I take you at your word; it is just not my impression that it would be
*that* hard, but this depends on the design I guess, and you are the arbiter
on that.
Post by Zdenek Kabelac
So don't expect lvm2 team will be solving this - there are more prio work....
Sure, whatever.

Safety is never prio right ;-).

But anyway.
Post by Zdenek Kabelac
Post by Xen
Sure but some level of "room reservation" is only to buy time -- or
really perhaps to make sure main system volume doesn't crash when data
volume fills up by accident.
If the system volume IS that important - don't use it with
over-provisiong!
System-volume is not overprovisioned.

Just something else running in the system....

That will crash the ENTIRE SYSTEM when it fills up.

Even if it was not used by ANY APPLICATION WHATSOEVER!!!
Post by Zdenek Kabelac
The answer is that simple.
But you misunderstand that I was talking about a system volume that was
not a thin volume.
Post by Zdenek Kabelac
You can user different thin-pool for your system LV where you can maintain
snapshot without over-provisioning.
My system LV is not even ON a thin pool.
Post by Zdenek Kabelac
It's a way more practical solution than trying to fix the OOM problem :)
Aye but in that case no one can tell you to ensure you have
auto-expandable memory ;-) ;-) ;-) :p :p :p.
Post by Zdenek Kabelac
Post by Xen
Yes email monitoring would be most important I think for most people.
Put mail messaging into plugin script then.
Or use any monitoring software for messages in syslog - this worked
pretty well 20 years back - and hopefully still works well :)
Yeah I guess but I do not have all this knowledge myself about all these
different kinds of softwares and how they work, I hoped that thin LVM
would work for me without excessive need for knowledge of many different
kinds.
Post by Zdenek Kabelac
Post by Xen
Aye but does design have to be complete failure when condition runs out?
YES
:(.
Post by Zdenek Kabelac
Post by Xen
I am just asking whether or not there is a clear design limitation
that would ever prevent safety in operation when 100% full (by
accident).
Don't user over-provisioning in case you don't want to see failure.
That's no answer to that question.
Post by Zdenek Kabelac
It's the same as you should not overcommit your RAM in case you do not
want to see OOM....
But with RAM I'm sure you can typically see how much you have and can
thus take account of that; a filesystem will report the wrong figure ;-).
Post by Zdenek Kabelac
Post by Xen
I still think theoretically solution would be easy if you wanted it.
My best advice - please you should try to write it - so you would see
more in depth how yours 'theoretical solution' meets with reality....
Well more useful to ask people who know.
--
Highly Evolved Beings do not consider it “profitable” if they benefit at
the expense of another.
Zdenek Kabelac
7 years ago
Permalink
Post by Xen
Just responding to second part of your email.
Post by Xen
Only manual intervention this one... and last resort only to prevent crash
so not really useful in general situation?
You have  1G thin-pool
You use 10G of thinLV on top of 1G thin-pool
And you ask for 'sane' behavior ??
Why not? Really.
Because all filesystems put on top of thinLV do believe all blocks on the
device actually exist....
Post by Xen
Any idea of having 'reserved' space for 'prioritized' applications and
other crazy ideas leads to nowhere.
It already exists in Linux filesystems since long time (root user).
Did I say you can't compare a filesystem problem with a block-level problem?
If not ;) let's repeat - being out of space in a single filesystem
is a completely different fairy-tale from an out-of-space thin-pool.
Post by Xen
https://lwn.net/Articles/104185/
That was cute.
But we're not asking aeroplane to keep flying.
IMHO you just don't yet see the parallelism....
Post by Xen
And we believe it's fine to solve exceptional case  by reboot.
Well it's hard to disagree with that but for me it might take weeks before I
discover the system is offline.
IMHO it's problem of proper monitoring.

Still the same song here - you should actively try to avoid the car collision,
since trying to resurrect an often seriously injured or even dead passenger
from a demolished car is usually a very complex job with an unpredictable result...

We do put in a number of 'car-protection' safety mechanisms - so the newer the
tools and kernel, the better - but still, when you hit the wall at top speed
you can't expect to just 'walk out' easily... and it's way cheaper to solve
the problem in a way where you will NOT crash at all..
Post by Xen
Otherwise most services would probably continue.
So now I need to install remote monitoring that checks the system is still up
and running etc.
Of course you do.

thin-pool needs attention/care :)
Post by Xen
If all solutions require more and more and more and more monitoring, that's
not good.
It's the best we can provide....
Post by Xen
So don't expect lvm2 team will be solving this - there are more prio work....
Sure, whatever.
Safety is never prio right ;-).
We are safe enough (IMHO) to NOT lose committed data,
but we cannot guarantee a stable system - it's too complex.
lvm2/dm can't be fixing extX/btrfs/XFS and other kernel-related issues...
Bold men can step in - and fix it....
Post by Xen
If the system volume IS that important - don't use it with over-provisiong!
System-volume is not overprovisioned.
If you have enough blocks in thin-pool to cover all needed block for all
thinLV attached to it - you are not overprovisioning.
Post by Xen
Just something else running in the system....
Use different pools ;)
(i.e. 10G system + 3 snaps needs 40G of data size & appropriate metadata size
to be safe from overprovisioning)
Post by Xen
That will crash the ENTIRE SYSTEM when it fills up.
Even if it was not used by ANY APPLICATION WHATSOEVER!!!
A full thin-pool on a recent kernel is certainly NOT randomly crashing the
entire system :)

If you think that's the case - provide a full trace of the crashed kernel and
open a BZ - just be sure you use upstream Linux...
Post by Xen
My system LV is not even ON a thin pool.
Again - if you reproduce on kernel 4.13 - open BZ and provide reproducer.
If you use older kernel - take a recent one and reproduce.

If you can't reproduce - problem has been already fixed.
It's then for your kernel provider to either back-port fix
or give you fixed newer kernel - nothing really for lvm2...
Post by Xen
It's a way more practical solution than trying to fix the OOM problem :)
Aye but in that case no one can tell you to ensure you have auto-expandable
memory ;-) ;-) ;-) :p :p :p.
I'd probably recommend reading some books about how memory is mapped onto a
block device and what all the constraints and related problems are..
Post by Xen
Post by Xen
Yes email monitoring would be most important I think for most people.
Put mail messaging into  plugin script then.
Or use any monitoring software for messages in syslog - this worked
pretty well 20 years back - and hopefully still works well :)
Yeah I guess but I do not have all this knowledge myself about all these
different kinds of softwares and how they work, I hoped that thin LVM would
work for me without excessive need for knowledge of many different kinds.
We do provide some 'generic' scripts - unfortunately - every use-case is
basically a pretty different set of rules and constraints.

So the best we have is 'auto-extension'.
We used to try to umount - but this has possibly added more problems than
it has actually solved...
Post by Xen
Post by Xen
I am just asking whether or not there is a clear design limitation that
would ever prevent safety in operation when 100% full (by accident).
Don't user over-provisioning in case you don't want to see failure.
That's no answer to that question.
There is a lot of technical complexity behind it.....

I'd say the main part is - the 'fs' would need to be able to know/understand
it's living on a provisioned device (something we actually do not want,
as you can change the 'state' at runtime - so the 'fs' should be aware & unaware
at the same time ;) - checking with every request that thin-provisioning
is in place would impact performance, and doing it at mount time makes it
bad as well.

Then you need to deal with the fact that writes to a filesystem are
'process'-aware, while writes to a block device are some anonymous page writes
from your page cache.
Have I said yet that the level of problems for a single filesystem is a
totally different story?

So in a simple statement - thin-p has its limits - if you are unhappy with
them, then you probably need to look for some other solution - or start
sending patches and improve things around...
Post by Xen
It's the same as you should not overcommit your RAM in case you do not
want to see OOM....
But with RAM I'm sure you can typically see how much you have and can thus
take account of that, filesystem will report wrong figure ;-).
Unfortunately you cannot....

The amount of your free RAM is a very fictional number ;) and you run into
much bigger problems if you start overcommitting memory in the kernel....

You can't compare your user-space failing malloc with OOM crashing Firefox....

A block device runs in-kernel - and as root...
There are no reserves; all you know is you need to write block XY,
you have no idea what the block is about..
(That's where ZFS/Btrfs were supposed to excel - they KNOW.... :)

Regard

Zdenek
Eric Ren
7 years ago
Permalink
Hi Zdenek,

On 09/11/2017 09:11 PM, Zdenek Kabelac wrote:

[..snip...]
Post by Zdenek Kabelac
So don't expect lvm2 team will be solving this - there are more prio work....
Sorry for interrupting your discussion. But I just cannot help asking:

It's not the first time I have seen "there is more prio work". So I'm
wondering: can upstream
consider making these high-priority work items available on the homepage [1]
or a trello board [2]?

I really hope upstream can do so. Thus,

1. Users can see what changes are likely to happen for lvm.

2. It helps developers reach agreement on what problems/features should be
high priority and avoids overlapping efforts.

I know all the core developers are working for Red Hat. But I guess you
experts will also be happy
to see real contributions from other engineers. For me, some big
issues in LVM that I can see by now are:

- lvmetad slows down activation a lot if there are many PVs on the system
(say 256 PVs; it takes >10s to pvscan
in my testing).
- pvmove is slow. I know it's not the fault of LVM. The time is almost all
spent in DM (the IO dispatch/copy).
- snapshots cannot be used in a cluster environment. There is a use case: the
user has a central backup system
running on one node. They want to make snapshots of, and back up, some LUNs
attached to other nodes, on this
backup-system node.

If our upstream had a place to put and discuss what the prio work is,
I think it would encourage me to do
more contributions - because I'm not 100% sure whether something is a real
issue and whether it's work that upstream hopes
to see; every engineer wants their work to be accepted by upstream :)
I can try to go forward and do meaningful
work (research, testing...) as far as I can, if you experts can confirm
that "that's a real problem. Go ahead!".

[1] https://sourceware.org/lvm2/
[2] https://trello.com/

Regards,
Eric
Zdenek Kabelac
7 years ago
Permalink
Post by Eric Ren
Hi Zdenek,
[..snip...]
Post by Zdenek Kabelac
So don't expect lvm2 team will be solving this - there are more prio work....
can upstream
consider to have these high priority works available on homepage [1] or trello
tool [1]?
I really hope upstream can do so. Thus,
1. Users can expect what changes will likely happen for lvm.
2. It helps developer reach agreements on what problems/features should be on
high
priority and avoid overlap efforts.
lvm2 is using upstream community BZ located here:

https://bugzilla.redhat.com/enter_bug.cgi?product=LVM%20and%20device-mapper

You can check RHBZ easily for all lvm2 BZs
(it mixes RHEL/Fedora/Upstream).

We usually want to have an upstream BZ linked with the Community BZ,
but sometimes it's driven through other channels - not ideal - but still
easily searchable.
Post by Eric Ren
- lvmetad slows down activation much if there are a lot of PVs on system (say
256 PVs, it takes >10s to pvscan
in my testing).
It should be the opposite case - unless something regressed recently...
The easiest thing is to write some test for the lvm2 test suite.

And eventually bisect which commit broke it...
Post by Eric Ren
- pvmove is slow. I know it's not fault of LVM. The time is almost spent in DM
(the IO dispatch/copy).
Yeah - this is more or less a design issue inside the kernel - there are
some workarounds - but since the primary motivation was not to overload the
system - it's been left asleep a bit - since the focus moved to the 'raid'
target while these pvmove fixes work with the old dm mirror target...
(i.e. try to use a bigger region_size for mirror in lvm.conf (over 512K)
and evaluate the performance - there is something wrong - but the core mirror
developer is busy with raid features ATM....)
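If someone wants to experiment with that, it would presumably look like the
following lvm.conf fragment (assuming the activation/mirror_region_size option
of that lvm2 generation, value in KiB; newer releases renamed it to
raid_region_size):

# /etc/lvm/lvm.conf (fragment)
activation {
    mirror_region_size = 2048    # default is 512; try larger values as suggested
}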
Post by Eric Ren
- snapshot cannot be used in cluster environment. There is a usecase: user has
a central backup system
Well, a snapshot CANNOT work in a cluster.
What you can do is to split off the snapshot and attach it to a different
volume, but exclusive access is simply required - there is no synchronization
of changes like with cmirrord for the old mirror....
Post by Eric Ren
If our upstream have a place to put and discuss what the prio works are, I
think it will encourage me to do
more contributions - because I'm not 100% sure if it's a real issue and if
You are always welcome to open Community BZ (instead of trello/github/.... )
Provide justification, present patches.

Of course I cannot hide :) that RH has some sort of influence on which bugs
are more important than others...
Post by Eric Ren
it's a work that upstream hopes
to see, every engineer wants their work to be accepted by upstream :) I can
try to go forward to do meaningful
work (research, testing...) as far as I can, if you experts can confirm that
"that's a real problem. Go ahead!".
We do our best....

Regards

Zdenek
Eric Ren
7 years ago
Permalink
Hi Zdenek,
Post by Zdenek Kabelac
https://bugzilla.redhat.com/enter_bug.cgi?product=LVM%20and%20device-mapper
You can check RHBZ easily for all lvm2 bZ
(mixes RHEL/Fedora/Upstream)
We usually want to have upstream BZ being linked with Community BZ,
but sometimes it's driven through other channel - not ideal - but
still easily search-able.
Yes, it's a place where problems are discussed. Thanks for your reminder :)
Post by Zdenek Kabelac
[...snip...]
It's should be opposite case - unless something regressed recently...
Easiest is to write out lvm2 test suite some test.
And eventually bisect which commit broke it...
Good to know! I will find time to test different versions on both
openSUSE and Fedora.
Post by Zdenek Kabelac
Post by Eric Ren
- pvmove is slow. I know it's not fault of LVM. The time is almost
spent in DM (the IO dispatch/copy).
Yeah - this is more or less design issue inside kernel - there are
some workarounds - but since primary motivation was not to overload
system - it's been left a sleep a bit - since focus gained 'raid' target
Aha, it's a good reason. Ideally, it would be good for pvmove having
some option to control
the IO rate. I know it's not easy...
Post by Zdenek Kabelac
and these pvmove fixes are working with old dm mirror target...
(i.e. try to use bigger region_size for mirror in lvm.conf (over 512K)
and evaluate performance - there is something wrong - but core mirror
developer is busy with raid features ATM....
Thanks for the suggestion.
Post by Zdenek Kabelac
Post by Eric Ren
user has a central backup system
Well, snapshot CANNOT work in cluster.
What you can do is to split snapshot and attach it different volume,
but exclusive assess is simply required - there is no synchronization
of changes like with cmirrord for old mirror....
Got it! Advanced features like snapshot/thinp/dmcache have their own
metadata. The price of making those metadata
changes cluster-aware is painful.
Post by Zdenek Kabelac
We do our best....
Like you guys have been always doing, thanks!

Regards,
Eric
David Teigland
7 years ago
Permalink
Post by Eric Ren
can upstream
consider to have these high priority works available on homepage [1] or
trello tool [1]?
Hi Eric, this is a good question. The lvm project has done a poor job at
this sort of thing. A new homepage has been in the works for a long time,
but I think stalled in the review/feedback stage. It should be unblocked
soon. I agree we should figure something out for communicating about
ongoing or future work (I don't think bugzilla is the answer.)
Eric Ren
7 years ago
Permalink
Hi David and Zdenek,

On 09/12/2017 01:41 AM, David Teigland wrote:

[...snip...]
Post by David Teigland
Hi Eric, this is a good question. The lvm project has done a poor job at
this sort of thing. A new homepage has been in the works for a long time,
but I think stalled in the review/feedback stage. It should be unblocked
soon. I agree we should figure something out for communicating about
ongoing or future work (I don't think bugzilla is the answer.)
Good news! Thanks very much for you both giving such kind replies :)

Regards,
Eric
David Teigland
7 years ago
Permalink
Post by Xen
Aye but does design have to be complete failure when condition runs out?
YES
I am not satisfied with the way thin pools fail when space is exhausted,
and we aim to do better. Our goal should be that the system behaves at
least no worse than a file system reaching 100% usage on a normal LV.
Zdenek Kabelac
7 years ago
Permalink
Post by David Teigland
Post by Xen
Aye but does design have to be complete failure when condition runs out?
YES
I am not satisfied with the way thin pools fail when space is exhausted,
and we aim to do better. Our goal should be that the system behaves at
least no worse than a file system reaching 100% usage on a normal LV.
We can't reach this goal anytime soon - unless we fix all those filesystems....

And there are other metrics - you can make it way more 'safe' for exhausted
space at the price of massively slowing down and serializing all writes...

I doubt we would find many users that would easily accept massive slowdown of
their system just because thin-pool can run out of space....

Global anonymous page-cache is really a hard thing for resolving...

But when you start to limit your usage of the thin-pool with some constraints,
you can get a much better behaved system.

i.e. using 'ext4' for mounted 'data' LV should be relatively safe...

And again if you see actual kernel crash OOPS - this is of course a real
kernel bug for fixing...

Regards

Zdenek
Gionatan Danti
7 years ago
Permalink
Post by Zdenek Kabelac
The first question here is - why do you want to use thin-provisioning ?
Because classic LVM snapshot behavior (slow write speed and linear
performance decrease as the snapshot count increases) makes them useful for
nightly backups only.

On the other hand, thinp's very fast CoW behavior means very usable
and frequent snapshots (which are very useful to recover from user
errors).
Post by Zdenek Kabelac
As thin-provisioning is about 'promising the space you can deliver
later when needed' - it's not about hidden magic to make the space
out-of-nowhere.
I fully agree. In fact, I was asking about how to reserve space to
*protect* critical thin volumes from "liberal" resource use by less
important volumes. Fully-allocated thin volumes sound very interesting -
even if I think this is a performance optimization rather than a "safety
measure".
Post by Zdenek Kabelac
The idea of planning to operate thin-pool on 100% fullness boundary is
simply not going to work well - it's not been designed for that
use-case - so if that's been your plan - you will need to seek for
other solution.
(Unless you seek for those 100% provisioned devices)
I do *not* want to run at 100% data usage. Actually, I want to avoid it
entirely by setting a reserved space which cannot be used for things such as
snapshots. In other words, I would very much like to see a snapshot fail
rather than its volume becoming unavailable *and* corrupted.

Let me de-tour by using ZFS as an example (don't bash me for doing
that!)

In ZFS terms, there are objects called ZVOLs - ZFS volumes/block devices,
which can either be "fully-preallocated" or "sparse".

By default, they are "fully-preallocated": their entire nominal space is
reserved and subtracted from the ZPOOL total capacity. Please note that
this does *not* mean that the space is really allocated on the ZPOOL,
rather that the nominal space is accounted against other ZFS datasets/volumes
when creating new objects. A filesystem sitting on top of such a ZVOL
will never run out of space; rather, if the remaining capacity is not
enough to guarantee this constraint, new volume/snapshot creation is
forbidden.

Example:
# 1 GB ZPOOL
[***@blackhole ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH
ALTROOT
tank 1008M 456K 1008M - 0% 0% 1.00x ONLINE -

# Creating a 600 MB ZVOL (note the different USED vs REFER values)
[***@blackhole ~]# zfs create -V 600M tank/vol1
[***@blackhole ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 621M 259M 96K /tank
tank/vol1 621M 880M 56K -

# Snapshot creation - please see that, as REFER is very low (I wrote
nothing to the volume), snapshot creation is allowed
[***@blackhole ~]# zfs snapshot tank/***@snap1
[***@blackhole ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 621M 259M 96K /tank
tank/vol1 621M 880M 56K -
tank/***@snap1 0B - 56K -

# Let's write something to the volume (note how REFER is higher than the free,
unreserved space)
[***@blackhole ~]# zfs destroy tank/***@snap1
[***@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=1M
count=500 oflag=direct
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 12.7038 s, 41.3 MB/s
[***@blackhole ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 622M 258M 96K /tank
tank/vol1 621M 378M 501M -

# Snapshot creation now FAILS!
[***@blackhole ~]# zfs snapshot tank/***@snap1
cannot create snapshot 'tank/***@snap1': out of space
[***@blackhole ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 622M 258M 96K /tank
tank/vol1 621M 378M 501M -

The above surely is safe behavior: when free, unused space is too low to
guarantee the reserved space, snapshot creation is disallowed.

On the other side, using the "-s" option you can create a "sparse" ZVOL
- a volume whose nominal space is *not* accounted/subtracted from the
total ZPOOL capacity. Such a volume has warnings similar to those of thin
volumes. From the man page:

'Though not recommended, a "sparse volume" (also known as "thin
provisioning") can be created by specifying the -s option to the zfs
create -V command, or by changing the reservation after the volume has
been created. A "sparse volume" is a volume where the reservation is
less then the volume size. Consequently, writes to a sparse volume can
fail with ENOSPC when the pool is low on space. For a sparse volume,
changes to volsize are not reflected in the reservation.'

The only real difference vs a fully preallocated volume is the property
carrying the reserved space expectation. I can even switch at run-time
between a fully preallocated vs sparse volume by simply changing the
right property. Indeed, a very important thing to understand is that
this property can be set to *any value* between 0 ("none") and max
volume (nominal) size.

On a 600M fully preallocated volumes:
[***@blackhole ~]# zfs get refreservation tank/vol1
NAME PROPERTY VALUE SOURCE
tank/vol1 refreservation 621M local

On a 600M sparse volume:
[***@blackhole ~]# zfs get refreservation tank/vol1
NAME PROPERTY VALUE SOURCE
tank/vol1 refreservation none local

Now, a sparse (refreservation=none) volume *can* be snapshotted even if
very little free space is available in the ZPOOL:

# The very same command that previously failed, now completes
successfully
[***@blackhole ~]# zfs snapshot tank/***@snap1
[***@blackhole ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 502M 378M 96K /tank
tank/vol1 501M 378M 501M -
tank/***@snap1 0B - 501M -

# Using a non-zero, but lower-than-nominal threshold
(refreservation=100M) allows the snapshot to be taken:
[***@blackhole ~]# zfs set refreservation=100M tank/vol1
[***@blackhole ~]# zfs snapshot tank/***@snap1
[***@blackhole ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 602M 278M 96K /tank
tank/vol1 601M 378M 501M -
tank/***@snap1 0B - 501M -

# If free space drops under the lower-but-not-zero reservation
(refreservation=100M), snapshot again fails:
[***@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=1M
count=300 oflag=direct
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 4.85282 s, 64.8 MB/s
[***@blackhole ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 804M 76.3M 96K /tank
tank/vol1 802M 76.3M 501M -
tank/***@snap1 301M - 501M -
[***@blackhole ~]# zfs snapshot tank/***@snap2
cannot create snapshot 'tank/***@snap2': out of space

OK - now back to the original question: why can reserved space be
useful? Consider the following two scenarios:

A) You want to use snapshots efficiently and *never* encounter an
unexpectedly full ZPOOL. Your main constraint is to use at most 50% of
the available space for your "critical" ZVOL. With such a setup, any
"excessive" snapshot/volume creation will surely fail, but the main ZVOL
will be unaffected;

B) You want to somewhat overprovision (taking into account worst-case
snapshot behavior), but with a *large* operating margin. In this case,
you can create a sparse volume with a lower (but non-zero) reservation.
Any snapshot/volume creation attempted once this margin is crossed will
fail. You surely need to clean up some space (e.g. delete older
snapshots), but you avoid the runaway effect of new snapshots being
continuously created, consuming additional space.

Now let's leave the ZFS world and get back to thinp: it would be
*really* cool to provide the same sort of functionality. Sure, you would
have to track space usage both at the pool and the volume level - but the
safety increase would be massive. There is a big difference between a
corrupted main volume and a failed snapshot: while the latter can be
resolved without too much concern, the former (volume corruption) really
is a scary thing.

Don't misunderstand me, Zdenek: I *REALLY* appreciate you core
developers for the outstanding work on LVM. This is especially true in
the light of BTRFS's problems, and with Stratis (which is heavily based
on thinp) becoming the next big thing. I even more appreciate that you
are on the mailing list, replying to your users.

Thin volumes are really cool (and fast!), but they can fail deadly. A
fail-safe approach (ie: no new snapshot allowed) is much more desirable.

Thanks.
Post by Zdenek Kabelac
Regards
Zdenek
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
7 years ago
Permalink
Post by Zdenek Kabelac
The first question here is - why do you want to use thin-provisioning ?
Because classic LVM snapshot behavior (slow write speed and linear performance
decrease as snapshot count increases) makes them useful for nightly backups only.
On the other hand, thinp's very fast CoW behavior means very usable and
frequent snapshots (which are very useful for recovering from user errors).
There is a very good reason why thinLV is fast - when you work with a
thinLV you work only with the data set of that single thin LV.

So when you write to a thinLV, you either modify an existing exclusively
owned chunk or you duplicate it and provision a new one. A single thinLV
does not care about other thin volumes - this is very important to keep
in mind, and it's what allows reasonable performance, memory and CPU
resource usage.
Post by Zdenek Kabelac
As thin-provisioning is about 'promising the space you can deliver
later when needed'  - it's not about hidden magic to make the space
out-of-nowhere.
I fully agree. In fact, I was asking about how to reserve space to *protect*
critical thin volumes from "liberal" resource use by less important volumes.
I think you need to think 'wider'.

You do not need to use a single thin-pool - you can have numerous
thin-pools, and for each one you can maintain separate thresholds (for
now with your own scripting - but doable with today's lvm2).

Why would you want to place a 'critical' volume into the same pool
as some non-critical one ??

It's simply way easier to have critical volumes in a different thin-pool,
where you might not even use over-provisioning.
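
For example, a minimal sketch of such a split (sizes and names are purely
illustrative):

# one pool per criticality class
lvcreate --type thin-pool -L 100G -n pool_critical vg
lvcreate --type thin-pool -L 200G -n pool_misc vg
# critical volume: virtual size kept <= pool size, so it can never be starved
lvcreate -V 90G --thinpool pool_critical -n vm_critical vg
# everything else (and its snapshots) competes only inside pool_misc
lvcreate -V 500G --thinpool pool_misc -n vm_scratch vg
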
I do *not* want to run at 100% data usage. Actually, I want to avoid it
entirely by setting a reserved space which cannot be used for things as
snapshot. In other words, I would very like to see a snapshot to fail rather
than its volume becoming unavailable *and* corrupted.
Seems to me - everyone here looks for a solution where the thin-pool is
used until the very last chunk in the thin-pool is allocated - then some
magical AI steps in, smartly decides which 'other already allocated
chunk' can be trashed (possibly the one with minimal impact :)) - and the
whole thing will continue to run at full speed ;)

Sad/bad news here - it's not going to work this way....
In ZFS words, there are object called ZVOLs - ZFS volumes/block devices, which
can either be "fully-preallocated" or "sparse".
By default, they are "fully-preallocated": their entire nominal space is
reseved and subtracted from the ZPOOL total capacity. Please note that this
Fully-preallocated - sounds like thin-pool without overprovisioning to me...
# Snapshot creating - please see that, as REFER is very low (I did write
nothig on the volume), snapshot creating is allowed
lvm2 also DOES protect you from the creation of a new thin LV when the
thin-pool fullness is above the lvm.conf-defined threshold - so nothing really new here...
oflag=direct
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 12.7038 s, 41.3 MB/s
NAME        USED  AVAIL  REFER  MOUNTPOINT
tank        622M   258M    96K  /tank
tank/vol1   621M   378M   501M  -
# Snapshot creation now FAILS!
ZFS is a filesystem.

So let's repeat again :) the amount of problems inside a single
filesystem is not comparable with the block-device layer - it's an
entirely different world of problems.

You can't really expect filesystem 'smartness' at the block layer.

That's the reason why we can see all those developers boldly stepping into the
'dark waters' of mixed filesystem & block layers.

lvm2/dm trusts in a different concept - it's possibly less efficient,
but possibly way more secure - where you have distinct layers,
and each layer can be replaced and is maintained separately.
The above surely is safe behavior: when free, unused space is too low to
guarantee the reserved space, snapshot creation is disallowed.
ATM the thin-pool cannot somehow auto-magically 'drop' snapshots on its own.

And that's the reason why we have those monitoring features provided with
dmeventd, where you monitor the occupancy of the thin-pool and, when the
fullness goes above a defined threshold, some 'action' needs to happen.

It's really up to the admin to decide whether it's more important to make
some free space for an existing user writing his 10th copy of a 16GB
movie :) or to erase some snapshot holding important company work ;)

Just don't expect some magical AI built into the thin-pool to make
such a decision :)

The user already has ALL the power to do this work - the main condition
here is that this happens much earlier than your thin-pool gets exhausted!

It's really pointless trying to solve this issue after you are already
out of space...
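
As a minimal user-space sketch of such a threshold action (the pool name
and the chosen reaction are only placeholders):

#!/bin/sh
POOL=vg/pool
# data_percent may be printed with ',' or '.' depending on locale
PCT=$(lvs --noheadings -o data_percent "$POOL" | tr -d ' ' | cut -d, -f1 | cut -d. -f1)
if [ "${PCT:-0}" -ge 85 ]; then
    logger "thin-pool $POOL at ${PCT}% - extend the pool or drop a snapshot now"
    # e.g. lvextend -L+10G "$POOL", or lvremove -y vg/some_old_snapshot
fi
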
Now leave ZWORLD, and back to thinp: it would be *really* cool to provide the
same sort of functionality. Sure, you had to track space usage both at pool
and a volume level - but the safety increase would be massive. There is a big
difference between a corrupted main volume and a failed snapshot: while the
latter can be resolved without too much concern, the former (volume
corruption) really is a scary thing.
AFAIK current kernel (4.13) with thinp & ext4 used with remount-ro on error
and lvm2 is safe to use in case of emergency - so surely you can lose some
uncommited data but after reboot and some extra free space made in thin-pool
you should have consistent filesystem without any damage after fsck.

There are not known simple bugs in this case - like system crashing on dm
related OOPS (like Xen seems to suggest... - we need to see his bug report...)

However - when thin-pool gets full - the reboot and filesystem check is
basically mandatory - there is no support (and no plan to start support
randomly dropping allocated chunks from other thin-volumes to make space for
your running one)
Thin volumes are really cool (and fast!), but they can fail deadly. A
I'd still like to see what you think is 'deadly'

And also I'd like it explained what better a thin-pool can do at the
block device layer.

As said in the past - if you would modify the filesystem to start
reallocating its metadata and data to provisioned space - so the FS would
be AWARE which blocks are provisioned or uniquely owned... and start
working with a 'provisioned' volume differently - that would be a very
different story - it essentially means you would need to write quite a
new filesystem, since neither extX nor xfs is really a perfect match....

So all I'm saying here is - the 'thin-pool' at the block layer is 'mostly'
doing its best to avoid losing the user's committed! data - but of course
if the 'admin' has failed to fulfill his promise and add more space to an
overprovisioned thin-pool, something not-nice will happen to the system -
and there is no way the thin-pool on its own can resolve it - it should
have been resolved much, much sooner with monitoring via dmeventd - that's
the place you should focus on: implementing a smart way to keep your
system from going ballistic....


Regards

Zdenek
Gionatan Danti
7 years ago
Permalink
Post by Zdenek Kabelac
There is very good reason why thinLV is fast - when you work with thinLV -
you work only with data-set for single thin LV.
So you write to thinLV and either you modify existing exclusively owned chunk
or you duplicate and provision new one. Single thinLV does not care about
other thin volume - this is very important to think about and it's
important for reasonable performance and memory and cpu resources usage.
Sure, I grasp that.
Post by Zdenek Kabelac
I think you need to think 'wider'.
You do not need to use a single thin-pool - you can have numerous thin-pools,
and for each one you can maintain separate thresholds (for now in your own
scripting - but doable with today's lvm2)
Why would you want to place 'critical' volume into the same pool
as some non-critical one ??
It's simply way easier to have critical volumes in different thin-pool
where you might not even use over-provisioning.
I need to take a step back: my main use for thinp is as virtual machine
backing store. Due to some limitations in libvirt and virt-manager, which
basically do not recognize thin pools, I cannot use multiple thin pools
or volumes.

Rather, I had to use a single, big thin volume with XFS on top.
Post by Zdenek Kabelac
Seems to me - everyone here looks for a solution where thin-pool is used
till the very last chunk in thin-pool is allocated - then some magical
AI step in,
decides smartly which 'other already allocated chunk' can be trashed
(possibly the one with minimal impact :)) - and whole think will continue
run in full speed ;)
Sad/bad news here - it's not going to work this way....
No, I absolutely *do not want* thinp to automatically deallocate/trash
some provisioned blocks. Rather, I am all for something like "if free
space is lower than 30%, disable new snapshot *creation*"
Post by Zdenek Kabelac
lvm2 also DOES protect you from creation of new thin-pool when the fullness
is about lvm.conf defined threshold - so nothing really new here...
Maybe I am missing something: does this threshold apply to new thin pools
or to new snapshots within a single pool? I was really asking about the latter.
...
In the examples above, I did not use any ZFS filesystem layer. I used
ZFS as a volume manager, with the intent to place an XFS filesystem on top
of ZVOL block volumes.

The ZFS man page clearly warns about ENOSPC with sparse volumes. My point
is that, by clever use of the refreservation property, I can engineer
a setup where snapshots are generally allowed, unless free space is under
a certain threshold. In that case, they are not allowed (but never
automatically deleted!).
Post by Zdenek Kabelac
lvm2/dm trusts in different concept - it's possibly less efficient,
but possibly way more secure - where you have different layers,
and each layer could be replaced and is maintained separately.
And I really trust layer separation - it is for this very reason that I am
a big fan of thinp, but its failure behavior somewhat scares me.
Post by Zdenek Kabelac
ATM thin-pool cannot somehow auto-magically 'drop' snapshots on its own.
Let me repeat: I do *not* want thinp to automatically drop anything. I
simply want it to disallow new snapshot/volume creation when unallocated
space is too low.
Post by Zdenek Kabelac
And that's the reason why we have those monitoring features provided
with dmeventd. Where you monitor occupancy of thin-pool and when the
fullness goes above defined threshold - some 'action' needs to happen.
And I really thank you for that - this is a big step forward.
...
Committed (fsynced) writes are safe, and this is very good. However,
*many* applications do not properly issue fsync(); this is a fact of life.

I absolutely *do not expect* thinp to automatically cope well with these
applications - I fully understand & agree that applications *must* issue
proper fsyncs.

However, recognizing that the real world is quite different from my
ideals, I want to rule out as many problems as possible: for this reason,
I really want to prevent full thin pools even in the face of failed
monitoring (or somnolent sysadmins).

In the past, I observed that XFS takes a relatively long time to
recognize that a thin volume is unavailable - and many async writes can
be lost in the process. Ext4 + data=journal did a better job, but a) ext4
is not the default filesystem in RH anymore and b) data=journal is not
the default option and has its share of problems.

Complex systems need to be monitored - true. And I do that; in fact, I
have *two* monitoring systems in place (Zabbix and a custom shell-based
one). However, having been bitten by a failed Zabbix agent in the past, I
learned a good lesson: design systems where some types of problems simply
cannot happen.

So, if in the face of a near-full pool thinp refuses to let me create a
new volume/snapshot, I would be happy :)
Post by Zdenek Kabelac
And also I'd like to be explained what better thin-pool can do in terms
of block device layer.
Thinp is doing a great job, and nobody wants to deny that.

Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
7 years ago
Permalink
Post by Zdenek Kabelac
There is very good reason why thinLV is fast - when you work with thinLV -
you work only with data-set for single thin LV.
Sad/bad news here - it's not going to work this way....
No, I absolutely *do not want* thinp to automatically dallocate/trash some
provisioned blocks. Rather, I all for something as "if free space is lower
than 30%, disable new snapshot *creation*"
# lvs -a
  LV              VG Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  [lvol0_pmspare] vg ewi-------  2,00m
  lvol1           vg Vwi-a-tz-- 20,00m pool        40,00
  pool            vg twi-aotz-- 10,00m             80,00  1,95
  [pool_tdata]    vg Twi-ao---- 10,00m
  [pool_tmeta]    vg ewi-ao----  2,00m

[***@linux export]# lvcreate -V10 vg/pool
Using default stripesize 64,00 KiB.
Reducing requested stripe size 64,00 KiB to maximum, physical extent size
32,00 KiB.
Cannot create new thin volume, free space in thin pool vg/pool reached
threshold.

# lvcreate -s vg/lvol1
Using default stripesize 64,00 KiB.
Reducing requested stripe size 64,00 KiB to maximum, physical extent size
32,00 KiB.
Cannot create new thin volume, free space in thin pool vg/pool reached
threshold.

# grep thin_pool_autoextend_threshold /etc/lvm/lvm.conf
# Configuration option activation/thin_pool_autoextend_threshold.
# thin_pool_autoextend_threshold = 70
thin_pool_autoextend_threshold = 70

So as you can see - lvm2 clearly prohibits you from creating a new thinLV
when you are above the defined threshold.


To keep things simple for the user - we have a single threshold value.


So what else is missing ?
Post by Zdenek Kabelac
lvm2 also DOES protect you from creation of new thin-pool when the fullness
is about lvm.conf defined threshold - so nothing really new here...
Maybe I am missing something: this threshold is about new thin pools or new
snapshots within a single pool? I was really speaking about the latter.
Yes - the threshold applies to 'extension' as well as to the creation of a
new thinLV (and a snapshot is just a new thinLV).
Let me repeat: I do *not* want thinp to automatically drop anything. I simply
what it to disallow new snapshot/volume creation when unallocated space is too
low
as said - already implemented....

Committed (fsynced) writes are safe, and this is very good. However, *many*
application do not properly issue fsync(); this is a fact of life.
I absolutely *do not expect* thinp to automatically cope well with this
applications - I full understand & agree that application *must* issue proper
fsyncs.
Unfortunately neither lvm2 nor dm can be responsible for the whole kernel
logic and all user-land apps...


Yes - the anonymous page cache is somewhat of an Achilles' heel - but it's
not a problem of the thin-pool - all other 'provisioning' systems have
some troubles....

So we really cannot fix it here.

You would need to prove that a different strategy is better and fix the
linux kernel for this.

Until that moment - you need to use well-written user-land apps :) that
properly sync written data - or not use thin-provisioning (and the like).

You can also minimize the amount of 'dirty' pages to avoid losing too much
data in case you hit a full thin-pool unexpectedly.....

You can sync every second to minimize the amount of dirty pages....

Lots of things.... all of them will in one way or the other impact system
performance....
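
For reference, the dirty-page tuning referred to above lives in sysctl;
the values here are only illustrative, not recommendations:

sysctl -w vm.dirty_background_bytes=67108864    # 64 MiB: start background writeback early
sysctl -w vm.dirty_bytes=268435456              # 256 MiB: hard cap on dirty data
sysctl -w vm.dirty_expire_centisecs=1000        # consider pages old after 10s
sysctl -w vm.dirty_writeback_centisecs=100      # wake the flusher every 1s
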
In the past, I testified that XFS take its relatively long time to recognize
that a thin volume is unavailable - and many async writes can be lost in the
process. Ext4 + data=journaled did a better job, but a) it is not the default
filesystem in RH anymore and b) data=journaled is not the default option and
has its share of problems.
data=journal is very 'secure' - but also very slow....

So it depends on what you aim for.

But this really cannot be solved on the DM side...
So, if in the face of a near-full pool, thinp refuse me to create a new
filesystem, I would be happy :)
So you are already happy, right :) ?
Your wish has been upstream for quite some time already ;)

Regards

Zdenek
Xen
7 years ago
Permalink
Post by Zdenek Kabelac
Unfortunatelly lvm2 nor dm can be responsible for whole kernel logic and
all user-land apps...
What Gionatan also means, or at least what I mean here, is:

if functioning is a chain, then every link can be the weakest link.

Sometimes you can build in a little redundancy so that other weak
links do not break so easily, or so that your part can cover for them.

Linux has had a mindset of reducing redundancy lately. So every bug
anywhere can break the entire thing.

An example was something I advocated for myself.

I installed GRUB2 inside the PV's reserved space.

That means the 2nd sector had the PV, the 1st sector had an MBR-like boot
sector.

libblkid stopped at the MBR and did not recognise the PV.

Now, because udev required libblkid to recognise PVs, the PV was not
recognised and not activated.

Problem.

The weakest link in this case: libblkid.

Earlier, vgchange -ay worked flawlessly (and had some redundancy) but was
no longer used.

So you can see how small things can break the entire system. Not good
design.

A firmware RAID signature at the end of a drive also breaks the system.

Not good design.
Post by Zdenek Kabelac
You can also minimize amount of 'dirty' pages to avoid loosing too much data
in case you hit full thin-pool unexpectedly.....
Torvalds advocated this.
Post by Zdenek Kabelac
You can sync every second to minimize amount of dirty pages....
Lots of things.... all of them will in some other the other impact
system performance....
He said nobody would be hurt by such a measure except people who
wanted to unpack and compile a kernel purely in page buffers ;-).
p***@yahoo.com
7 years ago
Permalink
Is this the same Xen? Because that was an actually intelligent and logical response. But as was the case a year ago, there is no data or state sharing between the fs and the LVM block layer, so what you want is not possible and will never see the light of day. Use ZFS or btrfs or something else. Full stop.
Zdenek Kabelac
7 years ago
Permalink
...
This bug was reported (by me, even, to the libblkid maintainer) AND has
already been fixed in the past....

Yes - surprise, software has bugs...

But to defend the libblkid maintainer's side a bit :) - this feature was
not really well documented on the lvm2 side...
Post by Xen
You can sync every second to minimize amount of dirty pages....
Lots of things....  all of them will in some other the other impact
system performance....
He said no people would be hurt by such a measure except people who wanted to
unpack and compile kernel pure in page buffers ;-).
So clearly you need to spend resources effectively and support both groups...
Sometimes it is better to use large RAM (common laptops have 32G of RAM nowadays).
Sometimes it is better to have more 'data' securely and permanently stored...


Regards

Zdenek
Xen
7 years ago
Permalink
Post by Zdenek Kabelac
This bug has been reported (by me even to libblkid maintainer) AND
already fixed already in past....
I was the one who reported it.

This was Karel Zak's message from 30 august 2016:

"On Fri, Aug 19, 2016 at 01:14:29PM +0200, Karel Zak wrote:
On Thu, Aug 18, 2016 at 10:39:30PM +0200, Xen wrote:
Would someone be will to fix the issue that a Physical Volume from LVM2
(PV)
when placed directly on disk (no partitions or partition tables) will
not be

This is very unusual setup, but according to feedback from LVM guys
it's supported, so I will improve blkid to support it too.

Fixed in the git tree (for the next v2.29). Thanks.

Karel"

So yes, I knew what I was talking about.

At least slightly ;-).

:p.
Post by Zdenek Kabelac
But to defend a bit libblkid maintainer side :) - this feature was not
really well documented from lvm2 side...
That's fine.
Post by Zdenek Kabelac
Post by Xen
Post by Zdenek Kabelac
You can sync every second to minimize amount of dirty pages....
Lots of things....  all of them will in some other the other impact
system performance....
He said no people would be hurt by such a measure except people who
wanted to unpack and compile kernel pure in page buffers ;-).
So clearly you need to spend resources effectively and support both groups...
Sometimes is better to use large RAM (common laptops have 32G of RAM nowadays)
Yes, and he said those people wanting to compile the kernel purely in
memory (without using a RAM disk for it) have issues anyway...

;-).

So no, it is not that clear that you need to support both groups.
Certainly not by default.

Or at least not in the default configuration of some dirty-page
flag ;-).
Gionatan Danti
7 years ago
Permalink
On 12/09/2017 14:03, Zdenek Kabelac wrote: > # lvs -a
...
Hi Zdenek,
this is very good news (for me at least). Thank you very much for
pointing that out to me!

Anyway, I cannot find the relevant configuration variable in lvm.conf.
I am on 2.02.166(2)-RHEL7; should I use a newer LVM version to set this
threshold?
Post by Zdenek Kabelac
To keep things single for a user - we have a single threshold value.
So what else is missing ?
This is a very good step, indeed. However, multiple thresholds (maybe
attached to/counted against individual thin volumes, in a manner similar
to how refreservation works for ZVOLs) would be even better (in my use
case, at least).
Post by Zdenek Kabelac
Unfortunatelly lvm2 nor dm can be responsible for whole kernel logic and
all user-land apps...
Again, I am *not* saying, nor asking, that.

I would simply like to use thinp without fearing that a "forgotten"
snapshot fills up the thin pool. I have shown how this can easily be
achieved with ZVOLs and careful use/setting of the refreservation value,
without any upper-layer knowledge and/or inter-layer communication.
Post by Zdenek Kabelac
So you are already happy right :) ?
Sure! :)
Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
matthew patton
7 years ago
Permalink
Post by Gionatan Danti
Let me de-tour by using ZFS as an example
with the obvious caveat that in ZFS the block layer and the file layers are VERY tightly coupled. LVM and the block layer see eye-to-eye, but ext4 et al. have absolutely (almost?) no clue what's going on beneath them, and thus LVM is making (false) guarantees that the filesystem is relying upon to actually be true.

IMO Thin-Pool is like waving around a lit welding torch - it's incredibly useful to do certain tasks but you can easily burn yourself and the building down if you don't handle it properly.
Gionatan Danti
7 years ago
Permalink
Post by matthew patton
with the obvious caveat that in ZFS the block layer and the file
layers are VERY tightly coupled. LVM and the block layer see
eye-to-eye but ext4 et. al. have absolutely (almost?) no clue what's
going on beneath it and thus LVM is making (false) guarantees that the
filesystem is relying upon to actually be true.
Sure, but in the previous examples I did *not* use the ZFS filesystem
part; rather, I used it as a logical volume manager to carve out block
devices to be used by other, traditional filesystems.

The entire discussion stems from the idea of letting thinp reserve some
space to avoid a full pool, by denying new snapshot and volume creation
when a free-space threshold is crossed.

Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
matthew patton
7 years ago
Permalink
Post by Gionatan Danti
I need to take a step back: my main use for thinp is virtual machine
backing store
...
Post by Gionatan Danti
Rather, I had to use a single, big thin volumes with XFS on top.
...
Post by Gionatan Danti
I used ZFS as volume manager, with the intent to place an XFS filesystem on top
Good grief, you had integration (ZFS) and then you broke it. ZFS as block or as filesystem is just semantics. While you're at it, dig into libvirt and see if you can fix its silliness.
Post by Gionatan Danti
provisioned blocks. Rather, I all for something as "if free space is lower than 30%, disable new snapshot *creation*"
Say you allowed a snapshot to be created when it was 31%. And 100 milliseconds later two more asked for a snapshot and succeeded. But 2 seconds later just one of your snapshot writers decided to write until it ran off the end of the available space. What have you gained?
Post by Gionatan Danti
is that, by cleaver using of the refreservation property, I can engineer
You're not being nearly clever enough. You're using the wrong set of tools and making unsupported assumptions about future writes.
Post by Gionatan Danti
Committed (fsynced) writes are safe
FSync'd where? Inside your client VM? The hell they're safe. Your hypervisor is under no obligation to honor a write request issued to XFS as if it's synchronous.
Is XFS at the hypervisor being mounted 'sync'? That's not nearly enough though. Can you also prove that there is a direct 1:1 mapping between the client VM's aggregate of fsync-inspired blocks and general writes being de-staged at the same time it gets handed off to the hypervisor's XFS, with the same atomicity? And furthermore, when your client VM's kernel ACKs the fsync, it is saying so without having any idea whether the write actually made it. It *thought* it had done all it was supposed to do. Now the user software as well as the VM kernel are being actively misled!

You're going about this completely wrong.

You have to push the "did my write actually succeed or not and how do I recover" to inside the client VM. Your client VM either gets issued a block device that is iSCSI (can be same host) or 'bare metal' LVM on the hypervisor. That's the ONLY way to make sure the I/O's don't get jumbled and errors map exactly. Otherwise for application scribble, the client VM mounts an NFS share that can be thinLV+XFS at the fileserver. Or buy a proper enterprise storage array (they are dirt-cheap used, off maint) where people far smarter than you have solved this problem decades ago.
Post by Gionatan Danti
really want to prevent full thin pools even in the face of failed
And yet you have demonstrated no ability to do so. Or at least have a very naive notion of what happens when multiple, simultaneous actors are involved. It sounds like some of your preferred toolset is letting you down. Roll up your sleeves and fix it. Why you give a damn about what filesystem is 'default' in any particular distribution is beyond me. Use the combination that actually works - not "if only this or that were changed it could/might work."
Post by Gionatan Danti
to design system where some types of problems can not simply happen.
And yet you persist in using the dumbest combo available: thin + xfs. No offense to LVM Thin, it works great WHEN used correctly. To channel Apple, "you're holding it wrong".
Gionatan Danti
7 years ago
Permalink
Hi,
Post by matthew patton
Post by Gionatan Danti
I need to take a step back: my main use for thinp is virtual machine
backing store
...
Post by Gionatan Danti
Rather, I had to use a single, big thin volumes with XFS on top.
...
Post by Gionatan Danti
I used ZFS as volume manager, with the intent to place an XFS filesystem on top
Good grief, you had integration (ZFS) and then you broke it. The ZFS as block or as filesystem is just symantics.
I did it for a compelling reason - to use DRBD for realtime replication.
Moreover, this is the *expected* use for ZVOLs.

While you're at it dig into libvirt and see if you can fix it's silliness.

This simply cannot be done by a single person in a reasonable time, so I
had to find another solution for now...
Post by matthew patton
Say you allowed a snapshot to be created when it was 31%. And 100 milliseconds later you had 2 more all ask for a snapshot and they succeeded. But 2 seconds later just one of your snapshot writers decided to write till it ran off the end of available space. What have you gained?
With the refreservation property we can *avoid* such a situation. Please
re-read my bash examples in the previous email.
Post by matthew patton
FSync'd where? Inside your client VM? The hell they're safe. Your hypervisor is under no obligation to honor a write request issued to XFS as if it's synchronous.
Wrong: Qemu/KVM *does* honor write barriers, unless you use
"cache=unsafe". Other behaviors should be treated as bugs.
...
Again: this is not how Qemu/KVM treats write barriers on the guest
side. Really. You can check the qemu/libvirt mailing lists for that.
Bottom line: guest fsynced writes *are absolutely safe.* I even tested
this in my lab by pulling the plug *tens of times* during heavy IO.
Post by matthew patton
And yet you have demonstrated no ability to do so. Or at least have a very naive notion of what happens when multiple, simultaneous actors are involved. It sounds like some of your preferred toolset is letting you down. Roll up your sleeves and fix it. Why you give a damn about what filesystem is 'default' in any particular distribution is beyond me. Use the combination that actually works - not "if only this or that were changed it could/might work."
The default combination is automatically the most tested one. This will
really pay off when you face some unexpected bug/behavior.
Post by matthew patton
And yet you persist on using the dumbest combo available: thin + xfs. No offense to LVM Thin, it works great WHEN used correctly. To channel Apple, "you're holding it wrong".
This is what Red Hat is heavily supporting. I see nothing wrong with thin
+ XFS, and both thinp and XFS developers confirm that.

Again: maybe I am missing something?
Thanks.
--
--
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
7 years ago
Permalink
Hi,
The default combination is automatically the most tested one. This will really
pay off when you face some unexptected bug/behavior
Post by matthew patton
And yet you persist on using the dumbest combo available: thin + xfs. No
offense to LVM Thin, it works great WHEN used correctly. To channel Apple,
"you're holding it wrong".
This is what RedHat is heavily supporting. I see nothing wrong with thin +
XFS, and both thinp and XFS developers confirm that.
Again: maybe I am missing something?
There are maybe a few worthy comments - XFS is great on standard big
volumes, but there used to be some hidden details when it was used on
thinly provisioned volumes on older RHEL (7.0, 7.1).

So it now depends on how old a distro you use (I'd highly recommend
upgrading to RHEL 7.4 if you are on a RHEL-based distro).

Basically 'XFS' does not have a 'remount-ro on error' behavior similar to
what 'extX' provides - but XFS now knows how to shut itself down when
meta/data updates start to fail - although you may need to tune some
'sysfs' params to get the 'ideal' behavior.
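
For reference, on kernels that expose these knobs the tuning looks roughly
like this (the dm-3 device name is only an example):

# make metadata writeback give up on ENOSPC instead of retrying forever
echo 0 > /sys/fs/xfs/dm-3/error/metadata/ENOSPC/max_retries
# make pending metadata errors fail at unmount instead of hanging it
echo 1 > /sys/fs/xfs/dm-3/error/fail_at_unmount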

Personally, for smaller-sized thin volumes I'd prefer 'ext4' over XFS -
unless you demand some specific XFS feature...

Regards


Zdenek
Gionatan Danti
7 years ago
Permalink
Post by Zdenek Kabelac
There are maybe few worthy comments - XFS is great on stanadar big
volumes, but there used to be some hidden details when used on thinly
provisioned volumes on older RHEL (7.0, 7.1)
So now it depend how old distro you use (I'd probably highly recommend
upgrade to RH7.4 if you are on RHEL based distro)
Sure.
Post by Zdenek Kabelac
Basically 'XFS' does not have similar 'remount-ro' on error behavior
which 'extX' provides - but now XFS knows how to shutdown itself when
meta/data updates starts to fail - although you may need to tune some
'sysfs' params to get 'ideal' behavior.
True, with a catch: with the default data=ordered option, even ext4 does
*not* remount read-only when data writeout fails. You need to use both
"errors=remount-ro" and "data=journal", which basically nobody does.
Post by Zdenek Kabelac
Personally for smaller sized thin volumes I'd prefer 'ext4' over XFS -
unless you demand some specific XFS feature...
Thanks for the input. So, do you run your ext4 filesystems with
data=journal? How do they behave performance-wise?

Regards.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
7 years ago
Permalink
...
As said, data=journal is a big performance killer (especially on SSD).

Personally I prefer an early 'shutdown' in case the situation becomes
critical (i.e. 95% fullness because some process goes crazy).

But you can write any advanced scripting logic that best suits your needs -

i.e. replace all thins in the thin-pool with the 'error' target....
(which is as simple as using 'dmsetup remove --force'.... - this will
make all future reads/writes give you I/O errors....)

Simply do it all in user-space early enough, before the thin-pool can ever
get NEAR to being 100% full - the reaction is really quick - and you have
at least 60 seconds to solve the problem in the worst case.....
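
A minimal sketch of that 'last resort' action (the volume name is only
illustrative):

umount /mnt/thinlv1 2>/dev/null      # best effort; --force works even if this fails
dmsetup remove --force vg-thinlv1    # table is replaced by the 'error' target,
                                     # so any further I/O returns errors immediately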



Regards

Zdenek
Gionatan Danti
7 years ago
Permalink
Post by Gionatan Danti
Wrong: Qemu/KVM *does* honors write barrier, unless you use
"cache=unsafe".
seems the default is now 'unsafe'.
http://libvirt.org/formatdomain.html#elementsDisks
The default is cache=none
Only if the barrier frame gets passed along by KVM and only if you're
running in "directsync" (or perhaps 'none?') mode there is no
guarantee any of it hit the platter. Let's assume a hypervisor I/O is
ahead of VM 'A's barrier frame, and blows up the thinLV. Then yes it's
possible to propagate the ENOSPACE or other error back to the VM 'A'
to realize that write didn't actually succeed. The rapid failure
cascade across resident VMs is not going to end well either.
fsynced writes that hit a full pool return EIO to the upper layer
...
KVM is the most valid GPL hypervisor, and libvirt is the virtualization
library of choice.
But I cannot fix/implement thin pool/volume management alone.
You're a web-hosting company and you're trying to duck the laws of
economics and the reality of running a business where other often
clueless people trust you to keep their data intact?
Please, don't elaborate on things you don't know.
I asked a specific question on the linux-lvm list, and (as always) I
learnt something. I don't see any problem in doing that.
...
Again, please don't speak about things you don't know.
I am *not* interested in thin provisioning itself at all; on the other
hand, I find CoW and fast snapshots very useful.
Computing/Storage is not a ponzi scheme.
Thanks for reminding me of that.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
7 years ago
Permalink
Post by Gionatan Danti
Again, please don't speak about things you don't know.
I am *not* interested in thin provisioning itself at all; on the other side, I
find CoW and fast snapshots very useful.
I'm not going to comment on KVM storage architecture - but with this
statement you have VERY simple usage:


Just minimize the chance of overprovisioning -

let's go by example:

you have 10 10GiB volumes and you have 20 snapshots...


to not overprovision - you need 10 GiB * 30 LVs = a 300GiB thin-pool.

If that sounds like too much,

you can go with 150 GiB - to always 100% cover all 'base' volumes
and have some room for snapshots.


Now the fun begins - while monitoring is running -
you get callbacks at 50%, 55%... 95%, 100%,
and at each moment you can take whatever action you need.


So assume 100GiB is the bare minimum for the base volumes - you ignore any
state below 66% occupancy of the thin-pool and you start solving problems
at 85% (~128GiB) - you know some snapshot had better be dropped.
You may try 'harder' actions at higher percentages.
(You need to consider how many dirty pages you leave floating around your
system, and other variables.)

Also, you pick with some logic the snapshot you want to drop -
maybe the oldest ?
(see airplane :) URL link)....

Anyway - at this moment you still have plenty of time to solve it
without any danger of losing a write operation...
All you can lose is some 'snapshot' which might otherwise have been kept a
bit longer... but that is supposedly fine with your model workflow...
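
One way to pick that 'oldest snapshot' (selection/sort fields as in recent
lvm2; the VG name is only illustrative):

OLDEST=$(lvs --noheadings -o lv_name -S 'origin!=""' --sort lv_time vg | head -n1 | tr -d ' ')
[ -n "$OLDEST" ] && lvremove -y "vg/$OLDEST"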

Of course you get into serious problems if you try to keep all these
demo volumes within 50GiB with massive overprovisioning ;)

There you have a much harder time deciding what should happen, what should
be removed, and whether it is better to STOP everything and let the admin
decide the ideal next step....

Regards

Zdenek
Gionatan Danti
7 years ago
Permalink
...
Hi Zdenek,
I fully agree with what you said above, and I sincerely thank you for
taking the time to reply.
However, I am not sure I understand *why* reserving space for a thin
volume seems a bad idea to you.

Let's take a 100 GB thin pool, wanting to *never* run out of space in
spite of taking multiple snapshots.
To achieve that, I need to a) carefully size the original volume, b) ask
the thin pool to reserve the needed space and c) count the "live"
data (REFER in ZFS terms) allocated inside the thin volume.

Step-by-step example:
- create a 40 GB thin volume and subtract its size from the thin pool
(USED 40 GB, FREE 60 GB, REFER 0 GB);
- overwrite the entire volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
- snapshot the volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
- completely overwrite the original volume (USED 80 GB, FREE 20 GB,
REFER 40 GB);
- a new snapshot creation will fail (REFER is higher than FREE).

Result: the thin pool is *never allowed* to fill. You need to keep track
of per-volume USED and REFER space, but thinp performance should not be
impacted in any manner. This is not theoretical: it already works
this way with ZVOLs and refreservation, *without*
involving/requiring any advanced coupling/integration between the block
and filesystem layers.

Don't get me wrong: I am sure that, if you choose not to implement this
scheme, you have a very good reason for it. Moreover, I understand
that patches are welcome :)

But I would like to understand *why* this possibility is ruled out with
such firmness.

Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
7 years ago
Permalink
...
There could be a simple answer and a complex one :)

I'd start with the simple one - already presented here -

when you write to an INDIVIDUAL thin volume target - the respective dm
thin target manipulates a single btree set - it does NOT care that there
are some other snapshots and it never influences them -

You are asking here to heavily 'change' the thin-pool logic - so that
writing to THIN volume A can remove/influence volume B - and this is very
problematic for many reasons.

We can go into the details of the btree updates (that should really be
discussed with its authors on the dm channel ;)) - but I think the key
element is capturing the idea that the usage of thinLV A does not change
thinLV B.


----


Now to your free 'reserved' space fiction :)
There is NO way to decide WHO deserves to use the reserve :)

Every thin volume is equal - (the fact that we call some thin LV a
snapshot is user-land fiction - in the kernel all thinLVs are just equal -
every thinLV references a set of thin-pool chunks) -

(for late-night thinking - what would a snapshot of a snapshot which is
fully overwritten be ;))

So once you see that all thinLVs just map sets of chunks,
and all thinLVs can be active and running concurrently - how do you want
to use reserves in the thin-pool :) ?
When do you decide it ? (you need to see this is total race-land)
How do you actually orchestrate locking around this single point of failure ;) ?
You will surely come up with the idea of having a separate reserve for every thinLV ?
How big should it actually be ?
Are you going to 'refill' those reserves when the thin-pool gets emptier ?
How do you decide which thinLV deserves bigger reserves ;) ??

I assume you can start to SEE the whole point of this misery....

So instead - you can start with a normal thin-pool - keep it simple in the
kernel, and solve the complexity in user-space.

There you can decide - whether you want to extend the thin-pool...
You may drop some snapshot...
You may fstrim mounted thinLVs...
You can kill volumes way before the situation becomes unmaintainable....

All you need to accept is - you will kill them at 95% -
in your world with reserves this would already be reported as 100% full,
with a totally unknown size of reserves :)

Regards

Zdenek
Gionatan Danti
7 years ago
Permalink
...
Ok, this is an answer I totally accept: if enabling per-LV used and
reserved space is so difficult in the current thinp framework, don't do
it.

Thanks for taking the time to explain (late at night ;))
Post by Zdenek Kabelac
All you need to accept is - you will kill them at 95% -
in your world with reserves it would be already reported as 100% full,
with totally unknown size of reserves :)
Minor nitpicking: I am not speaking about "reserves" to use when free
space is low, but about "reserved space" - ie: per-volume space which
can not be used by any other object.

One question: in a previous email you showed how a threshold can be set
to deny new volume/snapshot creation. How can I do that? What LVM
version do I need?

Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
7 years ago
Permalink
Post by Zdenek Kabelac
Post by Zdenek Kabelac
There could be a simple answer and complex one :)
I'd start with simple one - already presented here -
There you can decide - if you want to extend thin-pool...
Post by Zdenek Kabelac
You may drop some snapshot...
You may fstrim mounted thinLVs...
You can kill volumes way before the situation becomes unmaintable....
Ok, this is an answer I totally accept: if enable per-lv used and reserved
space is so difficult in the current thinp framework, don't do it.
It's not just about 'complexity' in the framework.
You would lose all the speed as well.
You would significantly raise the memory requirements.

There is a very good reason complex tools like 'thin_ls' are kept in
user-space outside of the kernel - with 'dm' we tend to keep the kernel
logic simple, and complexity should stay in user-space.
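
For reference, thin_ls (from thin-provisioning-tools) is roughly used like
this against a live pool (pool/device names are only illustrative):

dmsetup message vg-pool-tpool 0 reserve_metadata_snap
thin_ls -m /dev/mapper/vg-pool_tmeta   # per-thinLV mapped/exclusive/shared block counts
dmsetup message vg-pool-tpool 0 release_metadata_snap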

And of course - as pointed out - the size of your 'reserve' is so vague :)
and could potentially represent a major portion of your whole thin-pool
size without any extra benefit (as obviously any reserve could be too
small unless you 'reach' the fully provisioned state :)

i.e. an example:
a 10G thinLV with 1G chunks - a single byte write may require a full 1G
chunk... so do you decide to keep 10 free chunks in reserve ??
...
Supposedly:

lvmconfig --typeconfig full --withversion

# Available since version 2.2.89.
thin_pool_autoextend_threshold=70


However, there were some bugs and fixes - and the validation for not
allowing the creation of new thins - so do not try anything below 169 and,
if you can, go with 173....


Regards

Zdenek
Gionatan Danti
7 years ago
Permalink
...
What was missing (because I thought it was implicit) is that I expect
snapshots to never change - i.e. they are read-only.
Anyway, I was not writing about "reserves" - rather, about
preassigning/preallocating the required space to a specific volume.
A fallocate on an otherwise thinly provisioned volume, if you like.
Post by Zdenek Kabelac
lvmconfig --typeconfig full --withversion
# Available since version 2.2.89.
thin_pool_autoextend_threshold=70
However there were some bugs and fixes - and validation for not
allowing to create new thins - so do not try anything below 169 and if
you can
go with 173....
Ah! I was not thinking about thin_pool_autoextend_threshold! I tried
with 166 (for now) and I don't see any major problems. However, I will
surely upgrade at the first opportunity!

Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
matthew patton
7 years ago
Permalink
- create a 40 GB thin volume and subtract its size from the thin pool (USED 40 GB, FREE 60 GB, REFER 0 GB);
- overwrite the entire volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
- snapshot the volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
And 3 other threads also take snapshots against the same volume, or frankly any other volume in the pool.
Since the next step (overwrite) hasn't happened yet or has written less than 20GB, all succeed.
- completely overwrite the original volume (USED 80 GB, FREE 20 GB, REFER 40 GB);
4 threads all try to write their respective 40GB. After all, they got the green light since their snapshot was allowed to be taken.
Your thinLV blows up spectacularly.
- a new snapshot creation will fails (REFER is higher then FREE).
nobody cares about new snapshot creation attempts at this point.
When do you decide it ?  (you need to see this is total race-lend)
exactly!
Gionatan Danti
7 years ago
Permalink
...
In all the examples I gave, the snapshots are supposed to be read-only, or
at least never written. I thought that was implicitly clear due to ZFS
snapshots (used as the example) being read-only by default. Sorry for not
explicitly stating that.

However, the refreservation mechanism can protect the original volume
even when snapshots are writeable. Here we go:

# Create a 400M ZVOL and fill it
[***@localhost ~]# zfs create -V 400M tank/vol1
[***@localhost ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=1M
oflag=direct
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
401+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 23.0573 s, 18.2 MB/s
[***@localhost ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 416M 464M 24K /tank
tank/vol1 414M 478M 401M -

# Create some snapshots (note how the USED value increased due to the
snapshot reserving space for all "live" data in the ZVOL)
[***@localhost ~]# zfs set snapdev=visible tank/vol1
[***@localhost ~]# zfs snapshot tank/***@snap1
[***@localhost ~]# zfs snapshot tank/***@snap2
[***@localhost ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 816M 63.7M 24K /tank
tank/vol1 815M 478M 401M -
tank/***@snap1 0B - 401M -
tank/***@snap2 0B - 401M -

# Clone the snapshot (to be able to overwrite it)
[***@localhost ~]# zfs clone tank/***@snap1 tank/cvol1
[***@localhost ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 815M 64.6M 24K /tank
tank/cvol1 1K 64.6M 401M -
tank/vol1 815M 479M 401M -
tank/***@snap1 0B - 401M -
tank/***@snap2 0B - 401M -

# Writing to the cloned ZVOL fails (after only 66 MB written) *without*
impacting the original volume
[***@localhost ~]# dd if=/dev/zero of=/dev/zvol/tank/cvol1 bs=1M
oflag=direct
dd: error writing ‘/dev/zvol/tank/cvol1’: Input/output error
64+0 records in
63+0 records out
66060288 bytes (66 MB) copied, 25.9189 s, 2.5 MB/s

After the last write, the cloned cvol1 is clearly corrupted, but the
original volume has no problem at all.

Now, I am *not* advocating switching thinp to ZFS-like behavior (i.e.
note the write speed, which is low even for my super-slow notebook HDD).
However, I would like a mechanism with which we can tell LVM "hey, this
volume should have all its space reserved; don't worry about preventing
snapshots and/or freezing them when free space runs out".

This was more or less the case with classical, fat LVM: a snapshot running
out of space *will* fail, but the original volume remains unaffected.

Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
7 years ago
Permalink
...
Ohh, this is a pretty major constraint ;)

But as pointed out multiple times - with scripting around the various
fullness levels of the thin-pool - several different actions can be
programmed, starting from fstrim and ending with a plain erase of an
unneeded snapshot.
(Maybe erasing unneeded files....)

To get the most secure application - such an app should actually avoid
using the page-cache (by using direct-io); in that case you are always
guaranteed to get the exact error at the exact time (i.e. even without the
journaled mounting option for ext4....)
After the last write, the cloned cvol1 is clearly corrputed, but the original
volume has not problem at all.
Surely there is a good reason we still keep 'old snapshots' with us -
although everyone knows their implementation has aged :)

There are cases where this copying into separate COW areas simply works
better - especially for temporarily living objects with a low number of
'small' changes.

We even support old snapshots of thin volumes for this reason - so you can
use 'bigger' thin-pool chunks - but for a temporary snapshot for taking a
backup you can take an old-style snapshot of a thin volume...
This was more or less the case with classical, fat LVM: a snapshot runnig out
of space *will* fail, but the original volume remains unaffected.
Partially this might get solved in 'some' cases with fully provisioned
thinLVs within a thin-pool...

What comes to my mind as a possible supporting solution - a possible
enhancement on the LVM2 side could be 'forcible' removal of running
volumes (aka the lvm2 equivalent of 'dmsetup remove --force').

ATM lvm2 prevents you from removing 'running/mounted' volumes.

I can well imagine LVM letting you forcibly replace such an LV with the
error target - so instead of a thinLV you would have a single 'error'
target snapshot - which could possibly even be auto-cleaned once the
volume use-count drops to 0 (lvmpolld/dmeventd monitoring, whatever...)

(Of course - we are not solving what happens to the application
using/running on top of such an error target - hopefully something not
completely bad....)

This way - you get a very 'powerful' weapon to be used in those
'scriptlets', so you can drop unneeded volumes ANYTIME you need to and
reclaim their resources...

Regards

Zdenek
Gionatan Danti
7 years ago
Permalink
Post by Zdenek Kabelac
Ohh this is pretty major constrain ;)
Sure :p
Sorry for not explicitly stating that before.
...
True, but the pagecache exists for a reason. Anyway, this is not something
you can "fix" in device mapper/lvm; I 100% agree with that.
...
This would be *really* great. I played with dmsetup remove/error target
and, while it worked, it often froze LVM.
An integrated forced volume removal/switch to the error target would be
great.

Thank.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
7 years ago
Permalink
...
Forcible remove (with some reasonable locking - so that e.g. 2 processes
are not playing with the same device :) 'dmsetup remove --force' - replaces
the existing device with the 'error' target (with built-in noflush).

Anyway - if you see a reproducible problem with forcible removal - it needs
to be reported, as this is a real bug then, and a BZ shall be opened...

Regards

Zdenek
Gionatan Danti
7 years ago
Permalink
Post by Zdenek Kabelac
Forcible remove (with some reasonable locking - so i.e. 2 processes
are not playing with same device :) 'dmsetup remove --force' - is
replacing
existing device with 'error' target (with built-in noflush)
Anyway - if you see a reproducible problem with forcible removal - it
needs to be reported as this is a real bug then and BZ shall be
opened...
Regards
Zdenek
Ok, I'll do some more tests and, in case problems arise, I'll open the BZ
:)
Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
matthew patton
7 years ago
Permalink
Post by Gionatan Danti
True, with a catch: with the default data=ordered option, even ext4 does
*not* remount read only when data writeout fails. You need to use
both "errors=remount-ro" and "data=journal" which basically nobody uses.
Then you need to hang out with people who actually do storage for a living.
matthew patton
7 years ago
Permalink
'yes'

The filesystem may not be resident on the hypervisor (dom0), so 'dmsetup suspend' is probably more apropos. How well that propagates upward to the unwary client VM remains to be seen. But if one were running an NFS server using thin+xfs/ext4, then 'fsfreeze' makes sense.
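
For the NFS-server case, that would look roughly like this (the mount point
is only an example):

fsfreeze -f /export/share    # block new writes and flush dirty data
# ...extend the pool, take/remove snapshots, etc...
fsfreeze -u /export/share    # thaw
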
Zdenek Kabelac
7 years ago
Permalink
Post by matthew patton
'yes'
The filesystem may not be resident on the hypervisor (dom0) so 'dmsetup suspend' is probably more apropos. How well that propagates upward to the unwary client VM remains to be seen. But if one were running a NFS server using thin+xfs/ext4 then the 'fsfreeze' makes sense.
lvm2 is not 'expecting' someone will touch lvm2 controlled DM devices.

If you suspend a thinLV with dmsetup, you are in big 'danger' of freezing
further lvm2 processing - i.e. a command will try to scan the device list and
will get blocked on the suspended device (even though lvm2 can check for
'suspended' dm devices and skip them via an lvm.conf setting, there is a clear race).
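
For completeness, the lvm.conf knob in question - it makes lvm2 skip
suspended dm devices during scanning, but as noted it does not close the race:

  # /etc/lvm/lvm.conf
  devices {
      ignore_suspended_devices = 1
  }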

So any solution which works outside lvm2 and changes the dm table outside of
the lvm2 locking mechanism is hardly supportable - it can be used as a 'last
weapon' - but it should be clear to the user that the next proper step is to
reboot the machine....


Regards


Zdenek
matthew patton
7 years ago
Permalink
I don't recall seeing an actual, practical, real-world example of why this issue got broached again. So here goes.

Create a thin LV on KVM dom0, put XFS/EXT4 on it, lay down (sparse) files as KVM virtual disk files.
Create and launch VMs and configure them to suit - for example, a dedicated VM for each of a web server, a Tomcat server, and a database. Let's call it a 'Stack'.
You're done configuring it.

You take a snapshot as a "restore point".
Then you present to your developers (or customers) a "drive-by" clone (snapshot) of the LV, in which the changes are typically quite limited (but could amount to a full capacity's worth of overwrites) depending on how much they test/play with it. You could have 500 such copies resident. Thin LV clones are damn convenient, mostly "free", and attractive for that purpose.

At some point one of those snapshots gets launched as, or converted into, a production instance. Or, if you prefer, a customer purchases it, and now you must be able to guarantee that it can do a full overwrite of its space and that any interaction with the underlying thin pool trumps all the other ankle-biters (demo, dev, qa, trial) that might also be resident. Lesser snapshots will necessarily be evicted (destroyed) until the pool regains some pre-defined level of reserved space, which is now used solely for quick point-in-time restore points of the remaining instances. These snaps are retained for some amount of time and likely spooled off to a backup location. If thin pool pressure gets too high, the oldest restore points (snapshots) get destroyed.

In any given ThinPool there may be multiple Stacks or flavors/versions of same.

I believe the pseudo-script provided earlier this afternoon suffices to implement the above.
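
For what it's worth, a minimal sketch of the eviction side of that policy -
every name here (vg0, thinpool, the snap_ prefix) and the threshold are made
up, and 'protected' volumes would simply not match the prefix:

  #!/bin/sh
  VG=vg0 POOL=thinpool THRESHOLD=90
  # integer part of the pool's data usage, e.g. "47.50" -> 47
  used=$(lvs --noheadings -o data_percent "$VG/$POOL" | tr -d ' ' | cut -d. -f1)
  if [ "$used" -ge "$THRESHOLD" ]; then
      # oldest disposable snapshot in this pool (default sort is ascending by lv_time)
      victim=$(lvs --noheadings -O lv_time -o lv_name \
               --select "pool_lv=$POOL && lv_name=~^snap_" "$VG" | head -n1 | tr -d ' ')
      [ -n "$victim" ] && lvremove -y "$VG/$victim"
  fi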
Zdenek Kabelac
7 years ago
Permalink
...
There is one point which IMHO would be way more worthwhile to invest resources into:
ATM, whenever you have a snapshot, there is unfortunately no page-cache sharing.

So if you have e.g. 10 LVs that are snapshots of a single origin, you get 10
different copies of the same data's pages in RAM.

But this is a really hard problem to solve...


Regards


Zdenek
matthew patton
7 years ago
Permalink
The issue with scripts is that they feel rather vulnerable to corruption, not being there etc.
Only you are responsible for making sure scripts are available and correctly written.
So in that case I suppose that you would want some default, shipped scripts that come with LVM as
examples of default behaviour, and that are also activated by default?
...
/usr/share/lvm/scripts/
Heck no to activation. The only path that's correct is that last one. The previously supplied example code should have been more than enough for you to venture out on your own and write custom logic.
Then not even a threshold value needs to be configured.
Nobody else wants your particular logic, run interval, or thresholds. Write your scripts to suck in /etc/sysconfig/lvm/<vgname> or wherever the distro of your choice puts such things.
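
i.e. something along these lines - a hypothetical per-VG config file that a
custom monitor script would source; every name in it is made up:

  # /etc/sysconfig/lvm/vg0
  POOL=thinpool
  HIGH_WATER=85              # %data at which disposable snapshots start getting dropped
  PROTECTED="lv_root lv_db"  # volumes that are never auto-removed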
Yes. One obvious scenario is root on thin.
It's pretty mandatory for root on thin.
Can you elaborate with an example? Because that's the most dangerous one not to have space fully reserved for, unless you've established other means to ensure nothing writes to that volume or the writes are very, very well defined, i.e. 'yum update' is disabled, nobody has root, the filesystem is mounted RO, etc.
You cannot set max size for thin snapshots?
And you want to do that to 'root' volumes?!?!
you cannot calculate in advance what can happen,
because by design, mayhem should not ensue, but what if your
predictions are off?
Simple. You don't do stupid things like NOT reserving 100% of space in a thinLV for all root volumes. You buy as many hard drives as necessary so you don't get burned.
Being able to set a maximum snapshot size before it gets dropped could be very nice.
Write your own script that queries all volumes, and destroys those that are beyond your high-water mark unless optionally they are "special".
When free space on thin pool drops below ~120MB
At best your user-space program runs once a minute, and writing 120MB takes a couple of seconds, so between runs of your 'monitor' you've blown past any chance of taking action.

8TB drives are $250. Buy disk, and then buy more disk, and quadruple your notion of reserves. If your workload isn't critical, then nobody cares if you blow it sky-high, and you can do silly things like shaving too close. But I have to ask: to what possible purpose?
I want the 20GB volume and the 10GB volumes to be frozen
It takes time for a script to log into each KVM domain and issue an fsfreeze, or even just to suspend the VM from the hypervisor. Meanwhile, writers are potentially writing at several hundred MB per second. You're looking at a massive torrent of write errors.
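
For scale, even the hypervisor-side shortcut - a sketch assuming libvirt
guests with the qemu guest agent installed - still has to round-trip through
every guest agent, so it is anything but instant:

  for dom in $(virsh list --name); do
      virsh domfsfreeze "$dom"     # freeze all mounted filesystems inside the guest
  done
  # ...deal with the pool...
  for dom in $(virsh list --name); do
      virsh domfsthaw "$dom"
  done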
much deleted
Sounds like you got all the bits figured out. Write the script and post it to GitHub or PasteBin.
so that the burden on the administrator is very minimal.
No sysadmin worth a damn is going to skip spending a LOT of time thinking about whether this sort of thing is even rational, and if so, where they want to draw the line. This sort of behavior doesn't suffer fools gladly, nor is it appropriate for people who don't first know what they are doing to attempt. Some parts of Linux/Unix are Experts-Only for a reason.
matthew patton
7 years ago
Permalink
From the two proposed solutions (lvremove vs lverror), I think I would prefer the second one.
I vote the other way. :)
First because 'remove' maps directly to the DM equivalent action which brought this about. Second because you are in fact deleting the object - i.e. it's not coming back. That it returns a nice and timely error code up the stack instead of the kernel doing 'weird things' is an implementation detail.

That's not to say 'lverror' might not have a use of its own as a "mark this device as in an error state and return EIO on every OP". Which implies you could later remove the flag and I/O could resume, subject to the higher levels not having already wigged out in some fashion. However, why not change the behavior of 'lvchange -n' to do that on its own on a previously activated entry that still has a ref count > 0? With '--force', of course.

With respect to freezing or otherwise stopping further I/O to an LV being used by virtual machines, the only correct/sane solution is one of 'power off' or 'suspend'. Reaching into the VM to freeze individual/all filesystems but otherwise leave the VM running assumes significant knowledge of the VM's internals and the luxury of time.
Zdenek Kabelac
7 years ago
Permalink
Post by matthew patton
From the two proposed solutions (lvremove vs lverror), I think I would prefer the second one.
I vote the other way. :)
First because 'remove' maps directly to the DM equivalent action which brought this about. Second because you are in fact deleting the object - i.e. it's not coming back. That it returns a nice and timely error code up the stack instead of the kernel doing 'weird things' is an implementation detail.
It's not that easy.

lvm2 cannot just 'lose' a volume which is still mapped IN the table (even if
it will be an error segment).

So the result of the operation will be some 'LV' in the lvm2 metadata,
which could possibly be flagged for 'automatic' removal later, once it's no
longer held in use.

There is 'some' similarity here to snapshot merge - where lvm2
also maintains some 'fictional' volumes internally...

So 'lvm2' could possibly 'mask' the device as 'removed' - or it could keep it
remapped to an error target - which could possibly be usable for other things.
Post by matthew patton
That's not to say 'lverror' might not have a use of its own as a "mark this device as in an error state and return EIO on every OP". Which implies you could later remove the flag and I/O could resume, subject to the higher levels not having already wigged out in some fashion. However, why not change the behavior of 'lvchange -n' to do that on its own on a previously activated entry that still has a ref count > 0? With '--force', of course.
'lverror' could also be used for 'lvchange -an' - so not just for 'lvremove' -
and it could possibly be used for other volumes (not just thins) -

so you would get an lvm2 mapping of 'dmsetup wipe_table'

('lverror' would actually be something like 'lvconvert --replacewitherror'
- likely we would not add a new 'extra' command for this conversion)
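
i.e. roughly the dm-level operation such a conversion would wrap - a sketch
with a made-up dm name, and note that 'lverror'/'--replacewitherror' do not
exist today:

  dmsetup wipe_table --noflush vg-somelv   # replace the device's table with an 'error' target
  dmsetup table vg-somelv                  # now shows: 0 <sectors> error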
Post by matthew patton
With respect to freezing or otherwise stopping further I/O to an LV being used by virtual machines, the only correct/sane solution is one of 'power off' or 'suspend'. Reaching into the VM to freeze individual/all filesystems but otherwise leave the VM running assumes significant knowledge of the VM's internals and the luxury of time.
And 'suspend' can be dropped from this list ;) - so far, lvm2 treats a
device left suspended after command execution as a serious internal error,
and there is a long list of good reasons for not leaking suspended devices.

Suspend is designed as a short-lived 'state' of a device - it's not meant to be
held suspended for an undefined amount of time - it causes lots of trouble for
various /dev-scanning software (lvm2 included....) - and as such it has
races built in :)


Regards

Zdenek
Xen
7 years ago
Permalink
Post by matthew patton
Post by Gionatan Danti
From the two proposed solutions (lvremove vs lverror), I think I
would prefer the second one.
I vote the other way. :)
Unless you were only talking about lvremoving snapshots, this is hugely
irresponsible.

You are throwing away a data collection?
Post by matthew patton
First because 'remove' maps directly to the DM equivalent action which
brought this about.
That would imply you are only talking about snapshots, no? Even if
dmsetup only creates a mapping, throwing away that mapping generally
throws away data.
Post by matthew patton
With respect to freezing or otherwise stopping further I/O to an LV being
used by virtual machines, the only correct/sane solution is one of
'power off' or 'suspend'. Reaching into the VM to freeze
individual/all filesystems but otherwise leave the VM running assumes
significant knowledge of the VM's internals and the luxury of time.
That is simply a user-centric action that will differ in this case, because
a higher-level function is better at cleanly killing something than a
lower-level function.

Just as fsfreeze would probably be better in that sense than
lverror.

So in that sense, yes of course, I agree, that would be logical.

If you can do it cleanly, you don't have to do it roughly.

In the same way you could have a user-centric action shut down webservers
and so on. By 'user' I mean administrator.

But as the LVM team you can only provide the mechanism that decides when
action should be taken, and some default actions like possibly lverror and
fsfreeze. You cannot decide what higher-level interventions would be
needed everywhere.

Of course a clean shutdown would be ideal. That depends on the system
administrator.