Discussion:
[linux-lvm] thin handling of available space
Xen
2016-04-23 17:53:03 UTC
Hi,

So here is my question. I was talking about it with someone, who also
didn't know.



There seems to be a reason against creating a combined V-size that
exceeds the total L-size of the thin-pool. I mean that's amazing if you
want extra space to create more volumes at will, but at the same time
having a larger sum V-size is also an important use case.

Is there any way that user tools could ever be allowed to know about the
real effective free space on these volumes?

My thinking goes like this:

- if LVM knows about allocated blocks then it should also be aware of
blocks that have been freed.
- so it needs to receive some communication from the filesystem
- that means the filesystem really maintains a "claim" on used blocks,
or at least notifies the underlying layer of its mutations (in practice
this is what discard/TRIM already does; see the example after this list).

- in that case a reverse communication could also exist where the block
device communicates to the file system about the availability of
individual blocks (such as might happen with bad sectors) or even the
total amount of free blocks. That means the disk/volume manager (driver)
could or would maintain a mapping or table of its own blocks. Something
that needs to be persistent.
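
As an aside: half of this already exists today. The filesystem-to-block
communication is what discard/TRIM does, and on a thin LV you can watch it
work. A minimal illustration (assuming a pool vg/pool and a thin LV vg/thin
mounted at /mnt/thin; the names are just examples):

  # note the pool's Data% before
  lvs -o lv_name,data_percent vg/pool vg/thin

  # delete something, then tell the block layer which blocks are now free
  rm /mnt/thin/some-large-file
  fstrim -v /mnt/thin

  # the pool's Data% should have dropped, because dm-thin unmapped the
  # discarded blocks (assuming discards are not ignored by the pool)
  lvs -o lv_name,data_percent vg/pool vg/thin

It is the reverse direction, block device to filesystem, that does not exist.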

That means the question becomes this:

- is it either possible (theoretically) that LVM communicates to the
filesystem about the real number of free blocks that could be used by
the filesystem to make "educated decisions" about the real availability
of data/space?

- or, is it possible (theoretically) that LVM communicates a "crafted"
map of available blocks in which a certain (algorithmically determined)
group of blocks would be considered "unavailable" due to actual real
space restrictions in the thin pool? This would seem very suboptimal but
would have the same effect.

Say the filesystem thinks it has 6GB available but really there is
only 3GB because the pool is filling up: does it currently get notified
of this?

What happens if it does fill up?

Funny that we are using GB in this example. I remembered today using
Stacker on an MS-DOS disk where I had 20MB available and was able to
increase it to 30MB ;-).

Someone else might use terabytes, but anyway.

If the filesystem normally has a fixed size and this size doesn't change
after creation (without modifying the filesystem) then it is going to
calculate its free space based on its knowledge of available blocks.

So there are three figures:

- total available space
- real available space
- data taken up by files.

total - data is not always accurate, because there may still be open
handles on deleted files, etc. The "du" of visible, countable files +
blocks still held in use + available blocks should be ~ total blocks.

So we are only talking about blocks here, nothing else.

And if LVM can communicate about availability of blocks, a fourth figure
comes into play:

total = used blocks + unused blocks + unavailable blocks.

If LVM were able to dynamically adjust this last figure, we might have a
filesystem that truthfully reports actual available space in a thin
setting.

I do not even know whether this is already the case, but I read
something that stressed the importance of "monitoring available space",
which would make the whole situation unusable for an ordinary user.

Then you would need GUI applets that said "The space on your thin volume
is running out (but the filesystem might not report it)".

So question is:

* is this currently 'provisioned' for?
* is this theoretically possible, if not?

If you take it to a tool such as "df", there are only three figures and
they add up.

They are:

total = used + available

but we want

total = used + available + unavailable

either that, or the total must be dynamically adjusted, but I think
this is not a good solution.


So another question:

*SHOULDN'T THIS simply be a feature of any filesystem?*

The provision of being able to know about the *real* number of blocks in
case an underlying block device might not be "fixed, stable, and
unchanging"?

The way it is you can "tell" Linux filesystems with fsck which blocks
are bad blocks and thus unavailable, probably reducing the number of
"total" blocks.

From a user interface perspective, perhaps this would be an ideal
solution, if you needed any solution at all. Personally I would probably
prefer either the total space to be "hard limited" by the underlying
(LVM) system, or for df to show a different output, but df output is
often parsed by scripts.

In the former case, suppose a volume was filling up.

Filesystem    1K-blocks    Used Available Use% Mounted on
udev            1974288       0   1974288   0% /dev
tmpfs            404384   41920    362464  11% /run
/dev/sr2        1485120 1485120         0 100% /cdrom

(Just taking 3 random filesystems)

One filesystem would see its "used" space go up. The other two would
see their "total" size go down, and the first one would see that figure
go down as well. That would be counterintuitive and you cannot really do
this.

It's impossible to give this information to the user in a way that the
numbers still add up.

Supposing:

real size 2000

total used avail
1000 500 500
1000 500 500
1000 500 500

combined virtual size 3000. Total usage 1500. Real free 500. Now the
first volume uses another 250.

total used avail
1000 750 250
1000 500 250
1000 500 250

The numbers no longer add up for the 2nd and 3rd system.

You *can* adjust total in a way that it still makes sense (a bit):

total used avail
1000 750 250
750 500 250
750 500 250

You can also just ignore the discrepancy, or add another figure:

total used unav avail
1000 750 0 250
1000 500 250 250
1000 500 250 250

Whatever you do, you would have to simply calculate this adjusted number
from the real number of available blocks.

Now the third volume takes another 100

First style:

total used avail
1000 750 150
1000 500 150
1000 600 150

Second style:

total used avail
900 750 150
650 500 150
750 600 150

Third style:

total used unav avail
1000 750 100 150
1000 500 350 150
1000 600 250 150

There's nothing technically inconsistent about it, it is just rather
difficult to grasp at first glance.

df uses filesystem data, but we are really talking about
block-layer-level data now.

You would either need to communicate the number of available blocks
(but which ones?) and let the filesystem calculate the unavailable ones,
or communicate the number of unavailable blocks, at which point you just
do this calculation yourself. For each volume you reach a different
number of "blocks" you need to withhold.

If you needed to make those blocks unavailable, you would now need to
"unavail" them to the filesystem layer above, whether randomly, at the
end of the volume, or by any other method.

Every write that filled up more blocks would be communicated to you
(since you receive the write or the allocation) and would result in an
immediate return of "spurious" mutations or an updated number of
unavailable blocks -- and you could also communicate both.

On every new allocation, the filesystem would be handed blocks that
you have artificially marked as unavailable. All of this only happens if
available real space becomes less than that of the individual volumes
(virtual size). The virtual "available" minus the "real available" is
the number of blocks (extents) you are going to communicate as being
"not there".

At every mutation from the filesystem, you respond with a like mutation:
not to the filesystem that did the mutation, but to every other
filesystem on every other volume.

Space being freed (deallocated) then means a reverse communication to
all those other filesystems/volumes.

But it would work, if this was possible. This is the entire algorithm.


I'm sorry if this sounds like a lot of "talk" and very little "doing"
and I am annoyed by that as well. Sorry about that. I wish I could
actually be active with any of these things.

I am reminded of my father. He was in school for being a car mechanic
but he had a scooter accident days before having to do his exam. They
did the exam with him in a (hospital) bed. He only needed to give
directions on what needed to be done and someone else did it for him :p.

That's how he passed his exam. It feels the same way for me.

Regards.
Xen
2016-04-27 12:01:26 UTC
I was talking about the idea to communicate to a filesystem the amount
of available blocks.

I noticed https://bugzilla.redhat.com/show_bug.cgi?id=1189215 named "LVM
Thin: Handle out of space conditions better" which was resolved by
Zdenek Kabelac (hey Zdenek) and which gave rise to (apparently) the new
warning you get when you overprovision.



But this warning when overprovisioning does not solve any problems in a
running system.

You /still/ want to overprovision AND you want a better way to handle
out of space conditions.

A number of items were suggested in that bug:

1) change the default "resize thin-p at 100%" setting in lvm.conf
2) warn users that they have insufficient space in a pool to cover a
fully used thinLV
3) change default wait time from 60sec after an out-of-space condition
to something longer

Corey Marthaler suggested that only #2 was implemented, and this bug (as
mentioned) was linked in an errata mentioned at the end of the bug.
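
For reference, as far as I can tell items 1 and 3 correspond to knobs that
already exist; a sketch of where they live (the values are examples, not
recommendations):

  # lvm.conf, activation section: auto-extend the pool once it crosses 80%,
  # growing it by 20% each time (a threshold of 100 disables auto-extend)
  activation {
      thin_pool_autoextend_threshold = 80
      thin_pool_autoextend_percent = 20
  }

  # dm-thin kernel module: how long writes are queued after the pool runs
  # out of space before they start erroring (seconds)
  cat /sys/module/dm_thin_pool/parameters/no_space_timeout

What none of these cover is telling the filesystem anything, which brings me
to my suggestion.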


So since I have already talked about it here with my lengthy rambling
post ;-) I would like to at least "formally" suggest a #4 here, and ask
whether I should comment on that bug or submit a new one about it?


So my #4 would be:

4) communicate and dynamically update a list of free blocks being sent
to the filesystem layer on top of a logical volume (LV) such that the
filesystem itself is aware of shrinking free space.

Logic implies:
- any thin LV seeing more blocks being used causes the other filesystems
in that thin pool to be updated with new available blocks (or numbers)
if this amount becomes less than the filesystem normally would think it
had

- any thin LV that sees blocks being discarded by the filesystem causes
the other filesystems in that thin pool to be updated with newly
available blocks (or numbers) up to the moment that the real available
space agrees once more with the virtual available space (real free >=
virtual free)

Meaning that this feedback would start happening for any thin LV when
the real available space in the thin pool or volume group (depending on
how that works in that particular configuration) becomes less than the
virtual available space for the thin volume (LV).

This would mean that the virtual available space would in effect
dynamically shrink and grow with the real available space as an
envelope.

The filesystem may know this as an adjusted total available space
(number of blocks) or as an adjusted number of unavailable blocks. It
would need to integrate this in its free space calculation. For a user
tool such as "df" there are 3 ways to update this changing information:

1. dynamically adjust the total available blocks
2. dynamically adjust the amount of free blocks
3. introduce a new field of "unavailable"

Traditional "df" is "total = used + free", the new one would be "total =
used + free + unavailable".

For any user tool not working in blocks but simply in available space
(bytes), likely only the amount of free space being reported would
change.

One may choose to hide the information in "df" and introduce a new flag
that shows unavailable as well?

Then only the amount of free blocks reported would change, and the
numbers just don't add up visibly.

It falls along the line of the "discard" family of communications that
were introduced in 2008 (https://lwn.net/Articles/293658/).

I DO NOT KNOW if this already exists but I suppose it doesn't. I do not
know a lot about the filesystem side of things. I just took the liberty
of asking Jonathan Corwell erm Corbet whether this is possible :p.

Anyway, hopefully I am not being too much of a pain here. Regards.
matthew patton
2016-04-27 12:26:57 UTC
It is not the OS' responsibility to coddle stupid sysadmins. If you're not watching for high-water marks in FS growth vis-à-vis the underlying storage, you're not doing your job. If there was anything more than the remotest chance that the FS would grow to full size, it should not have been thin in the first place.

The FS already has a notion of 'reserved'. man(8) tune2fs -r
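
For example (ext4 shown, device name illustrative):

  # reserve 5% of the filesystem's blocks for root
  tune2fs -m 5 /dev/vg/somevol

  # or reserve an explicit block count
  tune2fs -r 262144 /dev/vg/somevol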
Xen
2016-04-27 21:28:31 UTC
Post by matthew patton
It is not the OS' responsibility to coddle stupid sysadmins. If you're
not watching for high-water marks in FS growth vis a vis the
underlying, you're not doing your job. If there was anything more than
the remotest chance that the FS would grow to full size it should not
have been thin in the first place.
Who says the only ones who would ever use or consider using thin would
be sysadmins?

Monitoring Linux is troublesome enough for most people and it really is
a "job".

You seem to be intent on making the job harder rather than easier, so
you can be the type of person that has this expert knowledge while
others don't?

I remember one reason to crack down on sysadmins was that they didn't
know how to use "vi" - if you can't use fucking vi, you're not a
sysadmin. This is actually a bloated version of what a system
administrator is or could at all times be expected to do, because you
are ensuring that problems are going to surface one way or another when
this sysadmin is suddenly no longer capable of being this perfect guy
100% of the time.

You are basically ensuring disaster by having that attitude.

That guy that can battle against all odds and still prevail ;-).

More to the point.

No one is getting coddled, because Linux is hard enough and it is
usually the users who are getting coddled; strangely enough the attitude
exists that the average desktop user never needs to look under the hood.
If something is ugly, who cares, the "average user" doesn't go there.

The average user is oblivious to all system internals.

The system administrator knows everything and can launch a space rocket
with nothing more than matches and some gallons of rocket fuel.

;-).


The autoextend mechanism is designed to prevent calamity when the
filesystem(s) grow to full size. By your reasoning, it should not exist
because it coddles admins.

A real admin would extend manually.

A real admin would specify the right size in advance.

A real admin would use thin pools of thin pools that expand beyond your
wildest dreams :p.

But on a more serious note, if there is no chance a file system will
grow to full size, then it doesn't need to be that big.

But there are more use cases for thin than hosting VMs for clients.

Also I believe thin pools have a use for desktop systems as well, when
you see that the only alternative really is btrfs and some distros are
going with it full-time. Btrfs also has thin provisioning in a sense but
on a different layer, which is why I don't like it.

Thin pools from my perspective are the only valid snapshotting mechanism
if you don't use btrfs or zfs or something of the kind.

Even a simple desktop monitor, some applet with configured thin pool
data, would of course alleviate a lot of the problems for a "casual
desktop user". If you remotely administer your system with VNC or the
like, that's the same. So I am saying there is no single use case for
thin.

Your response mr. patton falls along the lines of "I only want this to
be used by my kind of people".

"Don't turn it into something everyone or anyone can use".

"Please let it be something special and nichie".

You can read coddle in place of cuddle.



It seems pretty clear to me that a system that *requires* manual
intervention and monitoring at all times is not a good system,
particularly if the feedback on its current state cannot be retrieved
from, or used by, other existing systems that guard against more or
less the same type of things.

Besides, if your arguments here were valid, then
https://bugzilla.redhat.com/show_bug.cgi?id=1189215 would never have
existed.
Post by matthew patton
The FS already has a notion of 'reserved'. man(8) tune2fs -r
Alright, thanks. But those blocks are manually reserved for a specific
user. That's what they are for; that's what -u is for. These blocks are
still available to the filesystem.

You could call it calamity prevention as well. There will always be a
certain amount of space for say the root user.

And by the same measure you could also say the tmpfs overflow mechanism
for /tmp is not required either, because a real admin would never see
his rootfs run out of disk space.

Stuff happens. You ensure you are prepared when it does. Not stick your
head in the sand and claim that real gurus never encounter those
situations.

The real question you should be asking is if it increases the monitoring
aspect (enhances it) if thin pool data is seen through the lens of the
filesystems as well.

Or whether that is going to be a detriment.

Regards.



Erratum:

https://utcc.utoronto.ca/~cks/space/blog/tech/SocialProblemsMatter

There is a widespread attitude among computer people that it is a great
pity that their beautiful solutions to difficult technical challenges
are being prevented from working merely by some pesky social issues
[read: human flaws], and that the problem is solved once the technical
work is done. This attitude misses the point, especially in system
administration: broadly speaking, the technical challenges are the easy
problems.

No technical system is good if people can't use it or if it makes
people's lives harder (my words). One good example of course is Git. The
typical attitude you get is that a real programmer has all the skills of
a git guru. Yet git is a git. Git is an asshole system.

Beside the point here perhaps. But. Let's drop the "real sysadmin"
ideology. We are humans. We like things to work for us. "Too easy" is
not a valid criticism for not having something.
Marek Podmaka
2016-04-28 06:46:35 UTC
Hello Xen,
Post by Xen
The real question you should be asking is if it increases the monitoring
aspect (enhances it) if thin pool data is seen through the lens of the
filesystems as well.
Beside the point here perhaps. But. Let's drop the "real sysadmin"
ideology. We are humans. We like things to work for us. "Too easy" is
not a valid criticism for not having something.
As far as I know (someone correct me) there is no mechanism at all in
the kernel for communication from the lower block layers to the higher
fs layers - besides exporting static properties like physical block
size. The other way (from a higher layer like the fs to the lower
layers) works fine - for example discard support.

So even if what you are asking might be valid, it isn't as simple as adding
some parameter somewhere and having it magically work. It is about
inventing and standardizing a new communication system, which would of
course work only with new versions of all the tools involved.

Anyway, I have no idea what the filesystem itself would do with the
information that no more space is available. Also, this would work only
for lvm thin pools, not for thin provisioning directly from storage, so
it would be a non-consistent mess. Or you would need another protocol for
exporting thin-pool related dynamic data from storage (via NAS, SAN,
iSCSI and all other protocols) to the target system. And in some
organizations it is not desirable at all to make this kind of
information visible to all target systems / departments.

What you are asking can be done for example directly in "df" (or you
can make a wrapper script), which would not only check the filesystems
themselves, but also the thin part and display the result in whatever
format you want.
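
A very rough sketch of such a wrapper, assuming a single thin pool named
vg/pool and that the caller is allowed to run lvs (i.e. root):

  #!/bin/bash
  # plain df output first
  df -h "$@"

  echo
  echo "thin pool usage:"
  lvs -o lv_name,lv_size,data_percent,metadata_percent vg/pool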

Also, displaying real thin free space for each fs won't be "correct".
If I see 1 TB free in each filesystem and start writing, by the
time I finish, that 1 TB might be taken by the other fs. So
information about current free space in thinp is useless for me, as in
1 minute it could be a totally different number.
--
bYE, Marki
Xen
2016-04-28 10:33:03 UTC
Post by Marek Podmaka
Hello Xen,
Post by Xen
The real question you should be asking is if it increases the monitoring
aspect (enhances it) if thin pool data is seen through the lens of the
filesystems as well.
Beside the point here perhaps. But. Let's drop the "real sysadmin"
ideology. We are humans. We like things to work for us. "Too easy" is
not a valid criticism for not having something.
As far as I know (someone correct me) there is no mechanism at all in
kernel for communication from lower fs layers to higher layers -
besides exporting static properties like physical block size. The
other way (from higher layer like fs to lower layers works fine - for
example discard support).
I suspected so.
Post by Marek Podmaka
So even if what you are asking might be valid, it isn't as simple as adding
some parameter somewhere and it would magically work. It is about
inventing and standardizing new communication system, which would of
course work only with new versions of all the tools involved.
Right.
Post by Marek Podmaka
Anyway, I have no idea what would filesystem itself do with information
that no more space is available. Also this would work only for lvm
thin pools, not for thin provision directly from storage, so it would
be a non-consistent mess. Or you would need another protocol for
exporting thin-pool related dynamic data from storage (via NAS, SAN,
iSCSI and all other protocols) to the target system. And in some
organizations it is not desirable at all to make this kind of
information visible to all target systems / departments.
Yes, I don't know how "thin provisioning directly from storage" works.

I take it you mean that these protocols you mention are or would be the
channel through which the communication would need to happen that I now
just proposed for LVM.

I take it you mean that these systems offer regular-looking devices over
any kind of link, while "secretly" behind the scenes using thin
provisioning for that, and that as such we are or would be dealing with
pretty "hard coded" standards that would require a lot of momentum to
change any of that. In the sense that the clients of these storage
systems themselves do not know about the thin provisioning and it is up
to the admin of those systems... yadda yadda yadda.

I feel really stupid now :p.

And to make it worse, it means that in these "hardware" systems the user
and admin are separated, but the same is true if you virtualize and you
offer the same model to your clients. I apologize for my noviceness here
and the way I come across.

But I agree that for any client it is not helpful to know about hard
limits that should be invisible to them, provided that the provisioning
is done right.

It would be quite disconcerting to see your total available space suddenly
shrink without being aware of any autoextend mechanism (for instance) and
as such there seems to be a real divide between the "user" and the
"supplier" of any thin volume.

Maybe I have misinterpreted the real use case for thin pools then. But my
feeling is that I am just a bit confused at this point.
Post by Marek Podmaka
What you are asking can be done for example directly in "df" (or you
can make a wrapper script), which would not only check the filesystems
themselves, but also the thin part and display the result in whatever
format you want.
That is true of course. I have to think about it.
Post by Marek Podmaka
Also displaying real thin free space for each fs won't be "correct".
If I see 1 TB free in each filesystem and starting writing, by the
time I finish, those 1 TB might be taken by the other fs. So
information about current free space in thinp is useless for me, as in
1 minute it could be totally different number.
But the calamity is that if that was really true, and the thing didn't
autoextend, then you'd end up with a frozen system.

So basically it seems at this point a conflict of interests:

- you don't want your clients to know your systems are failing
- they might not even be failing if they autoextend
- you don't want to scare them with in that sense, inaccurate data

- on a desktop system, the user and sysadmin would be the same
- there is not really any provision for graphical tools.

(maybe I should develop one. I so badly want to start coding again).

- a tool that notifies the user about the thin pool would do the job of
informing the user/admin just as well as a filesystem-level figure
would.

- that implies that the two roles would stay separate.
- desktops seem to be using btrfs now in some distros

I'm concerned with the use case of a desktop user that could employ this
technique. I now understand a bit more perhaps why grub doesn't support
LVM thin.

The management tools for a desktop user also do not exist (except the
command line tools we have).

Well, wrong again: there is a GUI, it is just not very helpful.

It is not helpful at all for monitoring.

It can
* create logical volumes (regular, stripe, mirror)
* move volumes to another PV
* extend volume groups to another PV

And that's about all it can do I guess. Not sure it even needs to do much
more, but it is no monitoring tool of any sophistication.

Let me think some more on this and I apologize for the "out loud"
thinking.

Regards.
matthew patton
2016-04-28 10:43:50 UTC
Post by Marek Podmaka
Post by Xen
The real question you should be asking is if it increases the monitoring
aspect (enhances it) if thin pool data is seen through the lens of the
filesystems as well.
Then you want an integrated block+fs implementation. See BTRFS and ZFS. WAFL and friends.
Post by Marek Podmaka
kernel for communication from lower fs layers to higher layers -
Correct. Because doing so violates the fundamental precepts of OS design. Higher layers trust lower layers. Thin Pools are outright lying about the real world to anything that uses its services. That is its purpose. The FS doesn't give a damn that the block layer is lying to it; it can and does assume, and rightly so, that what the block layer says it has, it indeed does have. The onus of keeping the block layer ahead of the FS falls on a third party - the system admin. The system admin decided it was a bright idea to use thin pools in the first place so he necessarily signed up to be liable for the hazards and risks that choice entails. It is not the job of the FS to bail his ass out.

A responsible sysadmin who chose to use thin pools might configure the initial FS size to be some modest size well within the constraints of the actual block store, and then as the FS hit say 85% utilization to run a script that investigated the state of the block layer and use resize2fs and friends to grow the FS and let the thin-pool likewise grow to fit as IO gets issued. But at some point when the competing demands of other FS on thin-pool were set to breach actual block availability the FS growth would be denied and thus userland would get signaled by the FS layer that it's out of space when it hit 100% util.
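
One variant of that, as an untested sketch (vg/data mounted at /srv/data, pool vg/pool, thresholds picked arbitrarily), growing the LV and the FS together while the pool still has real headroom:

  #!/bin/bash
  MNT=/srv/data
  LV=vg/data
  POOL=vg/pool

  FS_PCT=$(df --output=pcent "$MNT" | tail -1 | tr -dc '0-9')
  POOL_PCT=$(lvs --noheadings -o data_percent "$POOL" | tr -dc '0-9.')

  # grow by 10G only if the FS is nearly full and the pool is not
  if [ "$FS_PCT" -ge 85 ] && [ "${POOL_PCT%.*}" -lt 80 ]; then
      lvextend -r -L +10G "$LV"
  fi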

Another way (haven't tested) to 'signal' the FS as to the true state of the underlying storage is to have a sparse file that gets shrunk over time.
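
(If one did try that, the mechanics might look something like this, with the balloon file name made up; fallocate reserves filesystem blocks without writing data, so the pool itself stays largely untouched while 'df' shows less free space:

  # hide 10G of apparent free space from the filesystem
  fallocate -l 10G /srv/data/.balloon

  # hand some of it back later if the pool situation improves
  truncate -s 6G /srv/data/.balloon
)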

But either way, if you have a sudden burst of I/O from competing interests in the thin-pool, what appeared to be a safe growth allocation at one instant of time may no longer be true when the actual writes try to get fulfilled.

Think of mindless use of thin-pools as trying to cross a heavily mined beach. Bring a long stick and say your prayers because you're likely going to lose a limb.
Xen
2016-04-28 18:25:55 UTC
Continuing from previous mail I guess. But I realized something.
Post by matthew patton
A responsible sysadmin who chose to use thin pools might configure the
initial FS size to be some modest size well within the constraints of
the actual block store, and then as the FS hit say 85% utilization to
run a script that investigated the state of the block layer and use
resize2fs and friends to grow the FS and let the thin-pool likewise
grow
to fit as IO gets issued. But at some point when the competing demands
of other FS on thin-pool were set to breach actual block availability
the FS growth would be denied and thus userland would get signaled by
the FS layer that it's out of space when it hit 100% util.
Well of course what you describe here are increasingly complex
strategies that require development and should not be put on individual
administrators (or even organisations) to devise and come up with.

Growing filesystems? If you have a platform where continuous thin pool
growth is possible (and we are talking of well developed, complex setups
here) then maybe you have in-house tools to take care of all of that.

So you suggest a strategy here that involves both intelligent automatic
administration of the FS layer as well as the block layer.

A concerted strategy where for example you do have a defined thin volume
size but you constrain your FS artificially AND make its intelligence
depend on knowledge of your thin pool size. And then you have created an
intelligence where the "filesystem agent" can request growth, and
perhaps the "block level agent" may grant or deny it, such that FS
growth is staged and given hard limits at every point. And then you have
the same functionality as what I described, other than that it is more
sanely constructed at intervals.

No continuous updating, but staged growth intervals or moments.
Post by matthew patton
But either way if you have a sudden burst of I/O from competing
interests in the thin-pool, what appeared to be a safe growth
allocation
at one instant of time is not likely to be true when actual writes try
to get fulfilled.
So in the end monitoring is important, but because you use a thin pool
there are like 3 classes of situations that change:

* Filesystems will generally have more leeway because you are /able/ to
provide them with more (virtual) space to begin with, in the assumption
that you won't readily need it, but it's normally going to be there when
it does.

* Hard limits in the filesystem itself are still a use case that has no
good solution; most applications will start crashing or behaving weirdly
when out of diskspace. Freezing a filesystem (when it is not a system
disk) might be as good a mitigation strategy as anything that involves
"oh no, I am out of diskspace and now I am going to ensure endless
trouble as processes keep trying to write to that empty space - that
nonexistent space". If anything, I don't think most systems gracefully
recover from that.

Creating temporary filesystems for important parts is not all that bad.

* Thin volumes do allow you to make better use of the available space
(as per btrfs, I guess) and give many advantages in moving data around.

The only detriment really to thin for a desktop power user, so to
speak, is:

1. Unless you monitor it directly in some way, the lack of information
is going to make you feel rather annoyed and insecure.

2. Normally user tools do inform you of system status (a user-run "ls"
or "df" is enough) but you cannot have lvs information unless run as
root.

The system-config-lvm tool just runs as setuid. I can add volumes
without authenticating as root.

Regular command line tools are not accessible to the user.


So what I have been suggesting obviously seeks to address point 2. I am
more than willing to address point 1 by developing something, but I'm
not sure I will ever be able to develop again in this bleak sense of
decay I am experiencing life to be currently ;-).

Anyhow, it would never fully satisfy for me.

Even with a perfect LVM monitoring tool, I would experience a consistent
lack of feedback.

Just a simple example: I can adjust "df" to do different stuff. But any
program reporting free diskspace is going to "lie" to me in that sense.
So yes, I've chosen to use thin LVM because it is the best solution for
me right now.

At the same time, indeed, I lack information, and this information
cannot be sourced directly from the block layer because that's not how
computer software works. Computer software doesn't interface with the
block layer. It interfaces with filesystems and reports information from
there.

Technically I consider autoextend not that great of a solution either.

It raises the question: why did you not start out with a larger volume
in the first place? Are you going to keep adding disks as the thing
grows?

I mean, I don't know. If I'm some VPS user and I'm running on a
thinly-provisioned host, maybe it's nice to be oblivious. But unless my
host has a perfect failsafe setup, the only time I am going to be
notified of failure is if my volume (that I don't know about) drops or
freezes.

Would I personally like having a tool that would show at some point
something going wrong at the lower level? I think I would.

An overprovisioned system with individual volumes that individually
cannot reach their max size is a bad system.

That they can't do it all at the same time is not that much of a
problem. That is not very important.

Yet considering a different situation -- suppose this is a host with few
clients but high data requirements. Suppose there are only 4 thin
volumes. And suppose every thin volume is going to be something of 2TB,
or make it anything as large as you want.

(I just have 50GB on my vps). Suppose you had a 6TB disk and you
provisioned it for 4 clients x 2TB. Economies of scale only start to
really show their benefit with much higher numbers of clients. With 200
clients the "averaging" starts to work in your favour, giving you a
dependable system that is not going to suddenly do something weird.

But with smaller numbers you do run into the risk of something going
amiss.

The only reason lack of feedback would not be important for your clients
is if you had a large enough pool, and individual volumes would be just
a small part of that pool, say 50-100 volumes per pool.

So I guess I'm suggesting there may be a use case for thin LVM in which
you do not have this >10 number of volumes sitting in any pool.

And at that point, personally, even if I'm the client of that system, I
do want to be informed.

And I would prefer to be informed *through* the pipe that already
exists.

Thin pools lie. Yes. But it's not a lie if the space is available. It's
only a lie if the space is no longer available!

It is not designed to lie.
Zdenek Kabelac
2016-04-29 11:23:13 UTC
Post by Xen
Continuing from previous mail I guess. But I realized something.
Post by matthew patton
A responsible sysadmin who chose to use thin pools might configure the
initial FS size to be some modest size well within the constraints of
the actual block store, and then as the FS hit say 85% utilization to
run a script that investigated the state of the block layer and use
resize2fs and friends to grow the FS and let the thin-pool likewise grow
to fit as IO gets issued. But at some point when the competing demands
of other FS on thin-pool were set to breach actual block availability
the FS growth would be denied and thus userland would get signaled by
the FS layer that it's out of space when it hit 100% util.
Well of course what you describe here are increasingly complex strategies
that require development and should not be put on individual administrators
(or even organisations) to devise and come up with.
Growing filesystems? If you have a platform where continuous thin pool
growth is possible (and we are talking of well developed, complex setups
here) then maybe you have in-house tools to take care of all of that.
So you suggest a strategy here that involves both intelligent automatic
administration of the FS layer as well as the block layer.
A concerted strategy where for example you do have a defined thin volume
size but you constrain your FS artificially AND depend its intelligence on
knowledge of your thin pool size. And then you have created an
intelligence where the "filesystem agent" can request growth, and perhaps
the "block level agent" may grant or deny it such that FS growth is staged
and given hard limits at every point. And then you have the same
functionality as what I described other than that it is more sanely
constructed at intervals.
No continuous updating, but staged growth intervals or moments.
I'm not going to add much to this thread - since there is nothing really
useful here for devel. But let me highlight a few important points:


Thin-provisioning is NOT about providing a device to the upper
system levels and informing THEM about this lie in progress.

That's a complete misunderstanding of the purpose.

If you seek for a filesystem with over-provisioning - look at btrfs, zfs and
other variants...

Device target is definitely not here to solve filesystem troubles.
Thinp is about 'promising' - you as admin promised you will provide
space - we could here discuss maybe that LVM may possibly maintain
max growth size we can promise to user - meanwhile - it's still the admin
who creates thin-volume and gets WARNING if VG is not big enough when all thin
volumes would be fully provisioned.

And THAT'S IT - nothing more.

So please avoid making thinp target to be answer to ultimate question of life,
the universe, and everything - as we all know it's 42...
Post by Xen
Post by matthew patton
But either way if you have a sudden burst of I/O from competing
interests in the thin-pool, what appeared to be a safe growth allocation
at one instant of time is not likely to be true when actual writes try
to get fulfilled.
So in the end monitoring is important but because you use a thin pool
* Filesystems will generally have more leeway because you are /able/ to
provide them with more (virtual) space to begin with, in the assumption
that you won't readily need it, but it's normally going to be there when
it does.
So you try to design 'another btrfs' on top of thin provisioning?
Post by Xen
* Thin volumes do allow you to make better use of the available space (as
per btrfs, I guess) and give many advantages in moving data around.
With 'thinp' you want the simplest filesystem with robust metadata - so in
theory - 'ext4' or XFS without all the 'improvements for rotational hdd' that
have accumulated over decades of their evolution.
Post by Xen
1. Unless you monitor it directly in some way, the lack of information is
going to make you feel rather annoyed and insecure
2. Normally user tools do inform you of system status (a user-run "ls" or
"df" is enough) but you cannot have lvs information unless run as root.
You miss the 'key' details.

A thin pool is not constructing 'free maps' for each LV all the time - that's
why tools like 'thin_ls' are meant to be used from user-space.
It IS a very EXPENSIVE operation.

So before you start to present your visions here, please spend some time
reading the docs and understanding all the technology behind it.
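
(What is cheap to get from user-space is the pool-level occupancy, e.g.:

  # percent of pool data and metadata space used
  lvs -o lv_name,data_percent,metadata_percent vg/pool

  # raw used/total block counts straight from the kernel target
  # (dm name assumes VG 'vg' and pool 'pool')
  dmsetup status vg-pool-tpool

The expensive part is the per-thin-LV exclusive/shared block accounting,
which is what thin_ls computes against a metadata snapshot.)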
Post by Xen
Even with a perfect LVM monitoring tool, I would experience a consistent
lack of feedback.
Mistake of your expectations

If you are trying to operate a thin-pool near 100% fullness - you will need to
write and design a completely different piece of software - sorry, thinp
is not for you and never will be...

Simply use 'fully' provisioned - aka already existing standard volumes.
Post by Xen
Just a simple example: I can adjust "df" to do different stuff. But any
program reporting free diskspace is going to "lie" to me in that sense. So
yes I've chosen to use thin LVM because it is the best solution for me
right now.
'df' has nothing in common with 'block' layer.
Post by Xen
Technically I consider autoextend not that great of a solution either.
It begs the question: why did you not start out with a larger volume in
the first place? You going to keep adding disks as the thing grows?
Very simple answer, related to the misunderstanding of the purpose.

Take it as motivation that you want to reduce the amount of active devices in
e.g. your 'datacenter'.

So you start with a 1TB volume - while the user may immediately create and
format and use e.g. a 10TB volume. As the volume fills over time - you add
more devices to your vg (buy/pay for more disk space/energy).
But the user doesn't have to resize his filesystem or bear the other costs of
maintaining a slowly growing filesystem.

Of course if the first thing the user does is to e.g. 'dd' over the full 10TB
volume, there are not going to be any savings!

But if you've never planned to buy 10TB - you should never have allowed such
a big volume to be created in the first place!

With thinp you basically postpone or skip (fsresize) some operations.
Post by Xen
An overprovisioned system with individual volumes that individually cannot
reach their max size is a bad system.
Yes - it is a bad system.

So don't do it - and don't plan to use it - it's really that simple.

ThinP is NOT virtual disk-space for free...
Post by Xen
Thin pools lie. Yes. But it's not a lie if the space is available. It's
only a lie if the space is no longer available!
It is not designed to lie.
Actually it's the core principle!
It lies (or better said, uses the admin's promises) that there is going to be
disk space. And it's the admin's responsibility to fulfill it.

If you know up front that you will quickly need all the disk space - then
using thinp and expecting a miracle is not going to work.


Regards

Zdenek
Mark Mielke
2016-05-02 14:32:26 UTC
Post by Zdenek Kabelac
Thin-provisioning is NOT about providing a device to the upper
system levels and informing THEM about this lie in progress.
That's a complete misunderstanding of the purpose.
I think this line of thought is a bit of a strawman.

Thin provisioning is entirely about presenting the upper layer with a
logical view which does not match the physical view, including the
possibility for such things as over provisioning. How much of this detail
is presented to the higher layer is an implementation detail and has
nothing to do with "purpose". The purpose or objective is to allow volumes
that are not fully allocated in advance. This is what "thin" means, as
compared to "thick".
Post by Zdenek Kabelac
If you seek for a filesystem with over-provisioning - look at btrfs, zfs
and other variants...
I have to say that I am disappointed with this view, particularly if this
is a view held by Red Hat. To me this represents a misunderstanding of the
purpose for over-provisioning, and a misunderstanding of why thin volumes
are required. It seems there is a focus on "filesystem" in the above
statement, and that this may be the point of debate.

When a storage provider provides a block device (EMC, NetApp, ...) and a
snapshot capability, I expect to be able to take snapshots with low
overhead. The previous LVM model for snapshots was really bad, in that it
was not low overhead. We use this capability for many purposes including:

1) Instantiating test environments or dev environments from a snapshot of
production, with copy-on-write to allow for very large full-scale
environments to be constructed quickly and with low overhead. In one of our
examples, this includes an example where we have about 1 TByte of JIRA and
Confluence attachments collected over several years. It is exposed over NFS
by the NetApp device, but in the backend it is a volume. This volume is
snapshotted and then exposed as a different volume with copy-on-write
characteristics. The storage allocation is monitored, and if it is
exceeded, it is known that there will be particular behaviour. I believe in
our case, the behaviour is that the snapshot becomes unusable.

2) Frequent snapshots. In many of our use cases, we may take snapshots
every 15 minutes, every hour, and every day, keeping 3 or more of each. If
this storage had to be allocated in full, this amounts to at least 10X the
storage cost. Using snapshots, and understanding the rate of churn, we can
use closer to 1X or 2X the storage overhead, instead of 10X the storage
overhead.

3) Snapshot as a means of achieving a consistent backup at low cost of
outage or storage overhead. If we "quiesce" the application (flush buffers,
put new requests on hold, etc.) take the snapshot, and then "resume" the
application, this can be achieved in a matter of seconds or less. Then, we
can mount the snapshot at a separate mount point and proceed with a more
intensive backup process against a particular consistent point-in-time.
This can be fast and require closer to 1X the storage overhead, instead of
2X the storage overhead.
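
With LVM thin that sequence is quite short; roughly, with made-up names
vg/app mounted at /srv/app:

  fsfreeze -f /srv/app                 # quiesce
  lvcreate -s -n app-backup vg/app     # thin snapshot, takes seconds
  fsfreeze -u /srv/app                 # resume

  # thin snapshots carry the activation-skip flag by default
  lvchange -ay -K vg/app-backup
  mount -o ro /dev/vg/app-backup /mnt/backup
  # ... run the intensive backup against /mnt/backup ...
  umount /mnt/backup
  lvremove vg/app-backup

(For XFS the snapshot mount would also want -o nouuid.)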

In all of these cases - we'll buy more storage if we need more storage.
But, we're not going to use BTRFS or ZFS to provide the above capabilities,
just because this is your opinion on the matter. Storage vendors of
reputation and market presence sell these capabilities as features, and we
pay a lot of money to have access to these features.

In the case of LVM... which is really the point of this discussion... LVM
is not necessarily going to be used or available on a storage appliance.
The LVM use case, at least for us, is for storage which is thinly
provisioned by the compute host instead of the backend storage appliance.
This includes:

1) Local disks, particularly included local flash drives that are local to
achieve higher levels of performance than can normally be achieved with a
remote storage appliance.

2) Local file systems, on remote storage appliances, using a protocol such
as iSCSI to access the backend block device. This might be the case where
we need better control of the snapshot process, or to abstract the
management of the snapshots from the backend block device. In our case, we
previously used an EMC over iSCSI for one of these use cases, and we are
switching to NetApp. However, instead of embedding NetApp-specific logic
into our code, we want to use LVM on top of iSCSI, and re-use the LVM thin
pool capabilities from the host, such that we don't care what storage is
used on the backend. The management scripts will work the same whether the
storage is local (the first case above) or not (the case we are looking
into now).

In both of these cases, we have a need to take snapshots and manage them
locally on the host, instead of managing them on a storage appliance. In
both cases, we want to take many light weight snapshots of the block
device. You could argue that we should use BTRFS or ZFS, but you should
know full well that both of these have caveats as well. We want to use XFS
or EXT4 as our needs require, and still have the ability to take
light-weight snapshots.

Generally, I've seen the people who argue that thin provisioning is a
"lie", tend to not be talking about snapshots. I have a sense that you are
talking more as storage providers for customers, and talking more about
thinly provisioning content for your customers. In this case - I think I
would agree that it is a "lie" if you don't make sure to have the storage
by the time it is required. But, I think this is a very small use case in
reality. I think large service providers would use Ceph or EMC or NetApp,
or some such technology to provision large amounts of storage per customer,
and LVM would be used more at the level of a single customer, or a single
machine. In these cases, I would expect that LVM thin volumes should not be
used across multiple customers without understanding the exact type of
churn expected, and to understand what maximum allocation would be
required. In the case of our IT team and EMC or NetApp, they mostly avoid
the use of thin volumes for "cross customer" purposes, and instead use thin
volumes for a specific customer, for a specific need. In the case of Amazon
EC2, for example... I would use EBS for storage, and expect that even if it
is "thin", Amazon would make sure to have enough storage to meet my
requirement if I need them. But, I would use LVM on my Amazon EC2 instance,
and I would expect to be able to use LVM thin pool snapshots to over
provision my own per-machine storage requirements by creating multiple
snapshots of the underlying storage, with a full understanding of the
amount of churn that I expect to occur, and a full understanding of the
need to monitor.
Post by Zdenek Kabelac
Device target is definitely not here to solve filesystem troubles.
Thinp is about 'promising' - you as admin promised you will provide
space - we could here discuss maybe that LVM may possibly maintain
max growth size we can promise to user - meanwhile - it's still the admin
who creates thin-volume and gets WARNING if VG is not big enough when all
thin volumes would be fully provisioned.
And THAT'S IT - nothing more.
So please avoid making thinp target to be answer to ultimate question of
life, the universe, and everything - as we all know it's 42...
The WARNING is a cover-your-ass type warning that is showing up
inappropriately for us. It is warning me something that I should already
know, and it is training me to ignore warnings. Thinp doesn't have to be
the answer to everything. It does, however, need to provide a block device
visible to the file system layer, and it isn't invalid for the file system
layer to be able to query about the nature of the block device, such as
"how much space do you *really* have left?"

This seems to be a crux of this debate between you and the other people.
You think the block storage should be as transparent as possible, as if the
storage was not thin. Others, including me, think that this theory is
impractical, as it leads to edge cases where the file system could choose
to fail in a cleaner way, but it gets too far today leading to a more
dangerous failure when it allocates some block, but not some other block.

Exaggerating this to say that thinp would become everything, and the answer
to the ultimate question of life, weakens your point to me, as it means
that you are seeing things in far too black + white, whereas real life is
often not black + white.

It is your opinion that extending thin volumes to allow the file system to
have more information is breaking some fundamental law. But, in practice,
this sort of thing is done all of the time. "Size", "Read only",
"Discard/Trim Support", "Physical vs Logical Sector Size", ... are all
information queried from the device, and used by the file system. If it is
a general concept that applies to many different device targets, and it
will help the file system make better and smarter choices, why *shouldn't*
it be communicated? Who decides which ones are valid and which ones are not?

I didn't disagree with all of your points. But, enough of them seemed to be
directly contradicting my perspective on the matter that I felt it
important to respond to them.

Mostly, I think everybody has a set of opinions and use cases in mind when
they come to their conclusions. Please don't ignore mine. If there is
something unreasonable above, please let me know.
--
Mark Mielke <***@gmail.com>
Gionatan Danti
2016-05-03 10:15:44 UTC
Post by Mark Mielke
2) Frequent snapshots. In many of our use cases, we may take snapshots
every 15 minutes, every hour, and every day, keeping 3 or more of each.
If this storage had to be allocated in full, this amounts to at least
10X the storage cost. Using snapshots, and understanding the rate of
churn, we can use closer to 1X or 2X the storage overhead, instead of
10X the storage overhead.
3) Snapshot as a means of achieving a consistent backup at low cost of
outage or storage overhead. If we "quiesce" the application (flush
buffers, put new requests on hold, etc.) take the snapshot, and then
"resume" the application, this can be achieved in a matter of seconds or
less. Then, we can mount the snapshot at a separate mount point and
proceed with a more intensive backup process against a particular
consistent point-in-time. This can be fast and require closer to 1X the
storage overhead, instead of 2X the storage overhead.
This is exactly my main use case.
Post by Mark Mielke
The WARNING is a cover-your-ass type warning that is showing up
inappropriately for us. It is warning me something that I should already
know, and it is training me to ignore warnings. Thinp doesn't have to be
the answer to everything. It does, however, need to provide a block
device visible to the file system layer, and it isn't invalid for the
file system layer to be able to query about the nature of the block
device, such as "how much space do you *really* have left?"
As this warning appears on snapshots, it is quite annoying in fact. On
the other hand, I fully understand that the developers want to avoid
"blind" overprovisioning. A command-line (or an lvm.conf) option to
override the warning would be welcomed, though.
Post by Mark Mielke
This seems to be a crux of this debate between you and the other people.
You think the block storage should be as transparent as possible, as if
the storage was not thin. Others, including me, think that this theory
is impractical, as it leads to edge cases where the file system could
choose to fail in a cleaner way, but it gets too far today leading to a
more dangerous failure when it allocates some block, but not some other
block.
...
It is your opinion that extending thin volumes to allow the file system
to have more information is breaking some fundamental law. But, in
practice, this sort of thing is done all of the time. "Size", "Read
only", "Discard/Trim Support", "Physical vs Logical Sector Size", ...
are all information queried from the device, and used by the file
system. If it is a general concept that applies to many different device
targets, and it will help the file system make better and smarter
choices, why *shouldn't* it be communicated? Who decides which ones are
valid and which ones are not?
This seems reasonable. After all, a simple "lsblk" already reports
plenty of information to the upper layer, so adding
"REAL_AVAILABLE_SPACE" info should not be infeasible.
Post by Mark Mielke
I didn't disagree with all of your points. But, enough of them seemed to
be directly contradicting my perspective on the matter that I felt it
important to respond to them.
Thinp really is a wonderful piece of technology, and I really thank the
developers for it.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2016-05-03 11:42:48 UTC
Post by Mark Mielke
The WARNING is a cover-your-ass type warning that is showing up
inappropriately for us. It is warning me something that I should already
know, and it is training me to ignore warnings. Thinp doesn't have to be
the answer to everything. It does, however, need to provide a block
device visible to the file system layer, and it isn't invalid for the
file system layer to be able to query about the nature of the block
device, such as "how much space do you *really* have left?"
As this warning appears on snapshots, it is quite annoying in fact. On the
other hand, I fully understand that the developers want to avoid "blind"
overprovisioning. A commmand-line (or a lvm.conf) option to override the
warning would be welcomed, though.
Since the number of reports from people who used a thin-pool without realizing
what they could do wrong was too high - a rather 'dramatic' WARNING approach is
used. Advised usage is with dmeventd & monitoring.
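
That is, something along these lines (pool name illustrative):

  # lvm.conf: activation { monitoring = 1 } plus the
  # thin_pool_autoextend_* settings, dmeventd running, and:
  lvchange --monitor y vg/pool

  # check with e.g.:
  lvs -o lv_name,seg_monitor vg/pool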

The danger with having 'disable' options like this is that many distros do
decide themselves about the best defaults for their users, but Ubuntu with
their issue_discards=1 has shown us to be more careful, as then it's not
Ubuntu but lvm2 which is blamed for dataloss.

Options are evaluated...
Post by Mark Mielke
This seems to be a crux of this debate between you and the other people.
You think the block storage should be as transparent as possible, as if
the storage was not thin. Others, including me, think that this theory
is impractical, as it leads to edge cases where the file system could
choose to fail in a cleaner way, but it gets too far today leading to a
more dangerous failure when it allocates some block, but not some other
block.
...
It is your opinion that extending thin volumes to allow the file system
to have more information is breaking some fundamental law. But, in
practice, this sort of thing is done all of the time. "Size", "Read
only", "Discard/Trim Support", "Physical vs Logical Sector Size", ...
are all information queried from the device, and used by the file
system. If it is a general concept that applies to many different device
targets, and it will help the file system make better and smarter
choices, why *shouldn't* it be communicated? Who decides which ones are
valid and which ones are not?
This seems reasonable. After all, a simple "lsblk" already reports plenty of
information to the upper layer, so adding a "REAL_AVAILABLE_SPACE" info should
not be infeasible.
What's wrong with 'lvs'?
This will give you the available space in thin-pool.

However combining this number with number of free-space in filesystem - that
needs magic.

When you create a file with a hole in your filesystem - how much free space do you
have?

If you have 2 filesystems in a single thin-pool - does each take 1/2?
It's all about lying....


Regards

Zdenek
Gionatan Danti
2016-05-03 13:15:45 UTC
Permalink
Post by Zdenek Kabelac
The danger with having 'disable' options like this is that many distros decide
for themselves what the best defaults for their users are, but Ubuntu with their
issue_discards=1 showed us we have to be more careful, as it is then not Ubuntu
but lvm2 which gets blamed for data loss.
Options are evaluated...
Very true. "Sane defaults" is one of the reasons why I (happily) use
RHEL/CentOS for hypervisors and other critical tasks.
Post by Zdenek Kabelac
What's wrong with 'lvs'?
This will give you the available space in thin-pool.
Oh, absolutely nothing wrong with lvs. I used "lsblk" only as an example
of the block device/layer exposing some (lack of) features to upper layer.

One note about the continued "suggestion" to use BTRFS. While for
relatively simple use cases it can be OK, for more demanding
(rewrite-heavy) scenarios (e.g. hypervisors, databases, etc.) it performs
*really* badly, even when "nocow" is enabled.

I had much more luck, performance-wise, with ZFS. Too bad ZoL is an
out-of-tree component (albeit very easy to install and, in my
experience, quite stable also).

Anyway, ThinLVM + XFS is an extremely good combo in my opinion.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8
Zdenek Kabelac
2016-05-03 15:45:11 UTC
Permalink
Post by Zdenek Kabelac
What's wrong with 'lvs'?
This will give you the available space in thin-pool.
Oh, absolutely nothing wrong with lvs. I used "lsblk" only as an example of
the block device/layer exposing some (lack of) features to upper layer.
One note about the continued "suggestion" to use BTRFS. While for relatively
It's not a 'continued' suggestion.

It's just an example of a solution where the 'filesystem & block layer' are tied
together. Every solution has some advantages and disadvantages.
simple use cases it can be OK, for more demanding (rewrite-heavy) scenarios
(e.g. hypervisors, databases, etc.) it performs *really* badly, even when "nocow" is
enabled.
So far I'm convinced the layered design gives the user more freedom - at the price
of bigger space usage.
Anyway, ThinLVM + XFS is an extremely good combo in my opinion.
Yes, though ext4 is quite good as well...

Zdenek
Zdenek Kabelac
2016-05-03 09:45:29 UTC
Permalink
Post by Zdenek Kabelac
Thin-provisioning is NOT about providing a device to the upper
system levels and informing THEM about this lie in progress.
That's a complete misunderstanding of the purpose.
I think this line of thought is a bit of a strawman.
Thin provisioning is entirely about presenting the upper layer with a logical
view which does not match the physical view, including the possibility for
such things as over provisioning. How much of this detail is presented to the
higher layer is an implementation detail and has nothing to do with "purpose".
The purpose or objective is to allow volumes that are not fully allocated in
advance. This is what "thin" means, as compared to "thick".
If you seek for a filesystem with over-provisioning - look at btrfs, zfs
and other variants...
I have to say that I am disappointed with this view, particularly if this is a
view held by Red Hat. To me this represents a misunderstanding of the purpose
Hi

So first - this is AMAZING deduction you've just shown.

You've cut a sentence out of the middle of a thread and used it as a kind of evidence
that Red Hat is suggesting the usage of ZFS, Btrfs - sorry man - read this thread
again...

Personally I'd never use those 2 filesystems as they are too complex for
recovery. But I've no problem advising users to try them if that's what fits
their needs best and they believe in the 'all-in-one' logic.
('Hitting the wall' is the best learning exercise in Xen's case anyway...)
Post by Zdenek Kabelac
When a storage provider provides a block device (EMC, NetApp, ...) and a
snapshot capability, I expect to be able to take snapshots with low overhead.
The previous LVM model for snapshots was really bad, in that it was not low
This usage is perfectly fine. It's been designed this way from day 1.
Post by Zdenek Kabelac
1) Instantiating test environments or dev environments from a snapshot of
production, with copy-on-write to allow for very large full-scale environments
to be constructed quickly and with low overhead. In one of our examples, this
includes an example where we have about 1 TByte of JIRA and Confluence
attachments collected over several years. It is exposed over NFS by the NetApp
device, but in the backend it is a volume. This volume is snapshotted and then
exposed as a different volume with copy-on-write characteristics. The storage
allocation is monitored, and if it is exceeded, it is known that there will be
particular behaviour. I believe in our case, the behaviour is that the
snapshot becomes unusable.
A thin pool does not distinguish between snapshot and origin.
All thin volumes share the same pool space.

It's up to the monitoring application to decide whether some snapshots could be
erased to reclaim some space in the thin-pool.

The recent tool thin_ls shows how much data is exclusively held by
individual thin volumes.
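For reference, a rough sketch of how thin_ls is typically driven (untested here;
the vg/pool device names are placeholders):

    # take a metadata snapshot so thin_ls can safely read the live pool metadata
    dmsetup message vg-pool-tpool 0 reserve_metadata_snap
    thin_ls -m /dev/mapper/vg-pool_tmeta
    dmsetup message vg-pool-tpool 0 release_metadata_snap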

It's a major difference compared with old snapshots and their 'invalidation' logic.
Post by Zdenek Kabelac
2) Frequent snapshots. In many of our use cases, we may take snapshots every
15 minutes, every hour, and every day, keeping 3 or more of each. If this
storage had to be allocated in full, this amounts to at least 10X the storage
cost. Using snapshots, and understanding the rate of churn, we can use closer
to 1X or 2X the storage overhead, instead of 10X the storage overhead.
Sure - snapper... whatever you name it.
It's just up to the admin to maintain space availability in the thin-pool.
Post by Zdenek Kabelac
3) Snapshot as a means of achieving a consistent backup at low cost of outage
or storage overhead. If we "quiesce" the application (flush buffers, put new
requests on hold, etc.) take the snapshot, and then "resume" the application,
this can be achieved in a matter of seconds or less. Then, we can mount the
snapshot at a separate mount point and proceed with a more intensive backup
process against a particular consistent point-in-time. This can be fast and
require closer to 1X the storage overhead, instead of 2X the storage overhead.
In all of these cases - we'll buy more storage if we need more storage. But,
we're not going to use BTRFS or ZFS to provide the above capabilities, just
And where exactly did I advise you specifically to switch to those filesystems?

My advice was clearly given to a user who seeks a filesystem COMBINED with the
block layer.
Post by Zdenek Kabelac
because this is your opinion on the matter. Storage vendors of reputation and
market presence sell these capabilities as features, and we pay a lot of money
to have access to these features.
In the case of LVM... which is really the point of this discussion... LVM is
not necessarily going to be used or available on a storage appliance. The LVM
use case, at least for us, is for storage which is thinly provisioned by the
1) Local disks, particularly including local flash drives, in order to
achieve higher levels of performance than can normally be achieved with a
remote storage appliance.
2) Local file systems, on remote storage appliances, using a protocol such as
iSCSI to access the backend block device. This might be the case where we need
better control of the snapshot process, or to abstract the management of the
snapshots from the backend block device. In our case, we previously used an EMC
over iSCSI for one of these use cases, and we are switching to NetApp.
However, instead of embedding NetApp-specific logic into our code, we want to
use LVM on top of iSCSI, and re-use the LVM thin pool capabilities from the
host, such that we don't care what storage is used on the backend. The
management scripts will work the same whether the storage is local (the first
case above) or not (the case we are looking into now).
In both of these cases, we have a need to take snapshots and manage them
locally on the host, instead of managing them on a storage appliance. In both
cases, we want to take many light weight snapshots of the block device. You
could argue that we should use BTRFS or ZFS, but you should full well know
that both of these have caveats as well. We want to use XFS or EXT4 as our
needs require, and still have the ability to take light-weight snapshots.
Which is exactly the actual Red Hat strategy. XFS is strongly pushed forward.
Post by Zdenek Kabelac
Generally, I've seen the people who argue that thin provisioning is a "lie",
tend to not be talking about snapshots. I have a sense that you are talking
more as storage providers for customers, and talking more about thinly
provisioning content for your customers. In this case - I think I would agree
that it is a "lie" if you don't make sure to have the storage by the time it
Thin-provisioning simply requires RESPONSIBLE admins - if you are not willing
to take care of your thin-pools - don't use them - lots of kittens may die -
and that's all this thread was about - it had absolutely nothing to do
with Red Hat or any of your conspiracy theories, like it pushing you
to switch to a filesystem you don't like...
Post by Zdenek Kabelac
Device target is definitely not here to solve filesystem troubles.
Thinp is about 'promising' - you as admin promised you will provide
space - we could here discuss maybe that LVM may possibly maintain
max growth size we can promise to user - meanwhile - it's still the admin
who creates thin-volume and gets WARNING if VG is not big enough when all
thin volumes would be fully provisioned.
And THAT'S IT - nothing more.
So please avoid making thinp target to be answer to ultimate question of
life, the universe, and everything - as we all know it's 42...
The WARNING is a cover-your-ass type warning that is showing up
inappropriately for us. It is warning me something that I should already know,
and it is training me to ignore warnings. Thinp doesn't have to be the answer
to everything. It does, however, need to provide a block device visible to the
file system layer, and it isn't invalid for the file system layer to be able
to query about the nature of the block device, such as "how much space do you
*really* have left?"
This is not such useful information - as this state is dynamic.
The only 'valid' query is - are we out of space...
And that's what you get from the block layer now - ENOSPC.
Filesystems may then react differently to that than to a plain EIO.


I'd be really curious what the use case for this information would even be?

If you care about e.g. 'df' - then let's fix 'df' - it may check whether the fs is on a
thinly provisioned volume, ask the provisioner about the free space in the pool, and
combine the results in some way...
Just DO NOT mix this with filesystem layer...
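As an illustration of that userspace-only approach, an untested sketch (vg/pool and
the mount point are placeholders):

    #!/bin/sh
    # report the smaller of: the free space the fs believes it has,
    # and the free space actually left in the thin-pool
    FS_FREE=$(df --output=avail -B1 /mnt/thinfs | tail -n1 | tr -d ' ')
    POOL_SIZE=$(lvs --noheadings --units b --nosuffix -o lv_size vg/pool | tr -d ' ')
    POOL_USED=$(lvs --noheadings -o data_percent vg/pool | tr -d ' ')
    POOL_FREE=$(awk -v s="$POOL_SIZE" -v p="$POOL_USED" 'BEGIN { printf "%.0f", s * (100 - p) / 100 }')
    if [ "$POOL_FREE" -lt "$FS_FREE" ]; then echo "$POOL_FREE"; else echo "$FS_FREE"; fi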

What would the filesystem do with this info ?

Should this randomly decide to drop files according to thin-pool workload ?

Would you change every filesystem in kernel to implement such policies ?

It's really the thin-pool monitoring which tries to add some space when it's
getting low and may implement further policies to i.e. drop some snapshots.

However, what is being implemented is better 'allocation' logic for pool chunk
provisioning (for XFS ATM) - as the rather 'dated' methods for deciding where to
store incoming data do not apply efficiently to provisioned chunks.
Post by Zdenek Kabelac
This seems to be a crux of this debate between you and the other people. You
think the block storage should be as transparent as possible, as if the
storage was not thin. Others, including me, think that this theory is
impractical, as it leads to edge cases where the file system could choose to
It's purely practical and it's the 'crucial' difference between

i.e. thin+XFS/ext4 and BTRFS.
Post by Zdenek Kabelac
fail in a cleaner way, but it gets too far today leading to a more dangerous
failure when it allocates some block, but not some other block.
The best thing to do is to stop immediately on error and turn the fs 'read-only' -
which is exactly 'ext4 + errors=remount-ro'.
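For completeness, that ext4 behaviour is just a mount option / superblock default;
the device and mount point below are placeholders:

    # remount read-only instead of continuing after an error
    mount -o errors=remount-ro /dev/vg/thinlv /mnt/thinfs
    # or set it persistently as the superblock default
    tune2fs -e remount-ro /dev/vg/thinlv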

Your proposal to make XFS a different kind of BTRFS monster is simply not
going to work - that's exactly what BTRFS is doing - it's a waste of time to do it
again.

BTRFS has a built-in volume manager and combines the fs layer with the block layer
(making many layers in the kernel quite ugly - e.g. device major:minor).

This is a different logic from the one lvm2 takes - where layers are separated with
clearly defined boundaries.

So again - if you don't like a separate thin block layer + XFS fs layer and you
want to see 'merged' technology - there is BTRFS/ZFS/... which try to
combine raid/caching/encryption/snapshots... - but there are no plans to
'reinvent' the same from the other side with lvm2/dm....
Post by Zdenek Kabelac
Exaggerating this to say that thinp would become everything, and the answer to
the ultimate question of life, weakens your point to me, as it means that you
are seeing things in far too black + white, whereas real life is often not
black + white.
Yes, we prefer clearly defined borders and responsibilities which can be well
tested and verified.

Don't compare life with software :)
Post by Zdenek Kabelac
It is your opinion that extending thin volumes to allow the file system to
have more information is breaking some fundamental law. But, in practice, this
sort of thing is done all of the time. "Size", "Read only", "Discard/Trim
Support", "Physical vs Logical Sector Size", ... are all information queried
from the device, and used by the file system. If it is a general concept that
applies to many different device targets, and it will help the file system
make better and smarter choices, why *shouldn't* it be communicated? Who
decides which ones are valid and which ones are not?
lvm2 is a logical volume manager. Just think about it.

In the future your thinLV might be turned into a plain 'linear' LV, just as your
linear LV might become a member of a thin-pool (planned features).

Your LV could be pvmove(d) to a completely different drive with different
geometry...

These are topics for lvm2/dm.

We are not designing a filesystem - and we plan to stay transparent to them.

And it's up to you to understand the reasoning.
Post by Zdenek Kabelac
I didn't disagree with all of your points. But, enough of them seemed to be
directly contradicting my perspective on the matter that I felt it important
to respond to them.
It is an Open Source World - "so send a patch" and implement your visions -
again, it is that easy - we do it every day at Red Hat...
Post by Zdenek Kabelac
Mostly, I think everybody has a set of opinions and use cases in mind when
they come to their conclusions. Please don't ignore mine. If there is
something unreasonable above, please let me know.
It's not about ignoring - it's about having a certain amount of man-hours for
the work, and you have to choose how to 'spend' them.

And in this case and your ideas you will need to spend/invest your time....
(Just like Xen).


Regards

Zdenek
Mark Mielke
2016-05-03 10:41:37 UTC
Permalink
Post by Zdenek Kabelac
Post by Zdenek Kabelac
If you seek for a filesystem with over-provisioning - look at btrfs, zfs
and other variants...
I have to say that I am disappointed with this view, particularly if this is a
view held by Red Hat. To me this represents a misunderstanding of the purpose
So first - this is AMAZING deduction you've just shown.
You've cut sentence out of the middle of a thread and used as kind of evidence
that Red Hat is suggesting usage of ZFS, Btrfs - sorry man - read this
thread again...
My intent wasn't to cut a sentence in the middle. I responded to each
sentence in its place. I think it really comes down to this:

This seems to be a crux of this debate between you and the other people. You
Post by Zdenek Kabelac
Post by Zdenek Kabelac
think the block storage should be as transparent as possible, as if the
storage was not thin. Others, including me, think that this theory is
impractical, as it leads to edge cases where the file system could choose to
It's purely practical and it's the 'crucial' difference between
i.e. thin+XFS/ext4 and BTRFS.
I think I captured the crux of this pretty well. If anybody suggests that
there could be value to exposing any information related to the nature of
the "thinly provisioned block devices", you suggest that the only route
forwards here is BTRFS and ZFS. You are saying directly and indirectly,
that anybody who disagrees with you should switch to what you feel are the
only solutions that are in this space, and that LVM should never be in this
space.

I think I understand your perspective. However, I don't agree with it. I
don't agree that the best solution is one that fails at the last instant
with ENOSPC and/or for the file system to become read-only. I think there
is a whole lot of grey possibilities between the polar extremes of
"BTRFS/ZFS" vs "thin+XFS/ext4 with last instant failure".

What started me on this list was the CYA mandatory warning about over
provisioning that I think is inappropriate, and causing us tooling
problems. But seeing the debate unfold, and having seen some related
failures in the Docker LVM thin pool case where the system may completely
lock up, I have a conclusion that this type of failure represents a
fundamental difference in opinion around what thin volumes are for, and
what place they have. As I see them as highly valuable for various reasons
including Docker image layers (something Red Hat appears to agree with,
having targeted LVM thinp instead of the union file systems), and the
snapshot use cases I presented prior, I think there must be a way to avoid
the worst scenarios, if the right people consider all the options, and
don't write off options prematurely due to preconceived notions about what
is and what is not appropriate in terms of communication of information
between system layers.

There are many types of information that *are* passed from the block device
layer to the file system layer. I don't see why awareness of thin volumes,
should not be one of them.

For example, and I'm not pretending this is the best idea that should be
implemented, but just to see where the discussion might lead:

The Linux kernel needs to deal with problems such as memory being swapped
out due to memory pressures. In various cases, it is dangerous to swap
memory out. The memory can be protected from being swapped out where
required using various technique such as pinning pages. This takes up extra
RAM, but ensures that the memory can be safely accessed and written as
required. If the file system has particular areas of importance that need
to be writable to prevent file system failure, perhaps the file system
should have a way of communicating this to the volume layer. The naive
approach here might be to preallocate these critical blocks before
proceeding with any updates to these blocks, such that the failure
situations can all be "safe" situations, where ENOSPC can be returned
without a danger of the file system locking up or going read-only.

Or, maybe I am out of my depth, and this is crazy talk... :-)

(Personally, I'm not really needing a "df" to approximate available
storage... I just don't want the system to fail badly in the "out of disk
space" scenario... I can't speak for others, though... I do *not* want
BTRFS/ZFS... I just want a sanely behaving LVM + XFS...)
--
Mark Mielke <***@gmail.com>
Zdenek Kabelac
2016-05-03 11:18:20 UTC
Permalink
Post by Zdenek Kabelac
If you seek for a filesystem with over-provisioning - look at
btrfs, zfs
and other variants...
I have to say that I am disappointed with this view, particularly if
this is a
view held by Red Hat. To me this represents a misunderstanding of the
purpose
So first - this is AMAZING deduction you've just shown.
You've cut sentence out of the middle of a thread and used as kind of evidence
that Red Hat is suggesting usage of ZFS, Btrfs - sorry man - read this
thread again...
My intent wasn't to cut a sentence in the middle. I responded to the each
This seems to be a crux of this debate between you and the other
people. You
think the block storage should be as transparent as possible, as if the
storage was not thin. Others, including me, think that this theory is
impractical, as it leads to edge cases where the file system could
choose to
It's purely practical and it's the 'crucial' difference between
i.e. thin+XFS/ext4 and BTRFS.
I think I captured the crux of this pretty well. If anybody suggests that
there could be value to exposing any information related to the nature of the
"thinly provisioned block devices", you suggest that the only route forwards
here is BTRFS and ZFS. You are saying directly and indirectly, that anybody
who disagrees with you should switch to what you feel are the only solutions
that are in this space, and that LVM should never be in this space.
I think I understand your perspective. However, I don't agree with it. I don't
The perspective of the lvm2 team is pretty simple: as a small team there is
absolutely no time to venture down this road.

Also technically you are crying on the wrong grave/barking up the wrong tree.

Try to push your visions to some filesystem developers.
Post by Zdenek Kabelac
agree that the best solution is one that fails at the last instant with ENOSPC
and/or for the file system to become read-only. I think there is a whole lot
of grey possibilities between the polar extremes of "BTRFS/ZFS" vs
"thin+XFS/ext4 with last instant failure".
The other point is that the technical difficulties are very high and you are really
asking for Btrfs logic; you just fail to admit this to yourself.

It's been the 'core' idea of Btrfs to combine volume management and filesystem
together for a better future...
Post by Zdenek Kabelac
What started me on this list was the CYA mandatory warning about over
provisioning that I think is inappropriate, and causing us tooling problems.
But seeing the debate unfold, and having seen some related failures in the
Docker LVM thin pool case where the system may completely lock up, I have a
conclusion that this type of failure represents a fundamental difference in
opinion around what thin volumes are for, and what place they have. As I see
them as highly valuable for various reasons including Docker image layers
(something Red Hat appears to agree with, having targeted LVM thinp instead of
As you mention Docker - again, I've no idea why you think there is a 'one-way'
path?

Red Hat is not a political party with a single leading direction.

Many variants are being implemented in parallel (yes, even in Red Hat) and the
best one will win over time - but there is no single 'directive' decision.
It really is the open source way.
Post by Zdenek Kabelac
the union file systems), and the snapshot use cases I presented prior, I think
there must be a way to avoid the worst scenarios, if the right people consider
all the options, and don't write off options prematurely due to preconceived
notions about what is and what is not appropriate in terms of communication of
information between system layers.
There are many types of information that *are* passed from the block device
layer to the file system layer. I don't see why awareness of thin volumes,
should not be one of them.
Find a use-case, build a patch, show results, and add info on what the filesystem
should do when its underlying device changes its characteristics.

There is an API between the block layer and the fs layer - so propose an extension
with a patch for a filesystem, with a clearly defined benefit.

That's my best advice.
Post by Zdenek Kabelac
communicating this to the volume layer. The naive approach here might be to
preallocate these critical blocks before proceeding with any updates to these
blocks, such that the failure situations can all be "safe" situations, where
ENOSPC can be returned without a danger of the file system locking up or going
read-only.
Or, maybe I am out of my depth, and this is crazy talk... :-)
Basically you are not realizing how much work is behind all those simple
sentences. At this moment 'fallocate' is being discussed...
But it's more or less a 'nuclear weapon' for thin provisioning.
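For context, a crude userspace illustration of why this is such a blunt instrument
(untested; the path is a placeholder):

    # fallocate only reserves blocks in the filesystem's own accounting;
    # a thin-pool allocates chunks when data is actually written,
    # so pinning real pool space today means zero-filling:
    dd if=/dev/zero of=/mnt/thinfs/reserve bs=1M count=1024 oflag=direct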
Post by Zdenek Kabelac
(Personally, I'm not really needing a "df" to approximate available storage...
I just don't want the system to fail badly in the "out of disk space"
scenario... I can't speak for others, though... I do *not* want BTRFS/ZFS... I
just want a sanely behaving LVM + XFS...)
Yes - that's what we try to improve daily.


Regards

Zdenek
Xen
2016-05-03 12:42:16 UTC
Permalink
Post by Zdenek Kabelac
I'm not going to add much to this thread - since there is nothing
If you like to keep things short now I will give short replies. Also
other people have responded and I haven't read everything yet.
Post by Zdenek Kabelac
it's still the admin who creates thin-volume and gets WARNING if VG is
not big enough when
all thin volumes would be fully provisioned.
That is just what we could call insincere or that beautiful strange word
that I cannot remember.

The opposite of innocuous. Disingenuous (thank you dictionary).

You know perfectly well that this warning doesn't do much of anything
when all people approach thin from the view point of wanting to
overprovision.

That is like saying "Don't enter this pet store here, because you might
buy pets, and pets might scratch your arm. Now what can we serve you
with?".

It's those insincere warnings many business or ideas give to people to
supposedly warn them in advance of what they want to do anyway. "I told
you it was a bad idea, now what can we do for you? :) :) :) :)". It's a
way of being politically correct mostly.

You want to do it anyway. But now someone tells you it might be a bad
idea even if both of you want it.
Post by Zdenek Kabelac
So you try to design 'another btrfs' on top of thin provisioning?
Maybe I am. At least you recognise that I am trying to design something,
many people would just throw it in the wastebasket with "empty
complains".

That in itself.... ;-)

speaks some volumes.

But let's talk about real volumes now :p.

There's nothing bad about btrfs except that it usurps everything,
doesn't separate any layers, and just overall means the end and death of
a healthy filesystem system. It wants to be the monopoly.
Post by Zdenek Kabelac
With 'thinp' you want the simplest filesystem with robust metadata - so
in theory - 'ext4' or XFS without all the 'improvements for rotational
hdd' that have accumulated over decades of their evolution.
I agree. I don't even use ext4, I use ext3. I feel ext4 may have some
benefits but they are not really worth anything.
Post by Zdenek Kabelac
You miss the 'key' details.
Thin pool is not constructing 'free-maps' for each LV all the time -
that's why tools like 'thin_ls' are meant to be used from the
user-space.
It IS a very EXPENSIVE operation.
So before you start to present your visions here, please spend some
time reading the docs and understanding all the technology behind it.
Sure I could do that. I could also allow myself to die without ever
having contributed to anything.
Post by Zdenek Kabelac
Post by Xen
Even with a perfect LVM monitoring tool, I would experience a
consistent
lack of feedback.
Mistake of your expectations
It has nothing to do with expectations. Things and feelings that keep
creeping up on you and keep annoying you have nothing to do with
expectations.

That is like saying that being thoroughly annoyed about something for years and
expecting it to go away by itself is the epitome of sanity.

For example: monitor makes buzzing noise when turned off. Deeply
frustrating, annoying, downright bad. Gives me nightmares even. You say
"You have bad expectations of hardware, hardware just does that thing,
you have to live with it." I go to shop, shop says "Yeah all hardware
does that (so we don't need to pay you anything back)".

That has nothing to do with bad expectations.
Post by Zdenek Kabelac
If you are trying to operate thin-pool near 100% fullness - you will
need to write and design completely different piece of software -
sorry thinp
is not for you and never will...
I am not trying to operate near 100% fullness.

Although it wouldn't be bad if I could manage that.

That would not be such a bad thing at all. If the tools were there to
actually do it, and the mechanisms. Wouldn't you agree? Regardless of
what is possible or even what is to be considered "wise" here, wouldn't
it be beneficial in some way?
Post by Zdenek Kabelac
Post by Xen
Just a simple example: I can adjust "df" to do different stuff. But any
program reporting free diskspace is going to "lie" to me in that sense. So
yes I've chosen to use thin LVM because it is the best solution for me
right now.
'df' has nothing in common with 'block' layer.
A clothing retailer has nothing in common with a clothing manufacturer
either, but they are just both in the same business.
Post by Zdenek Kabelac
But if you've never planned to buy 10TB - you should have never allow
to create such big volume in the first place!
So you are like saying the only use case of thin is a growth scenario
that can be met.
Post by Zdenek Kabelac
So don't do it - and don't plan to use it - it's really that simple.
What I was saying was that it would be possible to maintain the contract
that any individual volume at any one time would be able to grow to max
size as long as other volumes don't start acting aberrantly. If you manage
all those volumes of course you would be able to choose this.

The purpose of the thin system is to maintain the situation that all
volumes can reach their full potential without (auto)extending, in that
sense.

If you did actually make a 1TB volume for a single client with a 10TB
V-size, you would be a very bad contractor. Who says it is not going to
happen overnight? How will you be able to respond?

The situation where you have a 10TB volume and you have 20 clients with
1TB each, is very different.

I feel the contract should be that the available real space should
always be equal to or greater than the space available on any one filesystem
(volume).

So: R >= max(A(1), A(2), A(3), ..., A(n))

Of course it is pleasant not having to resize the filesystem but would
you really do that for yourself? Make a 10TB filesystem on a 1TB disk as
you expect to buy more disks in the future?

I mean you could. But in this sense resizing the filesystem (growing it)
is not a very expensive operation, usually.

I would only want to do that if I could limit the actual usage of the
filesystem in a real way.

Any runaway process causing my volume to drop...... NOT a good thing.
Post by Zdenek Kabelac
Actually it's the core principle!
It lies (or better say uses admin's promises) that there is going to
be a disk space. And it's admin responsibility to fulfill it.
The admin never comes into it. What the admin does or doesn't do, what
the admin thinks or doesn't think. These are all interpretations of
intents.

Thinp should function regardless of what the admin is thinking or not.
Regardless of what his political views are.

You are bringing morality into the technical system.

You are saying /thinp should work/ because /the admin should be a good
person/.

When the admin creates the system, no "promise" is ever communicated to
the hardware layer, OR the software layer. You are turning the correct
operation of the machine into a human problem in the way of saying
"/Linux is a great system and everyone can use it, but some people are
just too stupid to spend a few hours reading a manual on a daily basis,
and we can't help that/".

These promises are not there in the system. Someone might be using the
system for reasons you have not envisioned. But the system is there and
it can be used for it. Now if things go wrong you say "You you had the
wrong use case" but a use case is just a use case, it has no morality to
it.

If you build a waterway system that only functions as long as it doesn't
rain (overflowing the channels) then you can say "Well my system is
perfect, it is just God who is a bitch and messes things up".

No you have to take account of real life human beings, not those ideal
pictures of admins that you have.

Stop the idealism you know. Admins are humans and they can be expected
to be humans.

It is you who have wrong expectations of people.

If people mess up they mess up but it is part of the human agenda and
you design for that.
Post by Zdenek Kabelac
If you know in front you will need quickly all the disk space - then
using thinp and expecting miracle is not going to work.
Nobody ever said anything of that kind.
Xen
2016-04-28 18:20:15 UTC
Permalink
Let me just write down some thoughts here.

First of all you say that fundamental OS design is about higher layers
trusting lower layers and that certain types of communications should then
always be one way.

In this case it is about block layer vs. file system layer.

But you make certain assumptions about the nature of a block device to
begin with.

A block device is defined by its access method (i.e. data organized in
blocks) rather than its contiguousness or having an unchanging, "single
block" address or access space. I know this goes pretty far but it is the
truth.

In theory there is nothing against a hypothetical block device offering
ranges of blocks to a higher level (that might never change) or to be
dynamically notified of changes to that address pool.

To a process virtual memory is a space that is transparent to it whether
that space is constructed of paged memory (swap file) or not. At the same
time it is not impossible to imagine that an IO scheduler for swap would
take heed of values given by applications, such as using nice or ionice
values. That would be one way communication though.

In general a higher level should be oblivious to what kind of lower level
layer it is running on, you are right. Yet if all lower levels exhibit the
same kind of features, this point becomes moot, because at that point the
higher level will not be able to know, once more, precisely what kind of
layer it is running on, although it would have more information.

So just theoretically speaking the only thing that is required to be
consistent is the API or whatever interface you design for it.

I think there are many cases where some software can run on some libraries
but not on others because those other libraries do not offer the full
feature set of whatever standard is being defined there. An example is
DLNA/UPNP, these are not layers but the standard is ill-defined and the
device you are communicating with might not support the full set.

Perhaps these are detrimental issues but there are plenty of cases where
one type of "lower level" will suffice but another won't, think maybe of
graphics drivers. Across the layer boundary, communication is two-way
anyway. The block device *does* supply endless streams of data to the
higher layer. The only thing that would change is that you would no longer
have this "always one contigious block of blocks" but something that is
slightly more volatile.

When you "mkfs" the tool reads the size of the block device. Perhaps
subsequently the filesystem is unaware and depends on fixed values.

The feature I described (use case) would allow the set of blocks that is
available, to dynamically change. You are right that this would apparently
be a big departure from the current model.

So I'm not saying it is easy, perfect, or well understood. I'm just saying
I like the idea.

I don't know what other applications it might have but it depends entirely
on correct "discard" behaviour from the filesystem.

The filesystem should be unaware of its underlying device but discard is
never required for rotating disks as far as I can tell. This is an option
that assumes knowledge of the underlying device. From discard we can
basically infer that either we are dealing with a flash device or
something that has some smartness about what blocks it retains and what
not (think cache).

So in general this is already a change that reflects changing conditions
of block devices in general or its availability. And its characteristic
behaviour or demands from filesystems.

These are block devices that want more information to operate (well).

Coincidentally, discard also favours or enhances (possibly) lvmcache.

So it's not about doing something wildly strange here, it's about offering
a feature set that a filesystem may or may not use, or a block device may
or may not offer.

Contrary to what you say, there is nothing inherently bad about the idea.
The OS design principle violation you speak of is principle, not practical
reality. It's not that it can't be done. It's that you don't want it to
happen because it violates your principles. It's not that it wouldn't
work. It's that you don't like it to work because it violates your
principles.

At the same time I object to the notion of the system administrator being
this theoretical vastly differing role/person than the user/client.

We have no in-betweens on Linux. For fun you should do a search of your
filesystem with find -xdev based on the contents of /etc/passwd or
/etc/group. You will find that 99% of files are owned by root and the only
ones that aren't are usually user files in the home directory or specific
services in /var/lib.

Here is a script that would do it for groups:

cut -d: -f1 /etc/group | while read g; do
  printf "%-15s %6d\n" "$g" "$(find / -xdev -type f -group "$g" | wc -l)"
done

Probably. I can't run it here; it might crash my system (live DVD).

Of about 170k files on an OpenSUSE system, 15 were group writable, mostly
due to my own interference probably. Of 170197 files (no xdev) 168161 were
owned by root.

Excluding man and my user, 69 files did not have "root" as the group. Part
of that was again due to my own changes.

At the same time in some debates you are presented with the ludicrous
notion that there is some ideal desktop user who doesn't need to ever see
anything of the internal system. She never opens a shell and certainly
does not come across ethernet device names (for example). The "desktop
user" does not care about the naming of devices from /dev/eth0 to
/sys/class/net/enp3s0.

The desktop user never uses anything other than DHCP, etc. etc. etc.

The desktop user never can configure anything without the help of the
admin, if it is slightly more advanced.

It's that user vs. admin dichotomy that is never true on any desktop
system and I will venture it is not even true on the systems I am a client
of, because you often need to debate stuff with the vendor or ask for
features, offer solutions, etc.

In a store you are a client. There are employees and clients, nothing
else. At the same time I treat these girls as my neighbours because they
work in the block I live in.

You get the idea. Roles can be shifty. A person can use multiple roles at
the same time. He/she can be admin and user simultaneously.

Perhaps you are correct to state that the roles themselves should not be
watered down, that clear delimitations are required.

In your other email you allude to me not ever having done an OS design
course.

Offlist a friendly member suggested strongly I not use personal attacks in
my communications here. But of course this is precisely what you are doing
here, because as a matter of fact I did follow such a course.

I don't remember the book we used because apparently between my house mate
and me we only had one copy and he ended up getting it because I was
usually the one borrowing stuff from him.

At the same time university is way beyond my current reach (in living
conditions) so it is just an unwarranted allusion that does not have
anything to do with anything really.

Yes I think it was the dinosaur book:

Operating System Concepts by Silberschatz, Galvin and Gagne

Anyway, irrelevant here.
Post by matthew patton
Another way (haven't tested) to 'signal' the FS as to the true state of
the underlying storage is to have a sparse file that gets shrunk over
time.
You do realize you are trying to find ways around the limitation you just
imposed on yourself right?
Post by matthew patton
The system admin decided it was a bright idea to use thin pools in the
first place so he necessarily signed up to be liable for the hazards and
risks that choice entails. It is not the job of the FS to bail his ass
out.
I don't think thin pools are that risky or should be that risky. They do
incur a management overhead compared to static filesystems because of
adding that second layer you need to monitor. At the same time the burden
of that can be lessened with tools.

As it stands I consider thin LVM the only reasonable way to snapshot a
running system without dedicating specific space to it in advance. I could
expect snapshotting to require stuff to be in the same volume group.
Without LVM thin, snapshotting requires making at least some prior
investment in having a snapshot device ready for you in the same VG,
right?
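To illustrate the difference, a rough sketch with placeholder names:

    # classic snapshot: a fixed COW area must be reserved in the VG up front
    lvcreate -s -L 5G -n nightly_snap vg/data
    # thin snapshot: nothing reserved in advance, it shares the thin-pool
    lvcreate -s -n nightly_snap vg/thindata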
Post by matthew patton
Then you want an integrated block+fs implementation. See BTRFS and ZFS. WAFL and friends.
But btrfs is not without complexity. It uses subvolumes that differ from
distribution to distribution as each makes its own choice. It requires
knowledge of more complicated tools and mechanics to do the simplest (or
most meaningful) of tasks. Working with LVM is easier. I'm not saying LVM
is perfect and....

Using snapshotting as a backup measure is something that seems risky to me
in the first place because it is a "partition table" operation which
really you shouldn't be doing on a regular basis. So in order to
effectively use it in the first place you require tools that handle the
safeguards for you. Tools that make sure you are not making some command
line mistake. Tools that simply guard against misuse.

Regular users are not fit for being btrfs admins either.

It is going to confuse the hell out of people, seeing as that is what their
systems run on, if they are introduced to some of the complexity of it.

You say swallow your pride. It has not much to do with pride.

It has to do with ending up in a situation I don't like. That is then
going to "hurt" me for the remainder of my days until I switch back or get
rid of it.

I have seen NOTHING NOTHING NOTHING inspiring about btrfs.

Not having partition tables and sending volumes across space and time to
other systems, is not really my cup of tea.

It is a vendor lock-in system and would result in other technologies being
lesser developed.

I am not alone in this opinion either.

Btrfs feels like a form of illness to me. It is living in a forest with
all deformed trees, instead of something lush and inspiring. If you've
ever played World of Warcraft, the only thing that comes a bit close is
the Felwood area ;-).

But I don't consider it beyond Plaguelands either.

Anyway.

I have felt like btrfs in my life. They have not been the happiest moments
of my life ;-).

I will respond more in another mail, this is getting too long.
matthew patton
2016-04-28 13:46:03 UTC
Permalink
Post by Marek Podmaka
Post by Xen
The real question you should be asking is if it increases the monitoring
aspect (enhances it) if thin pool data is seen through the lens of the
filesystems as well.
Then you want an integrated block+fs implementation. See BTRFS and ZFS. WAFL and friends.
Post by Marek Podmaka
kernel for communication from lower fs layers to higher layers -
Correct. Because doing so violates the fundamental precepts of OS design. Higher layers trust lower layers. Thin pools are outright lying about the real world to anything that uses their services. That is their purpose. The FS doesn't give a damn that the block layer is lying to it; it can and does assume, and rightly so, that what the block layer says it has, it indeed does have. The onus of keeping the block layer ahead of the FS falls on a third party - the system admin. The system admin decided it was a bright idea to use thin pools in the first place, so he necessarily signed up to be liable for the hazards and risks that choice entails. It is not the job of the FS to bail his ass out.

A responsible sysadmin who chose to use thin pools might configure the initial FS size to be some modest size well within the constraints of the actual block store, and then, as the FS hits say 85% utilization, run a script that investigates the state of the block layer and uses resize2fs and friends to grow the FS, letting the thin-pool likewise grow to fit as IO gets issued. But at some point, when the competing demands of other FSes on the thin-pool were set to breach actual block availability, the script would refuse to grow the FS, and thus userland would get signaled by the FS layer that it's out of space when it hit 100% util.
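Roughly along these lines, presumably (an untested sketch; the volume names, mount
point and thresholds are invented for illustration):

    #!/bin/sh
    # grow the fs by 10G once it passes 85% full, but only while the
    # thin-pool still has comfortable headroom (here: under 70% used)
    FS_USE=$(df --output=pcent /mnt/data | tail -n1 | tr -dc '0-9')
    POOL_USE=$(lvs --noheadings -o data_percent vg/pool | tr -d ' ' | cut -d. -f1)
    if [ "$FS_USE" -ge 85 ] && [ "$POOL_USE" -lt 70 ]; then
        lvextend -L +10G vg/data && resize2fs /dev/vg/data
    fi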

Another way (haven't tested) to 'signal' the FS as to the true state of the underlying storage is to have a sparse file that gets shrunk over time.

But either way if you have a sudden burst of I/O from competing interests in the thin-pool, what appeared to be a safe growth allocation at one instant of time is not likely to be true when actual writes try to get fulfilled.

Mindless use of thin-pools is akin to crossing a heavily mined beach. Bring a long stick and say your prayers because you're likely going to lose a limb.
matthew patton
2016-05-03 12:00:45 UTC
Permalink
written as required. If the file system has particular areas
of importance that need to be writable to prevent file
system failure, perhaps the file system should have a way of
communicating this to the volume layer. The naive approach
here might be to preallocate these critical blocks before
proceeding with any updates to these blocks, such that the
failure situations can all be "safe" situations,
where ENOSPC can be returned without a danger of the file
system locking up or going read-only.
Why all of a sudden does each and every FS have to have this added code to second-guess the block layer? The quickest solution is to mount the FS in sync mode. Go ahead and pay the performance piper. It's still not likely to be bulletproof but it's a sure step closer.
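For reference, that is a one-liner (the mount point is a placeholder):

    mount -o remount,sync /mnt/thinfs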

What you're saying is that when mounting a block device the layer needs to expose a "thin-mode" attribute (or the sysadmin sets such a flag via tune2fs). Something analogous to how mke2fs can "detect" LVM raid mode geometry (does that actually work reliably?).

Then there has to be code in every FS block de-stage path:

    IF thin {
        tickle block layer to allocate the block (aka write zeros to it? - what
        about pre-existing data, is there a "fake write" BIO call that does
        everything but actually write data to a block but would otherwise
        trigger LVM thin's extent allocation logic?)
        IF success, destage dirty block to block layer
        ELSE inform userland of ENOSPC
    }

In a fully journal'd FS (metadata AND data) the journal could be 'pinned' and likewise the main metadata areas, if for no other reason than that they are zero'd at the onset and/or constantly being written to. Once written to, LVM thin isn't going to go back and yank away an allocated extent.

This at least should maintain FS integrity albeit you may end up in a situation where the journal can never get properly de-staged, so you're stuck on any further writes and need to force RO.
just want a sanely behaving LVM + XFS...)
IMO if the system admin made a conscious decision to use thin AND overprovision (thin by itself is not dangerous), it's up to HIM to actively manage his block layer. Even on million dollar SANs the expectation is that the engineer will do his job and not drop the mic and walk away. Maybe the "easiest" implementation would be a MD layer job that the admin can tailor to fail all allocation requests once extent count drops below a number and thus forcing all FS mounted on the thinpool to go into RO mode.
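A rough approximation of that policy with today's tools might look like this
(untested; the names and the 95% threshold are invented):

    #!/bin/sh
    # periodic job: when the thin-pool is nearly exhausted, flip every
    # filesystem mounted from it to read-only instead of risking a lock-up
    POOL_USE=$(lvs --noheadings -o data_percent vg/pool | tr -d ' ' | cut -d. -f1)
    if [ "$POOL_USE" -ge 95 ]; then
        for mp in /mnt/thinfs1 /mnt/thinfs2; do
            mount -o remount,ro "$mp"
        done
    fi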

But in any event it won't prevent irate users from demanding why the space they appear to have isn't actually there.
Xen
2016-05-03 14:38:38 UTC
Permalink
Just want to respond to this just to make things clear.
Post by matthew patton
why all of a sudden does each and every FS have to have this added
code to second guess the block layer? The quickest solution is to
mount the FS in sync mode. Go ahead and pay the performance piper.
It's still not likely to be bullet proof but it's a sure step closer.
Why would anyone do what you don't want to do. Don't suggest solutions
you don't even want yourself. That goes for all of you (Zdenek mostly).

And it is not second-guessing. Second-guessing is what it is doing
currently. If you have actual information from the block layer, you
don't NEED to second-guess.

Isn't that obvious?
Post by matthew patton
What you're saying is that when mounting a block device the layer
needs to expose a "thin-mode" attribute (or the sysdmin sets such a
flag via tune2fs). Something analogous to mke2fs can "detect" LVM raid
mode geometry (does that actually work reliably?).
Not necessarily. It could be transparent if these were actual available
features as part of a feature set. The features would individually be
able to be turned on and off, not necessarily calling it "thin".
Post by matthew patton
IF thin {
tickle block layer to allocate the block (aka write zeros to it? -
what about pre-existing data, is there a "fake write" BIO call that
does everything but actually write data to a block but would otherwise
trigger LVM thin's extent allocation logic?)
IF success, destage dirty block to block layer ELSE
inform userland of ENOSPC
}
What Mark suggested is not actually so bad. Preallocating means you have
to communicate in some way to the user that space is going to run out.
My suggestion would have been and still is in that sense to simply do
this by having the filesystem update the amount of free space.
Post by matthew patton
This at least should maintain FS integrity albeit you may end up in a
situation where the journal can never get properly de-staged, so
you're stuck on any further writes and need to force RO.
I'm glad you think of solutions.
Post by matthew patton
IMO if the system admin made a conscious decision to use thin AND
overprovision (thin by itself is not dangerous)
Again, that is just nonsense. There is not a person alive who wants to
use thin for something that is not overprovisioning, whether it be
snapshots or client sharing.

You are trying to get away with "hey, you chose it! now sucks if we
don't actually listen to you! hahaha."

SUCKER!!!!.

No, the primary use case for thin is overprovisioning.
Post by matthew patton
, it's up to HIM to
actively manage his block layer.
Block layer doesn't come into play with it.

You are separating "main admin task" and "local admin task".

What I mean is that there are different roles. Even if they are the same
person, they are different tasks.

Someone writing software, his task is to ensure his software keeps
working given failure conditions.

This software writer, even if it is the same person, cannot be expected
to at that point be thinking of LVM block allocation. These are
different things.

You communicate with the layers you communicate with. You don't go
around that.

When you write a system that is supposed to be portable, for instance,
you do not start depending on other features, tools or layers that are
out of reach the moment your system or software is deployed somewhere
else.

Filesystem communication is available to all applications. So any
application designed for a generic purpose of installment is going to be
wanting to depend on filesystem tools, not block layer tools.

You people apparently don't understand layering very well OR you would
never recommend avoiding an intermediate layer (the filesystem) to go
directly to the lower level (the block layer) for ITS admin tools.

I mean, are you insane? You (Zdenek mostly) are so much about not mixing
layers, but then it is alright to go around them?

A software tool that is meant to be redeployable and should be able to
depend on a minimalist set of existing features in the direct layer it
is interfacing with, but still wants to use whatever is available given
circumstances that dictate that it wouldn't harm its redeployability,
would never choose to acquire and use the more remote and more
uncertain set (such as LVM) when it could also be using directly
available measures (such as free disk space, as a crude measure) that
are available on ANY system provided that yes indeed, there is some
level of sanity to it.

If you ARE deployed on thin and the filesystem cannot know about actual
space then you are left in the dark, you are left blind, and there is
nothing you can do as a systems programmer.
Post by matthew patton
Even on million dollar SANs the
expectation is that the engineer will do his job and not drop the mic
and walk away.
You constantly focus on the admin.

With all of this hotshot and idealist behaviour about layers you are
espousing, you actually advocate going around them completely and using
whatever deepest-layer or most-impact solution that is available (LVM)
in order to troubleshoot issues that should be handled by interfacing
with the actual layer you always have access to.

It is not just about admins. You make this about admins as if they are
solely responsible for the entire system.
Post by matthew patton
Maybe the "easiest" implementation would be a MD layer job that the
admin can tailor to fail all allocation requests once
extent count drops below a number and thus forcing all FS mounted on
the thinpool to go into RO mode.
A real software engineer doesn't go for the easiest solution or
implementation. I am not approaching this from the perspective of an
admin exclusively. I am also, and more importantly, a software
programmer who wants to use systems that are going to work regardless
of the peculiarities of an implementation or system I have to work on,
and I don't leave it to the admin of said system to do all my tasks.

As a programmer I cannot decide that the admin is going to be the perfect
human being you so much want to believe in, because that's what you
think you are: that amazing admin who never fails to take
account of available disk space.

But that's a moron position.

If I am to write my software, I cannot depend on bigger-scale or
outer-level solutions to always be in place. I cannot offload my
responsibilities to the admin.

You are insisting here that layers (administration layers and tasks) are
mixed and completely destroyed, all in the sense of not doing that to
the software itself?

Really?

Most importantly if I write any system that cannot depend on LVM being
present, then NO THOSE TOOLS ARE NOT AVAILABLE TO ME.

"Why don't you just use LVM?" well fuck off.

I am not that admin. I write his system. I don't do his work.

Yet I still have the responsibility that MY component is going to work
and not give HIM headaches. That's real life for you.

Even if in actuality I might be imprisoned with broken feet and arms, I
still care about this and I still do this work in a certain sense.

And yes I utterly care about modularity in software design. I understand
layers much better than you do if you are able or even capable of
suggesting such solutions.

Communication between layers does not necessarily integrate the layers
if those interfaces are well defined and allow for modular "changing" of
the chosen solution.

I recognise full well that there is integration and that you do get a
working together. But that is the entire purpose of it. To get the two
things to work together more. But that is the whole gist of having
interfaces and APIs in the first place.

It is for allowing stuff to work together to achieve a higher goal than
they could achieve if they were just on their own.

While recognising where each responsibility lies.

BLOCK LAYER <----> BLOCK LAYER ADMIN
FILESYSTEM LAYER <----> FILESYSTEM LAYER ADMIN
APPLICATION LAYER <---> APPLICATION WRITER.

Me, the application writer, cannot be expected to deal with number one,
the block layer.

At the same time I need tools to do my work. I also cannot go to any
random block layer admin my system might get deployed on (who's to say I
will be there?) and beg for him to spend ample amount of time designing
his systems from scratch so that even if my software fails, it won't
hurt anyone.

But without information on available space I might not be able to do
anything.

Then what happens is that I have to design for this uncertainty.

Then what happens is that I (with a capital IIIII) start allocating space
in advance, as a software developer making applications for systems that
might, I don't know, run at banks or whatever. Just saying something.

Yes now this task is left to the software designer making the
application.

Now I have to start allocating buffers to ensure graceful shutdown or
termination, for instance.

I might, for instance, allocate a block file, and if writes to the
filesystem start to fail or the filesystem becomes read-only, I might
still be in trouble, not being able to write to it ;-). So I might start
thinking about kernel modules that I can redeploy with my system to
ensure graceful shutdown or even continued operation. I might decide
that files mounted as loopback are going to stay writable even if the
filesystem they reside on is now read-only. I am going to ensure these
are not sparse files and that the entire file is written to and grown
in advance, so that my writes start to look like real block device
writes. Then I'm just going to patch the filesystem or the VFS to
allow writes to these files, even if it comes with the performance hit
of additional checks.
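
To make that concrete, here is a minimal sketch of such a preallocated,
non-sparse reserve file. The path and size are made up for illustration,
and the assumption is that writing every byte once (rather than relying
on fallocate alone) is what forces both the filesystem and the thin pool
underneath to actually back those blocks:

import os

RESERVE_PATH = "/var/lib/myapp/emergency.reserve"  # hypothetical location on the thin volume
RESERVE_SIZE = 64 * 1024 * 1024                    # hypothetical size: 64 MiB
CHUNK = 1024 * 1024

def preallocate_reserve(path=RESERVE_PATH, size=RESERVE_SIZE):
    # Write real zeros over the whole range so the file is not sparse and,
    # on a thin volume, so the pool has already allocated extents for it.
    zeros = b"\0" * CHUNK
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(remaining, CHUNK)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # make sure the allocation really reached the device

if __name__ == "__main__":
    preallocate_reserve()

Whether a later write into this file still succeeds once the filesystem
has been remounted read-only is exactly the open question raised above;
the sketch only covers the "grown in advance, not sparse" part.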

And hope that the entire volume does not get frozen by LVM.

But that the kernel or security scripts just remount it read-only.

That is then the best solution for my needs in that circumstance.
Just saying, you know.

It's not all exclusively about admins working with LVM directly.
Post by matthew patton
But in any event it won't prevent irate users from demanding why the
space they appear to have isn't actually there.
If that is your life I feel sorry for you.

I just do.
Mark Mielke
2016-05-04 01:25:11 UTC
Permalink
Post by matthew patton
written as required. If the file system has particular areas
of importance that need to be writable to prevent file
system failure, perhaps the file system should have a way of
communicating this to the volume layer. The naive approach
here might be to preallocate these critical blocks before
proceeding with any updates to these blocks, such that the
failure situations can all be "safe" situations,
where ENOSPC can be returned without a danger of the file
system locking up or going read-only.
Why, all of a sudden, does each and every FS have to have this added code to
second guess the block layer? The quickest solution is to mount the FS in
sync mode. Go ahead and pay the performance piper. It's still not likely to
be bullet proof but it's a sure step closer.
Not all of a sudden. From an "at work" perspective, LVM thinp as a technology
is relatively recent, and only recently being deployed in more places as we
migrate our systems from RHEL 5 to RHEL 6 to RHEL 7. I didn't consider
thinp an option before RHEL 7, and I didn't consider it stable even in RHEL
7 without significant testing on our part.
From an "at home" perspective, I have been using LVM thinp from the day it
was available in a Fedora release. The previous snapshot model was
unusable, and I wished upon a star that a better technology would arrive. I
tried BTRFS and while it did work - it was still marked as experimental, it
did not have the exact same behaviour as EXT4 or XFS from an applications
perspective, and I did encounter some early issues with subvolumes.
Frankly... I was happy to have LVM thinp, and glad that you LVM developers
provided it when you did. It is excellent technology from my perspective.
But, "at home", I was willing to accept some loose edge case behaviour. I
know when I use storage on my server at home, and if it fails, I can accept
the consequences for myself.

"At work", the situation is different. These are critical systems that I am
betting LVM on. As we begin to use it more broadly (after over a year of
success in hosting our JIRA + Confluence instances on local flash using LVM
thinp for much of the application data including PostgreSQL databases). I
am very comfortable with it from a "< 80% capacity" perspective. However,
every so often it passes 80%, and I have to raise the alarm, because I know
that there are edge cases that LVM / DM thinp + XFS don't handle quite so
well. It's never happened in production yet, but I've seen it happen many
times on designer desktops when they are using LVM, and they lock up their
system and require a system reboot to recover from.

I know there are smart people working on Linux, and smart people working on
LVM. Given the opportunity, and the perspective, I think the worst of these
cases are problems that deserve to be addressed, and probably ones that people
have been working on with or without my contributions to the subject.
Post by matthew patton
What you're saying is that when mounting a block device the layer needs to
expose a "thin-mode" attribute (or the sysadmin sets such a flag via
tune2fs). Something analogous to how mke2fs can "detect" LVM raid mode
geometry (does that actually work reliably?).
IF thin {
    tickle block layer to allocate the block (aka write zeros to it? - what
    about pre-existing data, is there a "fake write" BIO call that does
    everything but actually write data to a block but would otherwise trigger
    LVM thin's extent allocation logic?)
    IF success, destage dirty block to block layer
    ELSE inform userland of ENOSPC
}
In a fully journal'd FS (metadata AND data) the journal could be 'pinned',
and likewise the main metadata areas, if for no other reason than that they
are zero'd at the onset and/or constantly being written to. Once written to,
LVM thin isn't going to go back and yank away an allocated extent.
Yes. This is exactly the type of solution I was thinking of, including
pinning the journal! You used the correct terminology. I can read the terms
but not write them. :-)

You also managed to summarize it in only a few lines of text. As concepts
go, I think that makes it not-too-complex.

But, the devil is often in the details, and you are right that this is a
per-file system cost.

Balancing this, however, I am perhaps presuming that *all* systems will
eventually be thin volume systems, and that correct behaviour and highly
available behaviour will eventually require that *all* systems invest in
technology such as this. My view of the future is that fixed-size thick
partitions are very often a solution that is compromised from the start.
Most systems of significance grow over time, and the pressure to reduce
cost is real. I think we are taking baby steps to start, but that the
systems of the future will be thin volume systems. I see this as a problem
that needs to be understood and solved, except in the most limited of use
cases. This is my opinion, which I don't expect anybody to share.
Post by matthew patton
This at least should maintain FS integrity albeit you may end up in a
situation where the journal can never get properly de-staged, so you're
stuck on any further writes and need to force RO.
Interesting to consider. I don't see this as necessarily a problem - or
that it necessitates "RO" as a persistent state. For example, it would be
most practical if sufficient room was reserved to allow for content to be
removed, allowing for the file system to become unwedged and become "RW"
again. Perhaps there is always an edge case that would necessitate a
persistent "RO" state that requires the volume be extended to recover from,
but I think the edge case could be refined to something that will tend to
never happen?
Post by matthew patton
just want a sanely behaving LVM + XFS...)
IMO if the system admin made a conscious decision to use thin AND
overprovision (thin by itself is not dangerous), it's up to HIM to actively
manage his block layer. Even on million dollar SANs the expectation is that
the engineer will do his job and not drop the mic and walk away. Maybe the
"easiest" implementation would be a MD layer job that the admin can tailor
to fail all allocation requests once extent count drops below a number and
thus forcing all FS mounted on the thinpool to go into RO mode.
Another interesting idea. I like the idea of automatically shutting down
our applications or PostgreSQL database if the thin pool reaches an unsafe
allocation, such as 90% or 95%. This would ensure the integrity of the
data, at the expense of an outage. This is something we could implement
today. Thanks.
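
A minimal sketch of such a guard, assuming it runs as root from cron or a
systemd timer; the VG/pool name, the threshold and the service name are
placeholders rather than anything from this thread, and it relies only on
`lvs -o data_percent` and `systemctl`:

import subprocess

POOL = "vg0/thinpool"   # hypothetical volume group / thin pool
SERVICE = "postgresql"  # hypothetical service to stop
THRESHOLD = 90.0        # percent of thin pool data space considered unsafe

def pool_data_percent(pool=POOL):
    # `lvs --noheadings -o data_percent vg/pool` prints the pool's data usage.
    out = subprocess.check_output(
        ["lvs", "--noheadings", "-o", "data_percent", pool], text=True)
    return float(out.strip().replace(",", "."))  # tolerate locale decimal commas

def main():
    used = pool_data_percent()
    if used >= THRESHOLD:
        # Stop the service cleanly while writes can still succeed,
        # trading an outage for data integrity.
        subprocess.check_call(["systemctl", "stop", SERVICE])
        print("%s at %.1f%%: stopped %s" % (POOL, used, SERVICE))
    else:
        print("%s at %.1f%%: ok" % (POOL, used))

if __name__ == "__main__":
    main()

Stopping the service is the bluntest possible response; the same check
could instead raise an alert, prune snapshots, or extend the pool.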
Post by matthew patton
But in any event it won't prevent irate users from demanding why the space
they appear to have isn't actually there.
Users will always be irate. :-) I mostly don't consider that as a real
factor in my technical decisions... :-)

Thanks for entertaining this discussion, Matthew and Zdenek. I realize this
is an open source project, with passionate and smart people, whose time is
precious. I don't feel I have the capability of really contributing code
changes at this time, and I'm satisfied that the ideas are being considered
even if they ultimately don't get adopted. Even the mandatory warning about
snapshots exceeding the volume group size is something I can continue to
deal with using scripting and filtering. I mostly want to make sure that my
perspective is known and understood.
--
Mark Mielke <***@gmail.com>
Xen
2016-05-04 18:16:41 UTC
Permalink
Post by Mark Mielke
Thanks for entertaining this discussion, Matthew and Zdenek. I realize
this is an open source project, with passionate and smart people,
whose time is precious. I don't feel I have the capability of really
contributing code changes at this time, and I'm satisfied that the
ideas are being considered even if they ultimately don't get adopted.
Even the mandatory warning about snapshots exceeding the volume group
size is something I can continue to deal with using scripting and
filtering. I mostly want to make sure that my perspective is known and
understood.
You know, you really don't need to be this apologetic even if I mess up
my own replies ;-).

I think you have a right and a reason to say what you've said, and
that's it.

matthew patton
2016-05-03 13:01:30 UTC
Permalink
On Mon, 5/2/16, Mark Mielke <***@gmail.com> wrote:

<quote>
very small use case in reality. I think large service
providers would use Ceph or EMC or NetApp, or some such
technology to provision large amounts of storage per
customer, and LVM would be used more at the level of a
single customer, or a single machine.
</quote>

Ceph?!? yeah I don't think so.

If you thin-provision an EMC/Netapp volume and the block device runs out of blocks (aka Raid Group is full) all volumes on it will drop OFFLINE. They don't even go RO. Poof, they disappear. Why? Because there is no guarantee that every NFS client, every iSCSI client, every FC client is going to do the right thing. The only reliable means of telling everyone "shit just broke" is for the asset to disappear.

All in-flight writes to the volume that the array ACK'd are still good even if they haven't been de-staged to the intended device thanks to NVRAM and the array's journal device.

<quote>
In these cases, I
would expect that LVM thin volumes should not be used across
multiple customers without understanding the exact type of
churn expected, to understand what the maximum allocation
that would be required.
</quote>

Sure, but that spells responsible sysadmin. Xen's post implied he didn't want to be bothered to manage his block layer, and that magically it was the FS' job to work closely with the block layer to suss out when it was safe to keep accepting writes. There's an answer to "works closely with block layer" - it's spelled BTRFS and ZFS.

LVM has no obligation to protect careless sysadmins doing dangerous things from themselves. There is nothing wrong with using THIN every which way you want just as long as you understand and handle the eventuality of extent exhaustion. Even thin snaps go invalid if they need to track a change and can't allocate space for the 'copy'.

Responsible usage has nothing to do with single vs multiple customers. Though Xen broached the 'hosting' example and in the cut-rate hosting business over-provisioning is rampant. It's not a problem unless the sysadmin drops the ball.
Amazon would make sure to have enough storage to meet my requirement if I need them.
Yes, because Amazon is a RESPONSIBLE sysadmin and has put in place tools to manage the fact that they are thin-provisioning and to make damn sure they can cash the checks they are writing.

 
the nature of the block device, such as "how much space
do you *really* have left?"
So you're going to write and then backport "second guess the block layer" code to all filesystems in common use and god knows how many versions back? Of course not. Just try to get on the EXT developer mailing list and ask them to write "block layer second-guessing code (aka branch on device flag=thin)" because THINP will cause problems for the FS when it runs out of extents. To which the obvious and correct response will be "Don't use THINP if you're not prepared to handle its prerequisites."
you and the other people. You think the block storage should
be as transparent as possible, as if the storage was not
thin. Others, including me, think that this theory is
impractical
Then by all means go ahead and retrofit all known filesystems with the extra logic. ALL of the filesystems were written with the understanding that the block layer is telling the truth and that any "white lie" was benign in so much that it would be made good and thus could be assumed to be "truth" for practical purpose.
Xen
2016-05-03 15:47:05 UTC
Permalink
Post by matthew patton
Ceph?!? yeah I don't think so.
Mark's argument was nothing about comparing feature sets or something at
this point. So I don't know what you are responding to. You respond like
a bitten bee.

Read again. Mark Mielke described actual present-day positions. He
described what he thinks is how LVM is positioning itself in conjunction
with and with regards to other solutions in industry. He described that
to his mind the bigger remote storage solutions do not or would not
easily or readily start using LVM for those purposes, while the smaller
scale or more localized systems would.

He described a layering solution that you seem to be allergic to. He
described a modularized system where thin is being used both at the
remote backend (using a different technology) and at the local end
(using LVM) for different purposes but achieving much of the same
results.

He described how he considered the availability of the remote pool a
responsibility of that remote supplier (and paying good money for it),
while having different use cases for LVM thin himself or themselves.

And I did think he made a very good case for this. I absolutely believe
his use case is the most dominant and important one for LVM. LVM is for
local systems.

In this case it is a local system running storage on a remote backend.
Yet the local system has different requirements and uses LVM thin for a
different purpose.

And this purpose falls along the lines of having cheap and freely
available snapshots.

And he still feels and believes, apparently, that using the LVM admin
tools for ensuring the stability of his systems might not be the most
attractive and functional thing to do.

You may not agree with that but it is what he believes and feels. It is
a real life data point, if you care about that.

Sometimes people's opinions simply inform you about the
world. It is information. It is not something to fight or disagree with;
it is something to take note of.

The better you are able to respond to these data points, the better you
are aware of the system you are dealing with. That could be real people
paying or not paying you money.

However if you are going to fight every opinion that disagrees with you,
you will never get to the point of actually realizing that they are just
opinions and they are a wealth of information if you'd make use of it.

And that is not a devious thing to do if you're thinking that. It is
being aware. Nothing more, nothing less.

And we are talking about awareness here. Not surprising, then, that the
people most vehemently opposing this also seem to be the people least
aware of the fact that real people with real use cases might find the
current situation impractical.

Mr. Zdenek can say all he wants that the current situation is very
practical.

If that is not a data point but an opinion (not of someone experiencing
it, but someone who wants certain people to experience certain things)
then we must listen to actual data points and not what he wants.

Mr. Zdenek (I haven't responded to him here now) also responds like a
bitten bee to simple allusions that Red Hat might be thinking this or
that.

Not just stung by a bee. A bee getting stung ;-).

I mean come on, people. You have nothing to lose. Either it is a good
idea or it isn't. If it gets support, maybe someone will implement it
and deliver a proof of concept. But if you go about shooting it down the
moment it rears its ugly (or beautiful) head, you also ensure that that
developer time is not going to be spent on it even if it were an asset
to you.

Someone discussing a need is not always someone who, in the end, will
do nothing about it himself.

You are trying to avoid work but in doing so you avoid work being done
for you as well.

It's give or take, it's plus plus.

Don't kill other people's ideas and maybe they start doing work for you
too.

Oh yeah. Sorry if I'm being judgmental or belligerent (or pedantic):

The great irony and tragedy of the Linux world is this:




Someone comes with a great idea that he/she believes in and wants to
work on.

They shoot it down.

Next they complain why there are so very few volunteers.



They can ban someone from a mailing list one instant and, the next, wonder
out loud how they can attract more interest to their system.




Not unrelated.
Post by matthew patton
sure, but that spells responsible sysadmin. Xen's post implied he
didn't want to be bothered to manage his block layer that magically
the FS' job was to work closely with the block layer to suss out when
it was safe to keep accepting writes. There's an answer to "works
closely with block layer" - it's spelled BTRFS and ZFS.
It is not my block layer. I'm not the fucking system admin.

I can only talk to the FS. Or that might very well be the case for my
purposes here.

It is pretty amazing that any attempt to separate responsibilities in
actuality is met with a rebuttal that insists one use a solution that
mingles everything.

In your ideal world then, everyone is forced to use BTRFS/ZFS because at
least these take the worries away from the software/application
designer.

And you ensure a beautiful world without LVM because it has no purpose.

As a software developer I cannot depend on your magical solution and
assertion that every admin out there is going to be this amazing person
who never makes a mistake.
Post by matthew patton
Responsible usage has nothing to do with single vs multiple customers.
Though Xen broached the 'hosting' example and in the cut-rate hosting
business over-provisioning is rampant. It's not a problem unless the
sysadmin drops the ball.
What if I want him to be able to drop the ball and still survive?

What about designing systems that are actually failsafe and resilient?

What about resilience?

What about goodness?

What about quality?

What about good stuff?

Why do you feed your admins bad stuff just so that they can shine and
consider themselves important?
Post by matthew patton
So you're going to write and then backport "second guess the block
layer" code to all filesystems in common use and god knows how many
versions back? Of course not. Just try to get on the EXT developer
mailing list and ask them to write "block layer second-guessing code
(aka branch on device flag=thin)" because THINP will cause problems
for the FS when it runs out of extents. To which the obvious and
correct response will be "Don't use THINP if you're not prepared to
handle its prerequisites."
So you are basically suggesting a solution that you know will fail, but
still you recommend it.

That spells out "I don't know how to achieve my goals" like no other
thing.

But you still think people should follow your recommendations.

What you say is completely anathema to how the open source world works.

You do not ask people to do your work for you.

Why do you even insist on recommending that? And then, when you (in your
imagination here) do ask those people to do it for you, they refuse. Small
wonder.

Still you consider that a good way to approach things. To depend on
someone else to do your work for you.

Really.

"Of course not. Just try to get on the EXT developer mailing list and
ask them to..."

Yes I am ridiculing you.

You were sincere in saying those words. You ridicule yourself.

Of course you would start designing patches and creating a workable
solution, with yourself as the main leader or catalyst of that project.
There is no other way to do things in life. You should know that.
Post by matthew patton
Then by all means go ahead and retrofit all known filesystems with the
extra logic. ALL of the filesystems were written with the
understanding that the block layer is telling the truth and that any
"white lie" was benign in so much that it would be made good and thus
could be assumed to be "truth" for practical purpose.
Maybe we should also retrofit all unknown filesystems and those that
might be designed on different planets. Yeah, that would be a good way
to approach things.

I really want to follow your recommendations here. If I do, I will have
good chances of achieving success.
Mark Mielke
2016-05-04 00:56:40 UTC
Permalink
Post by matthew patton
<quote>
very small use case in reality. I think large service
providers would use Ceph or EMC or NetApp, or some such
technology to provision large amounts of storage per
customer, and LVM would be used more at the level of a
single customer, or a single machine.
</quote>
Ceph?!? yeah I don't think so.
I don't use Ceph myself. I only listed it as it may be more familiar to
others, and because I was responding to a Red Hat engineer. We use NetApp
and EMC for the most part.
Post by matthew patton
If you thin-provision an EMC/Netapp volume and the block device runs out
of blocks (aka Raid Group is full) all volumes on it will drop OFFLINE.
They don't even go RO. Poof, they disappear. Why? Because there is no
guarantee that every NFS client, every iSCSI client, every FC client is
going to do the right thing. The only reliable means of telling everyone
"shit just broke" is for the asset to disappear.
I think you are correct. Based upon experience, I don't recall this ever
happening, but upon reflection, it may just be that our IT team always
caught the situation before it became too bad, and either extended the
storage, or asked permission to delete snapshots.
Post by matthew patton
All in-flight writes to the volume that the array ACK'd are still good
even if they haven't been de-staged to the intended device thanks to NVRAM
and the array's journal device.
Right. A good feature. An outage occurs, but the data that was properly
written stays written.


<quote>
Post by matthew patton
In these cases, I
would expect that LVM thin volumes should not be used across
multiple customers without understanding the exact type of
churn expected, to understand what the maximum allocation
that would be required.
</quote>
Sure, but that spells responsible sysadmin. Xen's post implied he didn't
want to be bothered to manage his block layer, and that magically it was the
FS' job to work closely with the block layer to suss out when it was safe to
keep accepting writes. There's an answer to "works closely with block
layer" - it's spelled BTRFS and ZFS.
I get a bit lost here in the push towards BTRFS and ZFS for people with
these expectations as I see BTRFS and ZFS as having a similar problem. They
can both still fill up. They just might get closer to 100% utilization
before they start to fail.

My use case isn't about reaching closer to 100% utilization. For example,
when I first proposed our LVM thinp model for dealing with host-side
snapshots, there were people in my team who felt that "fstrim" should be
run very frequently (even every 15 minutes!), so as to make maximum use of
the available free space across multiple volumes and reduce churn captured
in snapshots. I think anybody with this perspective really should be
looking at BTRFS or ZFS. Myself, I believe fstrim should run once a week or
less, and not really to save space, but more to hint to the flash device
which blocks are definitely not in use, to make the best use of
the flash storage over time. If we start to pass 80%, I raise the alarm
that we need to consider increasing the local storage, or moving more
content out of the thin volumes. Usually we find out that more-than-normal
churn occurred, and we just need to prune a few snapshots to drop below 50%
again. I still made them move the content that doesn't need to be snapshotted
out of the thin volume, and onto a stand-alone LVM thick volume, so as to
entirely eliminate this churn from being trapped in snapshots and
accumulating.
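
As an aside, the weekly-fstrim policy described here is easy to script; a
minimal sketch, with hypothetical mount points, meant to be run from a
weekly cron job or systemd timer as root:

import subprocess

MOUNT_POINTS = ["/srv/jira", "/srv/confluence"]  # hypothetical thin-volume mounts

def trim(mountpoint):
    # `fstrim -v` discards unused filesystem blocks and reports how much was
    # trimmed, letting the thin pool (and the flash underneath) reclaim them.
    subprocess.check_call(["fstrim", "-v", mountpoint])

if __name__ == "__main__":
    for mp in MOUNT_POINTS:
        trim(mp)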


Post by matthew patton
LVM has no obligation to protect careless sysadmins doing dangerous things
from themselves. There is nothing wrong with using THIN every which way you
want just as long as you understand and handle the eventuality of extent
exhaustion. Even thin snaps go invalid if they need to track a change and
can't allocate space for the 'copy'.
Right.
Post by matthew patton
Amazon would make sure to have enough storage to meet my requirement if I need them.
Yes, because Amazon is a RESPONSIBLE sysadmin and has put in place tools
to manage the fact that they are thin-provisioning and to make damn sure they can
cash the checks they are writing.
Right.
Post by matthew patton
Post by Zdenek Kabelac
the nature of the block device, such as "how much space
do you *really* have left?"
So you're going to write and then backport "second guess the block layer"
code to all filesystems in common use and god knows how many versions back?
Of course not. Just try to get on the EXT developer mailing list and ask
them to write "block layer second-guessing code (aka branch on device
flag=thin)" because THINP will cause problems for the FS when it runs out
of extents. To which the obvious and correct response will be "Don't use
THINP if you're not prepared to handle its prerequisites."
Bad things happen. Sometimes they happen very quickly. I don't intend to
dare fate, but if fate comes knocking, I prefer to be prepared. For
example, we had two monitoring systems in place for one particularly
critical piece of storage, where the application is particularly poor at
dealing with "out of space". No thin volumes in use here. Thick volumes all
the way. The system on the storage appliance stopped sending notifications
a few weeks prior as a result of some mistake during a reconfiguration or
upgrade. The separate monitoring system using entirely different software
and configuration, on a different host, also failed for a different reason
that I no longer recall. The volume became full, and the application data
was corrupted in a bad way that required recovery. My immediate reaction
after best addressing the corruption, was to demand three monitoring
systems instead of two. :-)
Post by matthew patton
Post by Zdenek Kabelac
you and the other people. You think the block storage should
be as transparent as possible, as if the storage was not
thin. Others, including me, think that this theory is
impractical
Then by all means go ahead and retrofit all known filesystems with the
extra logic. ALL of the filesystems were written with the understanding
that the block layer is telling the truth and that any "white lie" was
benign in so much that it would be made good and thus could be assumed to
be "truth" for practical purpose.
I think this relates more closely to your other response, that I will
respond to separately...
--
Mark Mielke <***@gmail.com>
Xen
2016-05-03 18:19:21 UTC
Permalink
Post by Zdenek Kabelac
It's not 'continued' suggestion.
It's just the example of solution where 'filesystem & block layer'
are tied together. Every solution has some advantages and
disadvantages.
So what if more systems were tied together in that way? What would be
the result?

Tying together does not have to do away with layers.

It is not either/or, it is both/and.

You can have separate layers and you can have integration.

In practice all it would require is for the LVM, ext and XFS people to
agree.

You could develop extensions to the existing protocols that are only
used if both parties understand it.

Then pretty much btrfs has no raison d'ĂȘtre anymore. You would have an
integrated system but people can retain their own identities as much as
they want.

From what you say LVM+ext4/XFS is already a partner system anyway.

It is CLEAR LVM+BTRFS or LVM+ZFS is NOT a popular system.

You can and you could but it does not synergize. OpenSUSE uses btrfs by
default and I guess they use LVM just as well. For LVM you want a
simpler filesystem that does its own work.

(At the same time I am not so happy with the RAID capability of LVM, nor
do I care much at this point).

LVM raid seems to me to be the third solution, after firmware raid and
regular dmraid.

I prefer to use LVM on top of raid really. But maybe that's not very
helpful.
Post by Zdenek Kabelac
So far I'm convinced layered design gives user more freedom - for the
price
of bigger space usage.
Well let's stop directing people to btrfs then.

Linux people have a tendency and habit to send people from pillar to
post.

You know what that means.

It means 50% of answers you get are redirects.

They think it's efficient to spend their time redirecting you or wasting
your time in other ways, rather than using the same time and energy
answering your question.

If the social Linux system was a filesystem, people would run benchmarks
and complain that its organisation is that of a lunatic.

Where 50% of read requests get directed to another sector, of which 50%
again get redirected, and all for no purpose really.

Write requests get 90% deflected. The average number of write requests
before you hit your target is about ... it converges exactly to 10.

If I had been better at math I would have known that :p.

You say:

"Please don't compare software to real life".

No, let's compare the social world to technology. We have very bad
technology if you look at it like that. Which in turn doesn't make the
"real" technology much better.



SUM( i * p * (1-p)^(i-1) ) for i = 1 to infinity = 1/p,

with p the chance of success at each attempt.

So the expected number of attempts before a success is 1/p.

With a hit chance of 10% per attempt (90% of requests deflected), the
average number of attempts before success is 1/0.1 = 10, which matches
the figure above.

I'm not very brilliant today.
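
(A quick numeric check of that expectation, for anyone who wants to see the
series converge; the sketch below just sums it for a couple of values of p:)

def expected_attempts(p, terms=100000):
    # Mean of the geometric distribution: sum over i of i * p * (1-p)^(i-1).
    return sum(i * p * (1 - p) ** (i - 1) for i in range(1, terms + 1))

if __name__ == "__main__":
    for p in (0.1, 0.9):
        print(p, round(expected_attempts(p), 4), round(1 / p, 4))
    # p = 0.1 (90% of requests deflected) -> about 10 attempts on average;
    # p = 0.9 -> about 1.11.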
matthew patton
2016-05-04 14:55:33 UTC
Permalink
I get a bit lost here in the push towards BTRFS and ZFS for people with these expectations as
I see BTRFS and ZFS as having a similar problem. They can both still fill up.
Well of course everything fills up eventually. BTRFS and ZFS are integrated systems where the FS can see into the block layer and "do" block layer activities vs the clear demarcation between XFS/EXT and LVM/MD.

If you write too much to a Thin FS today you get serious data loss. Oh sure, the metadata might have landed but the file contents sure didn't. Somebody (you?) mentioned how you seemingly were able to write 4x90GB to a 300GB block device and the FS fsck'd successfully. This doesn't happen in BTRFS/ZFS and friends. At 300.001GB you would have gotten a write error and the write operation would not have succeeded.