Discussion:
[linux-lvm] LVM thin provisioning query
Bhasker C V
2016-04-27 12:33:05 UTC
Permalink
Hi,

I am starting to investigate LVM thin provisioning
(repeat post from
https://lists.debian.org/debian-user/2016/04/msg00852.html )
(apologies for html mail)

I have done the following

1. Create a PV
vdb 252:16 0 10G 0 disk
├─vdb1 252:17 0 100M 0 part
└─vdb2 252:18 0 9.9G 0 part
***@vmm-deb:~# pvcreate /dev/vdb1
Physical volume "/dev/vdb1" successfully created.
***@vmm-deb:~# pvs
PV VG Fmt Attr PSize PFree
/dev/vdb1 lvm2 --- 100.00m 100.00m

2. Create a VG
***@vmm-deb:~# vgcreate virtp /dev/vdb1
Volume group "virtp" successfully created
***@vmm-deb:~# vgs
VG #PV #LV #SN Attr VSize VFree
virtp 1 0 0 wz--n- 96.00m 96.00m

3. Create an LV pool and an over-provisioned volume inside it
***@vmm-deb:~# lvcreate -n virtpool -T virtp/virtpool -L40M
Logical volume "virtpool" created.
***@vmm-deb:~# lvs
LV       VG    Attr       LSize  Pool Origin Data% Meta% Move Log Cpy%Sync Convert
virtpool virtp twi-a-tz-- 40.00m             0.00  0.88

***@vmm-deb:~# lvcreate -V1G -T virtp/virtpool -n vol01
WARNING: Sum of all thin volume sizes (1.00 GiB) exceeds the size of thin
pool virtp/virtpool and the size of whole volume group (96.00 MiB)!
For thin pool auto extension activation/thin_pool_autoextend_threshold
should be below 100.
Logical volume "vol01" created.
***@vmm-deb:~# lvs
LV       VG    Attr       LSize  Pool     Origin Data% Meta% Move Log Cpy%Sync Convert
virtpool virtp twi-aotz-- 40.00m                 0.00  0.98
vol01    virtp Vwi-a-tz--  1.00g virtpool        0.00
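
(For reference, the warning above about activation/thin_pool_autoextend_threshold refers to the
auto-extension settings in /etc/lvm/lvm.conf. A minimal sketch of inspecting them, assuming your
lvm2 build ships the lvmconfig command - older releases spell it 'lvm dumpconfig':

# lvmconfig activation/thin_pool_autoextend_threshold
# lvmconfig activation/thin_pool_autoextend_percent

Setting the threshold below 100, e.g. 70, together with a non-zero autoextend percent lets
dmeventd grow the pool automatically as long as the VG still has free space.)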


---------- Now the operations
# dd if=/dev/urandom of=./fil status=progress
90532864 bytes (91 MB, 86 MiB) copied, 6.00005 s, 15.1 MB/s^C
188706+0 records in
188705+0 records out
96616960 bytes (97 MB, 92 MiB) copied, 6.42704 s, 15.0 MB/s

# df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/virtp-vol01 976M 95M 815M 11% /tmp/x
# sync
# cd ..
***@vmm-deb:/tmp# umount x
***@vmm-deb:/tmp# fsck.ext4 -f -C0 /dev/virtp/vol01
e2fsck 1.43-WIP (15-Mar-2016)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/virtp/vol01: 12/65536 files (8.3% non-contiguous), 36544/262144 blocks


<mount>
# du -hs fil
93M fil

# dd if=./fil of=/dev/null status=progress
188705+0 records in
188705+0 records out
96616960 bytes (97 MB, 92 MiB) copied, 0.149194 s, 648 MB/s


# vgs
VG #PV #LV #SN Attr VSize VFree
virtp 1 2 0 wz--n- 96.00m 48.00m

Definitely, the file is occupying 90+ MB.

What I expect is that the pool is 40M and the file must NOT exceed 40M.
Where does the file get 93M of space?
I know the VG is 96M but the pool created was max 40M (and the VG still says
48M free). Is the file exceeding the boundaries, or am I doing something wrong?
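
(A quick way to see how much of the pool the data actually consumed is to query the pool's
Data% directly; a minimal sketch using standard lvs report fields:

# lvs -o lv_name,lv_size,data_percent,metadata_percent virtp
)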
Zdenek Kabelac
2016-04-27 14:33:19 UTC
Permalink
Post by Bhasker C V
Definitely, the file is occupying 90+ MB.
What I expect is that the pool is 40M and the file must NOT exceed 40M. Where
does the file get 93M of space?
I know the VG is 96M but the pool created was max 40M (and the VG still says 48M
free). Is the file exceeding the boundaries, or am I doing something wrong?
Hi

The answer is simple -> nowhere - they are simply lost - check your kernel dmesg
log and you will spot lots of async write errors.
(page cache is tricky here... - dd ends up just in the page-cache, which is later
asynchronously synced to disk)

There is also a 60s delay before the thin-pool target starts to error all queued
write operations when there is not enough space in the pool.

So whenever you write something and you want to be 100% 'sure' it landed on
disk, you have to 'sync' your writes.

i.e.
dd if=/dev/urandom of=./fil status=progress conv=fsync

and if you want to know exactly where the error occurs:

dd if=/dev/urandom of=./fil status=progress oflag=direct
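
(To confirm this after such a run - a minimal sketch; the grep pattern matches the kernel's
buffered-write error message, and the dm name follows the usual <vg>-<pool>-tpool pattern,
so adjust it if yours differs:

# dmesg | grep 'lost async page write'
# dmsetup status virtp-virtpool-tpool
)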

Regards

Zdenek
Bhasker C V
2016-04-28 14:36:23 UTC
Permalink
Zdenek,
Thanks. Here I am just filling it up with random data, so I am not
concerned about data integrity.
You are right, I did get lost-page write errors in the kernel log.

The question, however, is that even after a reboot and several fsck runs on the
ext4 filesystem, the file size "occupied" is more than the pool size. How is this?
I agree that the data may be corrupted, but there *is* some data and it must
be saved somewhere. Why is this "somewhere" exceeding the pool size?
Post by Zdenek Kabelac
Hi
The answer is simple -> nowhere - they are simply lost - check your kernel
dmesg log and you will spot lots of async write errors.
(page cache is tricky here... - dd ends up just in the page-cache, which is
later asynchronously synced to disk)
There is also a 60s delay before the thin-pool target starts to error all queued
write operations when there is not enough space in the pool.
So whenever you write something and you want to be 100% 'sure' it landed
on disk, you have to 'sync' your writes.
i.e.
dd if=/dev/urandom of=./fil status=progress conv=fsync
and if you want to know exactly where the error occurs:
dd if=/dev/urandom of=./fil status=progress oflag=direct
Regards
Zdenek
Zdenek Kabelac
2016-04-29 08:13:40 UTC
Permalink
Post by Bhasker C V
Zdenek,
Thanks. Here I am just filling it up with random data, so I am not
concerned about data integrity.
You are right, I did get lost-page write errors in the kernel log.
The question, however, is that even after a reboot and several fsck runs on the
ext4 filesystem, the file size "occupied" is more than the pool size. How is this?
I agree that the data may be corrupted, but there *is* some data and it must be
saved somewhere. Why is this "somewhere" exceeding the pool size?
Hi

A few key principles -


1. You should always mount an extX fs with errors=remount-ro (tune2fs, mount -
see the example after this list)

2. There are a few data={} modes ensuring various degrees of data integrity.
In case you really care about data integrity here, switch to 'journal'
mode at the price of lower speed; the default 'ordered' mode might show this.
(i.e. it's the very same behavior as you would have seen with a failing hdd)

3. Do not continue using a thin-pool when it's full :)

4. We do miss more configurable policies for thin-pools,
i.e. we do plan to instantiate an 'error' target for writes in case the
pool gets full - so ALL writes will be errored. As of now, writes
to provisioned blocks may cause further filesystem confusion - that's
why 'remount-ro' is rather mandatory - xfs is recently being enhanced
to provide similar logic.
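
(For points 1 and 2, on the volume from this thread that would look roughly like the
following sketch - data=journal is shown only as the stricter option, not as a general
recommendation:

# tune2fs -e remount-ro /dev/virtp/vol01
# mount -o errors=remount-ro,data=journal /dev/virtp/vol01 /tmp/x
)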


Regards


Zdenek
Bhasker C V
2016-05-03 06:59:17 UTC
Permalink
Does this mean ext4 is showing wrong information? The file is reported
as being 90+ MB but in actuality the size is less in the FS?
This is quite OK, because it is just that filesystem being affected. I was,
however, concerned that the file in this FS might have overwritten other LV
data, since the file shows as bigger than the volume size.

I will try this using BTRFS.
Post by Zdenek Kabelac
Hi
A few key principles -
1. You should always mount an extX fs with errors=remount-ro (tune2fs, mount)
2. There are a few data={} modes ensuring various degrees of data integrity.
In case you really care about data integrity here, switch to 'journal'
mode at the price of lower speed; the default 'ordered' mode might show this.
(i.e. it's the very same behavior as you would have seen with a failing hdd)
3. Do not continue using a thin-pool when it's full :)
4. We do miss more configurable policies for thin-pools,
i.e. we do plan to instantiate an 'error' target for writes in case the
pool gets full - so ALL writes will be errored. As of now, writes
to provisioned blocks may cause further filesystem confusion - that's
why 'remount-ro' is rather mandatory - xfs is recently being enhanced
to provide similar logic.
Regards
Zdenek
Zdenek Kabelac
2016-05-03 09:54:56 UTC
Permalink
Post by Bhasker C V
Does this mean ext4 is showing wrong information? The file is reported
as being 90+ MB but in actuality the size is less in the FS?
This is quite OK, because it is just that filesystem being affected. I was,
however, concerned that the file in this FS might have overwritten other LV
data, since the file shows as bigger than the volume size.
I've no idea what 'ext4' is showing you, but even if you have e.g. a 100M
filesystem, you could still have a 1TB file there. Experience the magic:

'truncate -s 1T myfirst1TBfile'

As you can see, 'ext4' is doing its own over-provisioning with 'hole' files.
The only important bits are:
- is the filesystem consistent?
- is 'fsck' not reporting any error?

What's the 'real' size you get with 'du myfirst1TBfile', or with your suspect file?

Somehow I don't believe you can get e.g. a 90+MB 'du' size on a 10MB
filesystem and have 'fsck' not report any problem.
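
(To see the difference between apparent and allocated size for such a 'hole' file, using
the file from the truncate example above - ls reports the apparent 1T size, while du
reports only the few blocks actually allocated:

# truncate -s 1T myfirst1TBfile
# ls -lh myfirst1TBfile
# du -h myfirst1TBfile
)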
Post by Bhasker C V
I will try this using BTRFS.
For what exactly ??

Regards

Zdenek
Bhasker C V
2016-05-03 12:21:27 UTC
Permalink
Here are the answers to your questions

1. fsck does not report any error and the file contained inside the FS is
definitely greater than the allocatable LV size
# fsck.ext4 -f -C0 /dev/virtp/vol01
e2fsck 1.43-WIP (15-Mar-2016)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/virtp/vol01: 12/65536 files (8.3% non-contiguous), 30492/262144 blocks

2. Size of the file

# du -hs fil
69M fil

(please note here that the LV virtual size is 1G but the parent pool size
is just 40M; I expect the file not to exceed 40M at any cost.)

3. lvs
# lvs
LV       VG    Attr       LSize  Pool     Origin Data%  Meta% Move Log Cpy%Sync Convert
virtpool virtp twi-aotzD- 40.00m                 100.00 1.37
vol01    virtp Vwi-aotz--  1.00g virtpool


You can do this on any virtual machine. I use qemu with virtio back-end.
Zdenek Kabelac
2016-05-03 14:49:46 UTC
Permalink
Post by Bhasker C V
Here are the answers to your questions
1. fsck does not report any error and the file contained inside the FS is
definitely greater than the allocatable LV size
# fsck.ext4 -f -C0 /dev/virtp/vol01
e2fsck 1.43-WIP (15-Mar-2016)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/virtp/vol01: 12/65536 files (8.3% non-contiguous), 30492/262144 blocks
2. Size of the file
# du -hs fil
69M fil
(please note here that the LV virtual size is 1G but the parent pool size is
just 40M; I expect the file not to exceed 40M at any cost.)
3. lvs
# lvs
LV       VG    Attr       LSize  Pool     Origin Data%  Meta% Move Log Cpy%Sync Convert
virtpool virtp twi-aotzD- 40.00m                 100.00 1.37
vol01    virtp Vwi-aotz--  1.00g virtpool
You can do this on any virtual machine. I use qemu with virtio back-end.
But this is a VERY different case.

Your filesystem IS 1GB in size, and ext4 provisions mostly all of its 'metadata'
during the first mount.

So the thin-pool usually has all the filesystem's metadata space 'available' for
updating, and if you use the mount option data=ordered (the default) it
happens that a 'write' to provisioned space is OK, while a write to 'data' space
gets lost as an async page.

And this all depends on how you are willing to write your data.

Basically, if you use the page-cache and ignore fdatasync() you NEVER know what
has been stored on disk (living in a dream world, basically)
(i.e. closing your program/file descriptor DOES NOT flush)
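
(A minimal sketch of the difference, with bs/count chosen only as an example to overrun
the 40M pool from this thread: with conv=fsync, dd calls fsync() before reporting success,
so pool exhaustion surfaces as a failed fsync and a non-zero exit instead of a silently
lost page:

# dd if=/dev/urandom of=./fil bs=1M count=64 conv=fsync status=progress
)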

When the thin-pool gets full and you have not managed to resize your data LV
in time, various things may go wrong - this is fuzzy, tricky land.

Now, a few people (me included) believe a thin volume should error ANY further
write once there has been an over-provisioning error on the device, and I'm afraid
this can't be solved anywhere else than in the target driver.
ATM the thin volume puts the filesystem into a very complex situation which does
not have a 'winning' scenario in a number of cases - so we need to define a number
of policies.

BUT ATM we clearly communicate that when you run OUT of thin-pool space
it's a serious ADMIN failure - and we can only try to limit the damage.

An overfull thin-pool CANNOT be compared to writing to a full filesystem,
and there is absolutely no guarantee about the content of non-flushed files!

Expecting that you can run out of space in a thin-pool and nothing bad happens is
naive ATM - we are cooperating at least with the XFS/ext4 developers to solve some
corner cases, but there is still a lot of work to do, as we exercise quite
unusual error paths for them.


Zdenek
Xen
2016-05-03 15:51:34 UTC
Permalink
Post by Zdenek Kabelac
Expecting that you can run out of space in a thin-pool and nothing bad
happens is naive ATM - we are cooperating at least with the XFS/ext4
developers to solve some corner cases, but there is still a lot of work
to do, as we exercise quite unusual error paths for them.
You also talked about seeing if you could have these filesystems work
more in alignment with block (extent) boundaries, right?

I mean something that agrees more with allocation requests, so to speak.
Zdenek Kabelac
2016-05-03 16:27:16 UTC
Permalink
Post by Zdenek Kabelac
Expecting that you can run out of space in a thin-pool and nothing bad
happens is naive ATM - we are cooperating at least with the XFS/ext4
developers to solve some corner cases, but there is still a lot of work
to do, as we exercise quite unusual error paths for them.
You also talked about seeing if you could have these filesystems work more in
alignment with block (extent) boundaries, right?
Yes, it's mostly about 'space' efficiency.

i.e. it's inefficient to provision 1M thin-pool chunks when the filesystem then
uses just 1/2 of a provisioned chunk and allocates the next one.
The smaller the chunk, the better the space efficiency gets (especially with
snapshots), but smaller chunks may need lots of metadata and may cause fragmentation
troubles.

ATM a thin-pool supports a single chunksize - so again it's up to the admin to pick
the right one for their needs.
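
(The chunk size is fixed when the pool is created; a minimal sketch, with the pool
name and sizes purely illustrative:

# lvcreate -T virtp/pool64k -L 40M --chunksize 64k
# lvs -o lv_name,chunk_size virtp
)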

For read/write alignment, the physical geometry is still the limiting factor.


Zdenek
Gionatan Danti
2016-05-03 17:07:36 UTC
Permalink
Post by Zdenek Kabelac
Now, a few people (me included) believe a thin volume should error ANY
further write once there has been an over-provisioning error on the device,
and I'm afraid this can't be solved anywhere else than in the target driver.
ATM the thin volume puts the filesystem into a very complex situation which
does not have a 'winning' scenario in a number of cases - so we need to
define a number of policies.
BUT ATM we clearly communicate that when you run OUT of thin-pool space
it's a serious ADMIN failure - and we can only try to limit the damage.
An overfull thin-pool CANNOT be compared to writing to a full filesystem,
and there is absolutely no guarantee about the content of non-flushed files!
True, but non-synced writes should always be treated as "this item can
be lost if power disappears / the system crashes" anyway. On the other
hand, (f)synced writes should already fail immediately if no space can
be allocated from the storage subsystem.

In other words, even with a full data pool, filesystem integrity by
itself should be guaranteed (both by journaling and fsync), while
non-flushed writes "maybe" survive (if the required data segment was
*already* allocated the write completes, otherwise it fails as a lost
async page).

For a full tmeta things are much worse, as it sometimes requires
thin_repair. (PS: if you have two free minutes, please see my other
email regarding full tmeta. Thanks in advance.)

This is my current understanding; please correct me if I am wrong!
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: ***@assyoma.it - ***@assyoma.it
GPG public key ID: FF5F32A8