Discussion:
[linux-lvm] Repair thin pool
Mars
2016-02-05 01:21:46 UTC
Permalink
Hi there,

We're using CentOS 7.0 with LVM 2.02.105 and ran into the following problem:
after a power outage in the datacenter room, the thin provisioning volumes
came up in a bad state:

[***@storage ~]# lvs -a
  dm_report_object: report function failed for field data_percent
  LV               VG               Attr       LSize  Pool     Origin Data%  Move Log Cpy%Sync Convert
  DailyBuild       vgg145155121036c Vwi-d-tz--  5.00t pool_nas
  dat              vgg145155121036c Vwi-d-tz-- 10.00t pool_nas
  lvol0            vgg145155121036c -wi-a----- 15.36g
  [lvol3_pmspare]  vgg145155121036c ewi------- 15.27g
  market           vgg145155121036c Vwi-d-tz--  3.00t pool_nas
  pool_nas         vgg145155121036c twi-a-tz-- 14.90t          0.00
  [pool_nas_tdata] vgg145155121036c Twi-ao---- 14.90t
  [pool_nas_tmeta] vgg145155121036c ewi-ao---- 15.27g
  share            vgg145155121036c Vwi-d-tz-- 10.00t pool_nas


The thin pool "pool_nas" and the regular LV "lvol0" are active, but the thin
provisioning volumes cannot be activated, even with "lvchange -ay
thin_volume_name".

To recover, we tried the following approaches, referring to these mailing
list conversations: http://www.spinics.net/lists/lvm/msg22629.html and
http://comments.gmane.org/gmane.linux.lvm.general/14828.

1. Using "lvconvert --repair vgg145155121036c/pool_nas":
The output is below, and the thin volumes still cannot be activated.
WARNING: If everything works, remove "vgg145155121036c/pool_nas_tmeta0".
WARNING: Use pvmove command to move "vgg145155121036c/pool_nas_tmeta" on
the best fitting PV.

2. Using manual repair steps:
2a: deactivate the thin pool.
2b: create a temporary LV "metabak".
2c: swap the thin pool's metadata LV: "lvconvert --thinpool
vgg145155121036c/pool_nas --poolmetadata metabak -y"; the command only goes
through with the "-y" option.
2d: activate the temporary LV "metabak" and create another, larger LV "metabak1".
2e: repair the metadata: "thin_restore -i /dev/vgg145155121036c/metabak -o
/dev/vgg145155121036c/metabak1", which ended in a segmentation fault.

So, is there any other way to recover this, or did we get some of the steps wrong?

Thank you very much.
Mars
M.H. Tsai
2016-02-05 11:44:46 UTC
Permalink
Hi,

It seems that your steps are wrong. You should run thin_repair before
swapping the pool metadata.
Also, thin_restore expects XML (text) input, not binary metadata, so a
segmentation fault is not surprising...

"lvconvert --repair ... " is a command wrapping "thin_repair +
metadata swapping" into a single step.
If it doesn't work, you might need to dump the metadata manually to check
whether there is serious corruption in the mapping trees.
(I recommend using the newest thin-provisioning-tools for better results.)

1. Activate the pool (it's okay if the command fails; we just want the
hidden metadata LV to be activated):
lvchange -ay vgg1/pool_nas

2. Dump the metadata, then check the output XML:
thin_dump /dev/mapper/vgg1-pool_nas_tmeta -o thin_dump.xml -r

I have experience repairing many seriously corrupted thin pools. If the
physical medium is okay, I think most cases are repairable.
I also wrote some extensions to thin-provisioning-tools (not yet published;
the code still needs some refinement...) that might help.


Ming-Hung Tsai
Post by Mars
Hi there,
dm_report_object: report function failed for field data_percent
LV VG Attr LSize Pool Origin Data% Move Log Cpy%Sync Convert
DailyBuild vgg145155121036c Vwi-d-tz-- 5.00t pool_nas
dat vgg145155121036c Vwi-d-tz-- 10.00t pool_nas
lvol0 vgg145155121036c -wi-a----- 15.36g
[lvol3_pmspare] vgg145155121036c ewi------- 15.27g
market vgg145155121036c Vwi-d-tz-- 3.00t pool_nas
pool_nas vgg145155121036c twi-a-tz-- 14.90t 0.00
[pool_nas_tdata] vgg145155121036c Twi-ao---- 14.90t
[pool_nas_tmeta] vgg145155121036c ewi-ao---- 15.27g
share vgg145155121036c Vwi-d-tz-- 10.00t pool_nas
the thin pool "pool_nas" and general lv "lvol0" are active, but thin provision volumes cannot be actived even with cmd "lvchange -ay thin_volume_name".
To recover it, we tried following ways refer to these mail conversations: http://www.spinics.net/lists/lvm/msg22629.html and http://comments.gmane.org/gmane.linux.lvm.general/14828.
1, USE: "lvconvert --repair vgg145155121036c/pool_nas"
output as below and thin volumes still cannot be active.
WARNING: If everything works, remove "vgg145155121036c/pool_nas_tmeta0".
WARNING: Use pvmove command to move "vgg145155121036c/pool_nas_tmeta" on the best fitting PV.
2a: inactive thin pool.
2b: create a temp lv "metabak".
2c: swap the thin pool's metadata lv: "lvconvert --thinpool vgg145155121036c/pool_nas --poolmetadata metabak -y", only with "-y" option can submit the command.
2d: active temp lv "metabak" and create another bigger lv "metabak1".
2e: repair metadata: "thin_restore -i /dev/vgg145155121036c/metabak-o /dev/vgg145155121036c/metabak1", and got segment fault.
So, is there any other way to recover this or some steps we do wrong?
Thank you very much.
Mars
_______________________________________________
linux-lvm mailing list
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
Zdenek Kabelac
2016-02-05 15:17:38 UTC
Permalink
Post by M.H. Tsai
Hi,
Seems that your steps are wrong. You should run thin_repair before
swapping the pool metadata.
Nope - actually they were correct.
Post by M.H. Tsai
Also, thin_restore is for XML(text) input, not for binary metadata
input, so it's normal to get segmentation fault...
"lvconvert --repair ... " is a command wrapping "thin_repair +
swapping metadata" into a single step.
If it doesn't work, then you might need to dump the metadata manually,
to check if there's serious corruption in mapping trees or not....
(I recommend to use the newest thin-provisioning-tools to get better result)
1. active the pool metadata (It's okay if the command failed. We just
want to activate the hidden metadata LV)
lvchange -ay vgg1/pool_nas
2. dump the metadata, then checkout the output XML
thin_dump /dev/mapper/vgg1-pool_nas_tmeta -o thin_dump.xml -r
Here is actually what goes wrong.

You should not try to access 'live' metadata (unless you take a thin-pool
snapshot of it, i.e. a metadata snapshot).

So by using thin_dump on a volume whose metadata is still changing you often
get 'corruptions' listed which do not actually exist.

That said - if your thin-pool got 'blocked' for whatever reason
(deadlock?) - reading such data, which cannot change anymore, could provide
the best-guess data you can get - so in some cases it depends on the use case
(e.g. your disk is dying and it may not come up at all after a reboot)...
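
For reference, a minimal sketch of that metadata-snapshot route for dumping a
live pool consistently; the dm device name "vgg1-pool_nas-tpool" is an
assumption (lvm2 normally exposes the thin-pool target as "<vg>-<pool>-tpool",
verify with "dmsetup ls"):

  # reserve a metadata snapshot inside the live pool, dump it, then release it
  dmsetup message vgg1-pool_nas-tpool 0 reserve_metadata_snap
  thin_dump --metadata-snap /dev/mapper/vgg1-pool_nas_tmeta -o thin_dump.xml
  dmsetup message vgg1-pool_nas-tpool 0 release_metadata_snap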
Post by M.H. Tsai
I have experience in repairing many seriously corrupted thin pools. If
the physical medium is okay, I think that most cases are repairable.
I also wrote some extension to thin-provisioning-tools (not yet
published. the code still need some refinement...), maybe it could
help.
You should always repair data only when you are sure it is not changing in
the background.

That's why --repair currently requires the thin-pool to be offline.
It should do all the 'swap' operations in the proper order.

Zdenek
M.H. Tsai
2016-02-05 16:12:52 UTC
Permalink
Post by Zdenek Kabelac
Post by M.H. Tsai
Hi,
Seems that your steps are wrong. You should run thin_repair before
swapping the pool metadata.
Nope - actually they were correct.
Post by M.H. Tsai
Also, thin_restore is for XML(text) input, not for binary metadata
input, so it's normal to get segmentation fault...
"lvconvert --repair ... " is a command wrapping "thin_repair +
swapping metadata" into a single step.
If it doesn't work, then you might need to dump the metadata manually,
to check if there's serious corruption in mapping trees or not....
(I recommend to use the newest thin-provisioning-tools to get better result)
1. active the pool metadata (It's okay if the command failed. We just
want to activate the hidden metadata LV)
lvchange -ay vgg1/pool_nas
2. dump the metadata, then checkout the output XML
thin_dump /dev/mapper/vgg1-pool_nas_tmeta -o thin_dump.xml -r
Here is actually what goes wrong.
You should not try to access 'life' metadata (unless you take thin-pool
snapshot of them)
So by using thin-dump on life changed volume you often get 'corruptions'
listed which actually do not exist.
That said - if your thin-pool got 'blocked' for whatever reason
(deadlock?) - reading such data which cannot be changed anymore could
provide the 'best' guess data you could get - so in some cases it depends on
use-case
(i.e. you disk is dying and it may not run at all after reboot)...
You should always repair data where you are sure they are not changing in
background.
That's why --repair requires currently offline state of thin-pool.
It should do all 'swap' operations in proper order.
Zdenek
Yes, we should repair the metadata when the pool is offline, but LVM
cannot activate a hidden metadata LV. So the easiest way is activating
the entire pool. Maybe we need some option to force-activate a hidden
volume, like "lvchange -ay vgg1/pool_nas_tmeta -ff"; that would be useful
for repairing metadata. Otherwise, we have to use dmsetup to manually
create the device.
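
A hedged sketch of that dmsetup route; device names are examples and the
table values are placeholders in <> that must be derived from the VG layout:

  # find the PV, extent range and size backing the hidden metadata LV
  lvs -a --segments -o lv_name,seg_pe_ranges,seg_size vgg1
  # build a read-only linear mapping over the same area and inspect it
  dmsetup create pool_nas_tmeta_ro --readonly \
      --table "0 <tmeta_sectors> linear <pv_device> <start_sector>"
  thin_check /dev/mapper/pool_nas_tmeta_ro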

In my experience, if the metadata has a serious problem, the pool device
usually cannot be created, so the metadata is not being accessed by the
kernel... Just a coincidence.


Ming-Hung Tsai
Zdenek Kabelac
2016-02-05 17:28:10 UTC
Permalink
Post by M.H. Tsai
Post by Zdenek Kabelac
Post by M.H. Tsai
Hi,
Seems that your steps are wrong. You should run thin_repair before
swapping the pool metadata.
Nope - actually they were correct.
Post by M.H. Tsai
Also, thin_restore is for XML(text) input, not for binary metadata
input, so it's normal to get segmentation fault...
"lvconvert --repair ... " is a command wrapping "thin_repair +
swapping metadata" into a single step.
If it doesn't work, then you might need to dump the metadata manually,
to check if there's serious corruption in mapping trees or not....
(I recommend to use the newest thin-provisioning-tools to get better result)
1. active the pool metadata (It's okay if the command failed. We just
want to activate the hidden metadata LV)
lvchange -ay vgg1/pool_nas
2. dump the metadata, then checkout the output XML
thin_dump /dev/mapper/vgg1-pool_nas_tmeta -o thin_dump.xml -r
Here is actually what goes wrong.
You should not try to access 'life' metadata (unless you take thin-pool
snapshot of them)
So by using thin-dump on life changed volume you often get 'corruptions'
listed which actually do not exist.
That said - if your thin-pool got 'blocked' for whatever reason
(deadlock?) - reading such data which cannot be changed anymore could
provide the 'best' guess data you could get - so in some cases it depends on
use-case
(i.e. you disk is dying and it may not run at all after reboot)...
You should always repair data where you are sure they are not changing in
background.
That's why --repair requires currently offline state of thin-pool.
It should do all 'swap' operations in proper order.
Zdenek
Yes, we should repair the metadata when the pool is offline, but LVM
cannot activate a hidden metadata LV. So the easiest way is activating
the entire pool. Maybe we need some option to force activate a hidden
volume, like "lvchange -ay vgg1/pool_nas_tmeta -ff". It's useful for
repairing metadata. Otherwise, we should use dmsetup to manually
create the device.
But that's actually what the described 'swap' is for.

You 'replace/swap' the existing metadata LV with some selected LV in the VG.

Then you activate this LV - and you may do whatever you need to do.
(so you have the content of the _tmeta LV accessible through your temporarily
created LV)

lvm2 currently doesn't support activation of 'subLVs', as it makes activation
of the whole tree of LVs much more complicated (clvmd support restrictions).

So ATM we take only the top-level LV lock in a cluster (and yes - there is
still an unresolved bug for thin-pool/thinLV - a user may 'try' to activate
different thin LVs from a single thin-pool on multiple nodes - so for now
there is just one piece of advice - don't do that - until we provide a fix
for this).
Post by M.H. Tsai
In my experience, if the metadata had serious problem, then the pool
device usually cannot be created, so the metadata is not accessed by
kernel... Just a coincidence.
So once you e.g. 'repair' metadata from the swapped-out LV into some other
LV, you can swap the 'fixed' metadata back in (and of course there should
(and someday will) be further validation between kernel metadata and lvm2
metadata: device IDs, transaction IDs, device sizes....)

This way you may even make the metadata smaller if you need to (in case you
selected too large a metadata area initially and don't want to waste space
on this LV).
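
A minimal shell sketch of that swap -> repair -> swap-back cycle, assuming
hypothetical LV names "meta_old" and "meta_fixed"; the VG name and sizes are
placeholders taken from this thread and must be adapted:

  lvchange -an vgg1/pool_nas                                       # thin-pool must be offline
  lvcreate -L 15.27g -n meta_old vgg1                              # placeholder size: match the existing _tmeta
  lvchange -an vgg1/meta_old
  lvconvert --thinpool vgg1/pool_nas --poolmetadata meta_old -y    # swap: meta_old now holds the damaged _tmeta content
  lvcreate -L 15.27g -n meta_fixed vgg1                            # destination for the repaired copy
  lvchange -ay vgg1/meta_old                                       # expose the damaged metadata for thin_repair
  thin_repair -i /dev/vgg1/meta_old -o /dev/vgg1/meta_fixed        # binary-to-binary repair
  lvchange -an vgg1/meta_fixed
  lvconvert --thinpool vgg1/pool_nas --poolmetadata meta_fixed -y  # swap the repaired copy back in
  lvchange -ay vgg1/pool_nas                                       # then try activating the thin LVs again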

Zdenek
M.H. Tsai
2016-02-06 13:14:04 UTC
Permalink
Post by Zdenek Kabelac
But that's actually what described 'swap' is for.
You 'replace/swap' existing metadata LV with some selected LV in VG.
Then you activate this LV - and you may do whatever you need to do.
(so you have content of _tmeta LV accessible through your tmp_created_LV)
I forgot that we can use swapping to make _tmeta visible. The steps on this
page are correct:
http://www.spinics.net/lists/lvm/msg22629.html
The only typo in that post is that thin_repair should be used instead of
thin_restore:
http://permalink.gmane.org/gmane.linux.lvm.general/14829
Post by Zdenek Kabelac
lvm2 currently doesn't support activation of 'subLVs' as it makes
activation of the whole tree of LVs much more complicated (clvmd support
restrictions)
So ATM we take only top-level LV lock in cluster (and yes - there is still
unresolved bug for thin-pool/thinLV - when user may 'try' to activate
different thin LVs from a single thin-pool on multiple nodes - so for now -
there is just one advice - don't do that - until we provide a fix for this.
I didn't try clvm. Thanks for noticing.


Ming-Hung Tsai
Joe Thornber
2016-02-08 08:56:02 UTC
Permalink
Post by M.H. Tsai
I also wrote some extension to thin-provisioning-tools (not yet
published. the code still need some refinement...), maybe it could
help.
I'd definitely like to see what you changed please.

- Joe
M.H. Tsai
2016-02-08 18:03:39 UTC
Permalink
Post by Joe Thornber
Post by M.H. Tsai
I also wrote some extension to thin-provisioning-tools (not yet
published. the code still need some refinement...), maybe it could
help.
I'd definitely like to see what you changed please.
- Joe
I wrote some tools for "semi-auto" repair, called thin_ll_dump and
thin_ll_restore (low-level dump & restore), that can find orphan nodes
and reconstruct the metadata from them. They can cope with cases where
the top-level data mapping tree or some higher-level nodes are broken,
complementing the repair feature of thin_repair.

Although users need some knowledge of dm-thin metadata before using these
tools (you have to specify which orphan nodes to use), I think they are
useful for system administrators. Most thin-pool corruption cases I have
seen (caused by power loss, broken disks, RAID corruption, etc.) cannot be
handled by the current thin-provisioning-tools -- thin_repair is fully
automatic, but it simply skips broken nodes. However, the missing mappings
can often be found in orphan nodes.

Also, I wrote another tool called thin_scan, which shows the entire metadata
layout and scans for broken nodes (an enhanced version of thin_show_block in
the low_level_examine_metadata branch -- I didn't notice that before...
maybe the name thin_show_block sounds clearer?)

What do you think about these features? Are they worth merging upstream?


Thanks,
Ming-Hung Tsai
Joe Thornber
2016-02-10 10:32:49 UTC
Permalink
Post by M.H. Tsai
Post by Joe Thornber
Post by M.H. Tsai
I also wrote some extension to thin-provisioning-tools (not yet
published. the code still need some refinement...), maybe it could
help.
I'd definitely like to see what you changed please.
- Joe
I wrote some tools to do "semi-auto" repair, called thin_ll_dump and
thin_ll_restore (low-level dump & restore), that can find orphan nodes
and reconstruct the metadata using orphan nodes. It could cope the cases
that the top-level data mapping tree or some higher-level nodes were
broken, to complement the repairing feature of thin_repair.
Although that users are required to have knowledge about dm-thin metadata
before using these tools (you need to specify which orphan node to use), I
think that these tools are useful for system administrators. Most thin-pool
corruption cases I experienced (caused by power lost, broken disks, RAID
corruption, etc.) cannot be handled by the current thin-provisioning-tools
-- thin_repair is fully automatic, but it just skips broken nodes.
However, those missing mappings could be found in orphan nodes.
Also, I wrote another tool called thin_scan, to show the entire metadata
layout and scan broken nodes. (which is an enhanced version of
thin_show_block in branch low_level_examine_metadata -- I didn't notice
that before... maybe the name thin_show_block sounds more clear?)
What do you think about these features? Are they worth to be merged to the
upstream?
Yep, I definitely want these for upstream. Send me what you've got,
whatever state it's in; I'll happily spend a couple of weeks tidying
this.

- Joe
M.H. Tsai
2016-02-14 08:54:56 UTC
Permalink
Post by Joe Thornber
Yep, I definitely want these for upstream. Send me what you've got,
whatever state it's in; I'll happily spend a couple of weeks tidying
this.
- Joe
The feature is complete and working, but the code is based on v0.4.1.
I need a few days to clean up and rebase. Please wait.

syntax:
thin_ll_dump /dev/mapper/corrupted_tmeta [-o thin_ll_dump.xml]
thin_ll_restore -i edited_thin_ll_dump.xml -E /dev/mapper/corrupted_tmeta \
    -o /dev/mapper/fixed_tmeta


Ming-Hung Tsai
M.H. Tsai
2016-02-06 14:10:59 UTC
Permalink
Hi,

Let's review your question again. You have already run "lvconvert --repair",
so the volume pool_nas_tmeta0 now holds the original metadata (if you didn't
swap the metadata again). You can run thin_check and thin_dump on
pool_nas_tmeta0 to find out why thin_repair doesn't work.

thin_check /dev/mapper/vgg145155121036c-pool_nas_tmeta0 > thin_check.log 2>&1
thin_dump /dev/mapper/vgg145155121036c-pool_nas_tmeta0 -o thin_dump.xml -r


Ming-Hung Tsai
Post by Mars
Hi there,
After a electricity powerdown in the datacenter room, thin provision volumes
dm_report_object: report function failed for field data_percent
LV               VG               Attr       LSize  Pool     Origin Data%  Move Log Cpy%Sync Convert
DailyBuild       vgg145155121036c Vwi-d-tz--  5.00t pool_nas
dat              vgg145155121036c Vwi-d-tz-- 10.00t pool_nas
lvol0            vgg145155121036c -wi-a----- 15.36g
[lvol3_pmspare]  vgg145155121036c ewi------- 15.27g
market           vgg145155121036c Vwi-d-tz--  3.00t pool_nas
pool_nas         vgg145155121036c twi-a-tz-- 14.90t          0.00
[pool_nas_tdata] vgg145155121036c Twi-ao---- 14.90t
[pool_nas_tmeta] vgg145155121036c ewi-ao---- 15.27g
share            vgg145155121036c Vwi-d-tz-- 10.00t pool_nas
the thin pool "pool_nas" and general lv "lvol0" are active, but thin
provision volumes cannot be actived even with cmd "lvchange -ay
thin_volume_name".
http://www.spinics.net/lists/lvm/msg22629.html and
http://comments.gmane.org/gmane.linux.lvm.general/14828.
1, USE: "lvconvert --repair vgg145155121036c/pool_nas"
output as below and thin volumes still cannot be active.
WARNING: If everything works, remove "vgg145155121036c/pool_nas_tmeta0".
WARNING: Use pvmove command to move "vgg145155121036c/pool_nas_tmeta" on the
best fitting PV.
2a: inactive thin pool.
2b: create a temp lv "metabak".
2c: swap the thin pool's metadata lv: "lvconvert --thinpool
vgg145155121036c/pool_nas --poolmetadata metabak -y", only with "-y" option
can submit the command.
2d: active temp lv "metabak" and create another bigger lv "metabak1".
2e: repair metadata: "thin_restore -i /dev/vgg145155121036c/metabak-o
/dev/vgg145155121036c/metabak1", and got segment fault.
So, is there any other way to recover this or some steps we do wrong?
Thank you very much.
Mars
Mars
2016-02-17 02:48:23 UTC
Permalink
Post by Joe Thornber
Yep, I definitely want these for upstream. Send me what you've got,
whatever state it's in; I'll happily spend a couple of weeks tidying
this.
- Joe
The feature is complete and working, but the code is based on v0.4.1.
I need a few days to clean up and rebase. Please wait.
syntax:
thin_ll_dump /dev/mapper/corrupted_tmeta [-o thin_ll_dump.xml]
thin_ll_restore -i edited_thin_ll_dump.xml -E /dev/mapper/corrupted_tmeta -o /dev/mapper/fixed_tmeta

Ming-Hung Tsai

-------------

Hi,

Thank you very much for all the advice.


Here is some progress based on your mail conversation:

1. Check the metadata device:

[***@stor14 home]# thin_check /dev/mapper/vgg145155121036c-pool_nas_tmeta0
examining superblock
examining devices tree
examining mapping tree

2. Dump the metadata info:

[***@stor14 home]# thin_dump /dev/mapper/vgg145155121036c-pool_nas_tmeta0 -o nas_thin_dump.xml -r
[***@stor14 home]# cat nas_thin_dump.xml
<superblock uuid="" time="1787" transaction="3545" data_block_size="128" nr_data_blocks="249980672">
</superblock>

Compared with other normal pools, it seems that all the device nodes and
mapping info in the metadata LV have been lost.

Could it be that there are 'orphan nodes' here? And could you give us your
semi-auto repair tools so we can repair it?


Thank you very much!

Mars
M.H. Tsai
2016-02-17 09:29:27 UTC
Permalink
Post by Mars
Hi,
Thank you very much for all the advice.
[***@stor14 home]# thin_check /dev/mapper/vgg145155121036c-pool_nas_tmeta0
examining superblock
examining devices tree
examining mapping tree
[***@stor14 home]# thin_dump /dev/mapper/vgg145155121036c-pool_nas_tmeta0 -o nas_thin_dump.xml -r
<superblock uuid="" time="1787" transaction="3545" data_block_size="128" nr_data_blocks="249980672">
</superblock>
Compared with other normal pools, it seems that all the device nodes and
mapping info in the metadata LV have been lost.
Two possibilities: the device details tree was broken, or, worse, the data
mapping tree was broken.
Post by Mars
Could it be that there are 'orphan nodes' here? And could you give us your
semi-auto repair tools so we can repair it?
Sorry, the code is not finished. Please try my binary first (static binary
compiled on Ubuntu 14.04):
https://www.dropbox.com/s/6g8gm1hndxp3rpd/pdata_tools?dl=0

Please provide the output of thin_ll_dump:
./pdata_tools thin_ll_dump /dev/mapper/vgg145155121036c-pool_nas_tmeta0 -o
nas_thin_ll_dump.xml
(It takes a few minutes, since it scans through the entire metadata (16GB!).
I'll improve that later.)


Ming-Hung Tsai
M.H. Tsai
2016-02-21 15:41:38 UTC
Permalink
Hi,

I updated the program with some bug fixes. Please download it again:
https://www.dropbox.com/s/6g8gm1hndxp3rpd/pdata_tools?dl=0

Here's a quick guide to manually repairing the metadata, if thin_repair doesn't help.

1. Run thin_scan to do some basic checking

./pdata_tools thin_scan <metadata> [-o <output.xml>]

The output contains information about:
(1) the metadata blocks' type, properties, and integrity
(2) metadata utilization, so you can ignore the rest of the metadata
    (usually, the last utilized block is an index_block)

Example output:
<single_block type="superblock" location="0" ref_count="4294967295" \
is_valid="1"/>
<range_block type="bitmap_block" location_begin="1" blocknr_begin="1" \
length="3" ref_count="4294967295" is_valid="1"/>
...
<single_block type="index_block" location="26268" blocknr="26268" \
ref_count="4294967295" is_valid="1"/>


2. Check data mapping tree and device details tree

If you don't know how to use thin_debug or the superblock's layout,
you can use thin_ll_dump to obtain the tree roots:

./pdata_tools thin_ll_dump <metadata> [-o <output.xml>] \
[--end <last_utilized_block+1>]

Example output:
<superblock blocknr="0" data_mapping_root="25036" \
device_details_root="25772">
...
</superblock>
<orphans>
...
</orphans>

According to thin_scan's output, we know that the data_mapping_root and
device_details_root point to a wrong location. That's why thin_dump doesn't
work.

<range_block type="btree_leaf" location_begin="25031" blocknr_begin="25031" \
length="7" ref_count="4" is_valid="1" value_size="8"/>
...
<range_block type="btree_leaf" location_begin="25772" blocknr_begin="25772" \
length="2" ref_count="4" is_valid="1" value_size="4"/>


3. Find the correct data mapping root and device details root

(1) If you are using LVM, run lvs to learn the thin device ids (a sketch is
    given after this step). The device id is the key into both the data
    mapping tree and the device details tree. Try to find the nodes whose
    key ranges contain the device ids (see thin_scan's output).
(2) For the device details tree, if you have fewer than 127 thin volumes,
    the tree root is also a leaf. Check the nodes with value_size="24".

Example:
(1) data_mapping_root = 22917 or 25316
(see thin_ll_dump's output)
<node blocknr="22917" flags="2" key_begin="1" key_end="105" \
nr_entries="74"/>
<node blocknr="25316" flags="2" key_begin="1" key_end="105" \
nr_entries="74"/>

(2) device_details_root = 26263 or 26267
(see thin_scan's output)
<single_block type="btree_leaf" location="26263" blocknr="26263" \
ref_count="4294967295" is_valid="1" value_size="24"/>
<single_block type="btree_leaf" location="26267" blocknr="26267" \
ref_count="4294967295" is_valid="1" value_size="24"/>

Currently, thin_ll_dump only lists orphan nodes with value_size==8,
so the orphan device-details leaves won't be listed.
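
As mentioned in step 3 (1), the device ids can be read from LVM like this
(the same command appears later in this thread; the VG name is the one from
the original report):

  lvs -o lv_name,thin_id vgg145155121036c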


4. Run thin_ll_dump with correct root information:

./pdata_tools thin_ll_dump <metadata_file> --device-details-root=<blocknr> \
--data-mapping-root=<blocknr> [-o thin_ll_dump.xml] \
[--end=<last_utilized_block+1>]

Example:
./bin/pdata_tools thin_ll_dump server_meta.bin --device-details-root=26263 \
--data-mapping-root=22917 -o thin_ll_dump2.xml --end=26269

If the roots are correct, then the number of orphans should be less
than before.


5. Run thin_ll_restore to recover the metadata

./bin/pdata_tools thin_ll_restore -i <edited thin_ll_dump.xml> \
-E <source metadata> -o <output metadata>

Example (restore to /dev/loop0):
./bin/pdata_tools thin_ll_restore -i thin_ll_dump.xml -E server_meta.bin \
-o /dev/loop0


Advanced use of thin_ll_restore
===============================

1. Handle the case where the root is broken and you can only find some
internal or leaf nodes.

Example: all the mappings reachable from block #1234 and #5678 will
be dumped to device #1.
<superblock blocknr="0" data_mapping_root="22917" device_details_root="26263">
<device dev_id="1">
<node blocknr="1234"/>
<node blocknr="5678"/>
...
</device>
</superblock>

2. Create a new device

If the device_id cannot be found in the device details tree,
then thin_ll_dump will create a new device with default device_details values.


Please let me know if you have any questions.


Ming-Hung Tsai
Hi,
...
The output file have nearly 20000 lines and you can find it in attachment.
Thank you very much.
Mars
M.H. Tsai
2016-02-23 12:12:04 UTC
Permalink
The original post asks what to do if the superblock is broken (his
superblock was accidentally wiped). Since I don't have time to update the
program at the moment, here's my workaround:

1. Partially rebuild the superblock

(1) Obtain the pool parameters from LVM

./sbin/lvm lvs vg1/tp1 -o transaction_id,chunksize,lv_size --units s

sample output:
Tran Chunk LSize
3545 128S 7999381504S

The number of data blocks is $((7999381504/128)) = 62495168

(2) Create input.xml with pool parameters obtained from LVM:

<superblock uuid="" time="0" transaction="3545"
data_block_size="128" nr_data_blocks="62495168">
</superblock>

(3) Run thin_restore to generate a temporary metadata image with a correct superblock

dd if=/dev/zero of=/tmp/test.bin bs=1M count=16
thin_restore -i input.xml -o /tmp/test.bin

The size of /tmp/test.bin depends on your pool size.

(4) Copy the partially rebuilt superblock (4KB) to your broken metadata
(<src_metadata>):

dd if=/tmp/test.bin of=<src_metadata> bs=4k count=1 conv=notrunc

2. Run thin_ll_dump and thin_ll_restore
https://www.redhat.com/archives/linux-lvm/2016-February/msg00038.html

Example: assume that we found data-mapping-root=2303
and device-details-root=277313

./pdata_tools thin_ll_dump <src_metadata> --data-mapping-root=2303 \
--device-details-root 277313 -o thin_ll_dump.txt

./pdata_tools thin_ll_restore -E <src_metadata> -i thin_ll_dump.txt \
-o <dst_metadata>

Note that <dst_metadata> should be sufficiently large, especially when you
have snapshots, since the mapping trees reconstructed by the thin tools do
not share blocks.
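
One way to pre-allocate a destination that is certainly large enough is to
zero a file, as done for /tmp/test.bin above; a sketch (the 16 GiB count is
an upper bound, since dm-thin metadata cannot exceed roughly 16 GiB):

  dd if=/dev/zero of=dst_metadata.bin bs=1M count=16384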

3. Fix superblock's time field

(1) Run thin_dump on the repaired metadata

thin_dump <dst_metadata> -o thin_dump.txt

(2) Find the maximum time value in the data mapping trees
(the device with the maximum snap_time might have been removed, so find the
maximum time in the data mapping trees, not in the device details tree;
a numeric variant of the command is sketched after the sample output below)

grep "time=\"[0-9]*\"" thin_dump.txt -o | uniq | sort | uniq | tail

(I run uniq twice to avoid sorting too much data)

sample output:
...
time="1785"
time="1786"
time="1787"

so the maximum time is 1787.
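
If the time values ever span different digit counts, the lexical sort above
could pick the wrong line; a numeric variant (a sketch of the same idea):

  grep -o 'time="[0-9]*"' thin_dump.txt | tr -dc '0-9\n' | sort -n | tail -1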

(3) Edit the "time" value of the <superblock> tag in thin_dump's output

<superblock uuid="" time="1787" ... >
...

(4) Run thin_restore to get the final metadata

thin_restore -i thin_dump.txt -o <dst_metadata>


Ming-Hung Tsai

M.H. Tsai
2016-02-18 14:22:41 UTC
Permalink
Hi,
<superblock>
<device dev_id="7050">
</device>
<device dev_id="7051">
</device>
...
</superblock>
<orphans>
<node blocknr="22496" flags="2" key_begin="0" key_end="128" nr_entries="126"/>
<node blocknr="17422" flags="2" key_begin="0" key_end="128" nr_entries="126"/>
<node blocknr="23751" flags="2" key_begin="0" key_end="2175" nr_entries="126"/>
...
<node blocknr="26257" flags="2" key_begin="7972758" key_end="50331647" nr_entries="242"/>
</orphans>
The output file have nearly 20000 lines and you can find it in attachment.
Looks strange. How many thin volumes do you have? The top-level
mapping tree contains 208 keys, so it might point to a wrong location.
Also, no mapped values were output; I'm not sure whether that is a bug...

1. Please run lvs to show the device ids:
lvs -o lv_name,thin_id
Then try to find the orphan nodes whose key ranges contain the device ids.
Those could be the real top-level tree nodes.

2. What's your pool chunk size?
lvs vgg145155121036c/pool_nas -o chunksize

3. Could you please provide your RAW metadata for me to debug? I want
to know why the output went wrong...
You don't need to dump the entire 16GB metadata:

(1) Please run thin_scan to know the metadata utilization (do not rely
on the metadata space map)

./pdata_tools thin_scan /dev/mapper/vgg145155121036c-pool_nas_tmeta0

You don't need to wait for it to finish scanning. Press Ctrl-C to stop the
program when it has been stuck for some minutes. The last line shows the
last utilized metadata block. For example:

...
<single_block type="btree_leaf" location="234518" blocknr="234518"
ref_count="0" is_valid="1" value_size="4"/>
<single_block type="btree_leaf" location="234519" blocknr="234519"
ref_count="0" is_valid="1" value_size="32"/>
<single_block type="index_block" location="234520" blocknr="234520"
ref_count="0" is_valid="1"/>
(the program gets stuck here; break out of it)

Then block#234520 is the last utilized metadata block. Usually it is
an index_block.

(2) dump & compress the used part. Send me the file if you can.
dd if=/dev/mapper/vgg145155121036c-pool_nas_tmeta0 of=tmeta.bin bs=4K
count=$((234520+1))
tar -czvf tmeta.tar.gz tmeta.bin


Thanks,
Ming-Hung Tsai