Ceph

April 19, 2015

0 Comments

A long story, binds me with Ceph (partially documented on my wiki).
Mostly at Smile, but also at home, I tested versions from 0.47 to what is now 0.87.
By Christmas 2014, I bought myself 5 Prolian MicroServer, trying to allocate dedicated physical resources to manage virtual file systems.
Nodes all contains a 60G SSD, used for the root filesystem, as well as OSD journals. 1 512GB SSD disk, for “fast” filesystems, and 3 disks from 1 to 4T, filling up the left slots.

Almost a month later, while at work, one of my node stopped answering. Immediately, the cluster tried to recover degraded placement groups, by duplicating the remaining replica to some free space.
At some point, the recovery process filled up a disk until reaching its limit. The cluster was now missing space.
When I realized the problem, I left work early, rushed home and reboot the failing server.
Recovering the missing disks, the cluster remained on a degraded state, because of the filled disk from earlier. The daemon managing this disk was unable to start, because of the disk being too full. So I ended up dropping its content, reformatting the disk, hoping I would still have an other replica of whatever I was destroying.
On the bright side, the cluster started re-balancing itself, I could finally restart my VMs, … Retrospectively, I don’t see how else I could have get it back up otherwise.
Meanwhile, I actually did lost data in the process. One placement group, remaining degraded.

The cluster being unusable, and yet storing somewhat relevant data for my personal use, I ended up creating a new pool (implying: new placement groups), re-creating my cloud storage space from scratch.

After two weeks on ceph IRC, I found one person with the same problem, no one with an actual solution.
I vaguely heard of the possibility to `cat’ files from osd partitions, to reconstruct an image file matching the one presented by ceph. Pending further investigations, …

And here we are, a few months latter.
The situation is getting worse every day.
I’m finally investigating.

Basically, you have to look at rbd infos to retrieve a string, contained by all names of files holding your data:

moros:~# rbd -p disaster info one-70
rbd image ‘one-70’:
size 195 GB in 50001 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.66d52ae8944a
format: 2
features: layering

From there, you’ll use a script to list all files you actually need to reconstruct your disk.

In my case, the script creates a directory $disk_name, and stores a file per physical host, listing corresponding blocks.

Then, you’ll use a script to store all these files into a single directory, for further processing.
Note file names all contains a backslash, therefor our script would need to connect to the OSD host and then, run some scp command to send the file to designed destination. Such mess requires to exchange SSH keys, … be warned.
An other way to do this may be to share your OSD roots, most likely using NFS, and mount them on the host reconstructing data.

Finally, we can follow Sebastien Han’s procedure, relying on rbd_restore.

You should end up with a disk image file.
If your disks all contains partitions, which is my case, … then fdisk -l the obtained file.
Get the offset where starts the partition you would like to recover, and the block size. Run dd if=your_image bs=block_size skip=start_offset of=$disk_from_partition.
Run fsck on the image obtained.
If you see complaints about bad geometry: block count xx exceeds size of device (yy blocks), then fire up resize2fs.
Assuming xfs, if your dmesg tells about attempt to access beyond end of device, look for a want=XXX, limit=YYY line to deduce the amount of space missing from your disk, then using dd if=/dev/zero of=restored_image seek=${actual size + length to add in MB} obs=1M count=0 should append zeroes to your image, allowing you to mount your disk.
An exhaustive log is available there.

Categories: Ceph UTGB Services

BOFH meditations

Ceph

Leave a reply Cancel reply