
Ceph

A long story binds me to Ceph (partially documented on my wiki).
Mostly at Smile, but also at home, I have tested versions from 0.47 to what is now 0.87.
By Christmas 2014, I bought myself five ProLiant MicroServers, trying to allocate dedicated physical resources to manage virtual file systems.
Each node contains a 60GB SSD, used for the root filesystem as well as the OSD journals, one 512GB SSD for "fast" filesystems, and three disks from 1 to 4TB filling the remaining slots.

Almost a month later, while I was at work, one of my nodes stopped answering. Immediately, the cluster tried to recover the degraded placement groups by duplicating the remaining replicas to whatever free space was left.
At some point, the recovery process filled a disk up to its limit. The cluster was now missing space.
When I realized the problem, I left work early, rushed home and rebooted the failing server.
Even with the missing disks recovered, the cluster remained in a degraded state, because of the disk that had filled up earlier. The daemon managing this disk was unable to start, the disk being too full. So I ended up dropping its content and reformatting the disk, hoping I would still have another replica of whatever I was destroying.
On the bright side, the cluster started re-balancing itself, and I could finally restart my VMs, … Retrospectively, I don’t see how else I could have gotten it back up.
Meanwhile, I actually did lose data in the process: one placement group remained degraded.

The cluster being unusable, and yet storing data of some relevance for my personal use, I ended up creating a new pool (implying: new placement groups) and re-creating my cloud storage space from scratch.

After two weeks on the Ceph IRC channel, I found one person with the same problem, and no one with an actual solution.
I vaguely heard of the possibility to `cat' files from OSD partitions, to reconstruct an image file matching the one presented by Ceph. Pending further investigation, …

And here we are, a few months later.
The situation is getting worse every day.
I’m finally investigating.

Basically, you have to look at the rbd image info to retrieve a string contained in the names of all the files holding your data:

moros:~# rbd -p disaster info one-70
rbd image 'one-70':
size 195 GB in 50001 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.66d52ae8944a
format: 2
features: layering

From there, you’ll use a script to list all files you actually need to reconstruct your disk.

In my case, the script creates a directory $disk_name, and stores one file per physical host, listing the corresponding blocks; a sketch of what such a script could look like follows.
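This is a minimal sketch under stated assumptions, not the original script: the five hosts are assumed reachable over SSH as node1 to node5, OSD data is assumed to live under the standard /var/lib/ceph/osd/ path, and the prefix is the block_name_prefix obtained above.

#!/bin/sh
# Sketch: for each OSD host, list the on-disk files whose names contain
# the image's block_name_prefix, one output file per host.
PREFIX=66d52ae8944a            # from rbd info: block_name_prefix
DISK=one-70                    # directory named after the disk
mkdir -p "$DISK"
for host in node1 node2 node3 node4 node5; do   # assumed host names
    ssh "$host" "find /var/lib/ceph/osd/ -name '*${PREFIX}*'" > "$DISK/$host"
done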

Then, you’ll use a script to gather all these files into a single directory, for further processing.
Note that the file names all contain a backslash; therefore our script needs to connect to each OSD host and, from there, run some scp command to send the file to the designated destination. Such a mess requires exchanging SSH keys, … be warned (see the sketch below).
Another way to do this may be to share your OSD roots, most likely using NFS, and mount them on the host reconstructing the data.
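Here is a matching sketch of the collection step, again under assumptions: it consumes the per-host lists produced above, and moros:/mnt/restore is a hypothetical destination on the reconstructing host.

#!/bin/sh
# Sketch: run scp *from* each OSD host, so the backslash-containing
# file names never have to transit through our local shell.
DISK=one-70
DEST="moros:/mnt/restore/$DISK"   # assumed destination; needs SSH keys
for list in "$DISK"/*; do
    host=$(basename "$list")
    while IFS= read -r f; do
        ssh "$host" "scp '$f' '$DEST'"
    done < "$list"
done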

Finally, we can follow Sebastien Han’s procedure, relying on rbd_restore.

You should end up with a disk image file.
If your disks all contain partitions, which is my case, … then run fdisk -l against the obtained file.
Get the offset at which the partition you would like to recover starts, and the sector (block) size as reported by fdisk; then run dd if=your_image bs=sector_size skip=start_offset of=partition_image.
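As an illustration only; the image name, sector size and start offset below are typical made-up values, not figures from my recovery:

moros:~# fdisk -l one-70.img
(reports, say, 512-byte sectors and a first partition starting at sector 2048)
moros:~# dd if=one-70.img of=one-70-part1.img bs=512 skip=2048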
Run fsck on the obtained image.
If you see complaints about bad geometry: block count xx exceeds size of device (yy blocks), then fire up resize2fs.
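For an ext filesystem, a plausible sequence would be the following; the file name is hypothetical, and resize2fs, run without a size argument, simply fits the filesystem to the size of its container:

moros:~# fsck.ext4 -f one-70-part1.img
moros:~# resize2fs one-70-part1.img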
Assuming XFS, if your dmesg warns about an attempt to access beyond end of device, look for a want=XXX, limit=YYY line to deduce the amount of space missing from your disk; then dd if=/dev/zero of=restored_image seek=${current size + length to add, in MB} obs=1M count=0 should append zeroes to your image, allowing you to mount your disk.
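For instance, assuming the image should weigh 200000 MB once the space deduced from want/limit is added back (a made-up figure): with count=0, dd copies nothing and merely sets the file size to seek * obs bytes, padding with sparse zeroes.

moros:~# dd if=/dev/zero of=restored_image seek=200000 obs=1M count=0
moros:~# mount -o loop restored_image /mnt/recovered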
An exhaustive log is available there.

eCryptfs

You may be familiar with eCryptfs, a disk encryption solution, especially known for being shipped with default Ubuntu installations as the underlying implementation of their ‘Encrypted Home’.

From what I gathered through empirical experience, I would say eCryptfs, AKA the Enterprise Cryptographic Filesystem, should be avoided in both enterprise and personal use cases.

My first surprise came while crawling a site for its records: creating a file per author, I ended up with more than 5000 files in a single directory. At which point, your dmesg should show several lines like 'Valid eCryptfs header not found in file header region or xattr region, inode XXX', then a few 'Watchdog[..]: segfault at 0 ip $addr1 sp $addr2 error 6 in libcontent.so[$addr3]'.

The second surprise came a few days later, while recovering RBD images from my Ceph cluster. Storing all the parts of a disk in a single directory, I again ended up with folders holding several thousands of files, and my dmesg going mad.
You will notice that moving the folder outside of your eCryptfs directory fixes the problem.

Of course, most users won’t recover their Ceph cluster nor crawl a site for its complete database. However, configuring Thunderbird/Icedove, you may end up caching enough mail from your IMAP/POP server to reach the limits I did.

This is not my first experience with cryptographic solutions.
Once upon a time, TrueCrypt was offering a very exhaustive toolkit, managing everything from devices to files; so much so that my end-of-studies project was about forking TrueCrypt, adding features the maintainer did not want to see in his product (BootTruster).
On today’s Linux systems, a more classic way to do it would be to use LUKS (Linux Unified Key Setup), based on a standardized device-mapper target: dm-crypt.
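A minimal sketch of the LUKS workflow, assuming a spare partition /dev/sdb1 and a mapping name of my choosing:

moros:~# cryptsetup luksFormat /dev/sdb1
moros:~# cryptsetup luksOpen /dev/sdb1 crypthome
moros:~# mkfs.ext4 /dev/mapper/crypthome
moros:~# mount /dev/mapper/crypthome /mnt/crypthome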

Anyway: next time you’re asked about encrypting your home file system, think twice about the solution you’re going to use. Then choose LUKS.