Ceph Disk Failure

Last week, one of my five Ceph hosts became unreachable.
Investigating, I noticed the OSD daemons were still running. Only the daemons using the root file system had either crashed (the local Ceph MON daemon) or become unable to process requests (the SSH daemon was still answering, but cleanly closed the connection).

After rebooting the system and looking at the logs, I could see a lot of I/O errors. I left the console logged in as root, waiting for the next occurrence.
Having no spare 60GB SSD, I ordered one.
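
To get a quick idea of which device the I/O errors point to, something like the following could be used. This is only a minimal sketch: the `count_io_errors` helper and the sample lines are hypothetical, and it assumes the kernel messages contain the usual `I/O error, dev <name>, sector <n>` fragment, whose prefix varies between kernel versions.

```python
import re
from collections import Counter

# Matches the stable part of kernel block-layer messages such as
#   "end_request: I/O error, dev sda, sector 1050624"
# The prefix differs between kernel versions, so it is not matched.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def count_io_errors(lines):
    """Return a Counter mapping device name to the number of I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, not copied from the failed host.
sample = [
    "[ 1234.5] end_request: I/O error, dev sda, sector 1050624",
    "[ 1234.6] end_request: I/O error, dev sda, sector 1050632",
    "[ 1300.0] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro",
]

print(count_io_errors(sample))
```

On a real host, the same function could be fed an open log file (for instance the kernel log, wherever the distribution stores it) instead of the sample list.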

Two days later, the same problem occurred. From the console, I was unable to run anything (mostly segfaults and ENOENT).
Again, I was able to reboot. This time, I dropped a couple of logical volumes, disabled the swap partition, and resized my VG to make sure I had a fair amount of unallocated space on the faulty disk.

The problem persisted, and the average uptime kept getting shorter.
I progressively disabled the local OSSEC daemon, puppet, a few crontabs, collectd, munin, … keeping only ceph, nagios and ssh running. The problem kept happening, every 12 to 48 hours.

This morning, the server wasn’t even able to boot.
Checking the BIOS, my root SSD wasn’t detected.
Attaching it to a USB dock, I had to wait a couple of minutes before the disk was actually detected by my laptop (Ubuntu 14.04.2) and by my desktop (Debian 7.8).
I caught a break when my new disk was delivered at 11 AM.
Running dd from the faulty disk to the new one took around 50 minutes (20MB/s, I can’t believe it!).
Resyncing the host's OSDs (1x512G SSD, 2x4T and 1x3T HDD) after 8 hours of downtime took around half an hour. Given that I run a fairly busy mail server, some NNTP index, …, this is another tangible improvement brought by Hammer over Firefly.
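
As a sanity check on the dd figures, the duration and the throughput agree with each other, assuming the root SSD is the 60GB model mentioned earlier and that sizes are counted in decimal units. The `transfer_minutes` helper below is purely illustrative:

```python
def transfer_minutes(size_gb, rate_mb_per_s):
    """Minutes needed to copy size_gb gigabytes at rate_mb_per_s (decimal units)."""
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 60


# Assumption: a 60GB disk copied at a sustained 20MB/s.
print(transfer_minutes(60, 20))  # 50.0
```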

I’m now preparing to send the faulty disk to my reseller for replacement. At least I will have a spare handy for the next failure.

Moral: going cheap is unreasonable. Better be lucky.
