{"id":128,"date":"2015-07-01T23:41:34","date_gmt":"2015-07-01T21:41:34","guid":{"rendered":"https:\/\/blog.unetresgrossebite.com\/?p=128"},"modified":"2015-07-01T23:44:08","modified_gmt":"2015-07-01T21:44:08","slug":"ceph-disk-failure","status":"publish","type":"post","link":"https:\/\/blog.unetresgrossebite.com\/?p=128","title":{"rendered":"Ceph Disk Failure"},"content":{"rendered":"<p>Last week, one of my five Ceph hosts was unreachable.<br \/>\nInvestigating, I noticed the OSD daemons were still running. Only the daemons using the root file system had either crashed (the local Ceph MON daemon) or were unable to process requests (the SSH daemon was still answering, then cleanly closing the connection).<\/p>\n<p>After rebooting the system and looking at the logs, I could see a lot of I\/O errors. I left the console logged in as root, waiting for the next occurrence.<br \/>\nHaving no spare 60GB SSD, I ordered one.<\/p>\n<p>Two days later, the same problem occurred. From the console, I was unable to run anything (mostly segfaults and ENOENT).<br \/>\nAgain, I was able to reboot. This time, I dropped a couple of LVM logical volumes, disabled the swap partition, and resized my VG to make sure I had a fair amount of unallocated space on my faulty disk.<\/p>\n<p>The problem persisted, and the average uptime kept getting significantly shorter.<br \/>\nI progressively disabled the local OSSEC daemon, puppet, a few crontabs, collectd, munin, &#8230; only keeping ceph, nagios and ssh running. 
The problem kept happening every 12 to 48 hours.<\/p>\n<p>This morning, the server wasn&#8217;t even able to boot.<br \/>\nChecking the BIOS, I saw that my root SSD wasn&#8217;t detected.<br \/>\nAfter attaching it to a USB dock, I had to wait a couple of minutes before the disk was actually detected by my laptop (Ubuntu 14.04.02) and my desktop (Debian 7.8).<br \/>\nI caught a break when receiving my new disk at 11 AM.<br \/>\nRunning dd from the faulty disk to the new one took around 50 minutes (20 MB\/s, I can&#8217;t believe it!).<br \/>\nSyncing (1x512G SSD, 2x4T &amp; 1x3T HDD) after 8 hours of downtime took around half an hour. Knowing I run a fairly busy mail server, some NNTP index, &#8230;, this is a new tangible improvement brought by Hammer over Firefly.<\/p>\n<p>I&#8217;m now preparing to send the faulty disk to my reseller for replacement. At least I will have one handy for the next failure.<\/p>\n<p>Moral: cheap is unreasonable. Better be lucky.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last week, one of my five Ceph hosts was unreachable. Investigating, I noticed the OSD daemons were still running. Only the daemons using the root file system had either crashed (the local Ceph MON daemon) or were unable to process requests (the SSH daemon was still answering, then cleanly closing the connection). 
After rebooting the system and looking at logs, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[8,5,2],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/posts\/128"}],"collection":[{"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=128"}],"version-history":[{"count":3,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/posts\/128\/revisions"}],"predecessor-version":[{"id":131,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/posts\/128\/revisions\/131"}],"wp:attachment":[{"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=128"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=128"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=128"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}