{"id":990,"date":"2021-10-17T10:28:31","date_gmt":"2021-10-17T08:28:31","guid":{"rendered":"https:\/\/blog.unetresgrossebite.com\/?p=990"},"modified":"2021-10-18T00:58:20","modified_gmt":"2021-10-17T22:58:20","slug":"ceph-crash-recovery","status":"publish","type":"post","link":"https:\/\/blog.unetresgrossebite.com\/?p=990","title":{"rendered":"Ceph Crash Recovery"},"content":{"rendered":"\n<p>Today we will see the steps to replace a Ceph node, including the MON, MGR, MDS and OSD services.<\/p>\n\n\n\n<p>While I would usually try to salvage the disks from a failing node, sometimes using <em>dd conv=noerror,sync iflag=fullblock<\/em> dumping a filesystem from a disk ignoring IO errors, others using a disk duplication devices: today I am unable to salvage my node filesystem, we would see the steps to re-deploy a new host from scratch.<\/p>\n\n\n\n<p> &nbsp; <\/p>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Redeploy Hypervisor<\/strong><\/p>\n\n\n\n<p>Starting from new drives, I reinstalled my physical server, using a local PXE. Using puppet, the server is then configured as a KVM hypervisor, and automount gives me access to my KVM models. I would next deploy a CentOS 8 guest, re-using the FQDN and IP address from my previous Ceph node.<\/p>\n\n\n\n<p> &nbsp; <\/p>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Ceph MON<\/strong><\/p>\n\n\n\n<p>I would start by installing EPEL and Ceph repositories &#8211; matching my Ceph release, Octopus. Update the system, as my CentOS KVM template dates back from 8.1. Reboot. Then, we would install Ceph packages:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-style-default\"><p><em>dnf install -y ceph-osd python3-ceph-common ceph-mon ceph-mgr ceph-mds ceph-base python3-cephfs<\/em><\/p><\/blockquote>\n\n\n\n<p><\/p>\n\n\n\n<p>For simplicity, we would disable SELinux and firewalling &#8211; matching my other nodes configuration, though this may not be recommended in general. 
Next, we would retrieve the Ceph configuration from a surviving MON node:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p><em>scp -rp mon1:\/etc\/ceph \/etc\/<\/em><\/p><\/blockquote>\n\n\n\n<p>Confirm that we can now query the cluster, then fetch the Ceph mon. keyring and monmap:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p><em>ceph osd tree<br>ceph auth get mon. -o \/tmp\/keyring<br>ceph mon getmap -o \/tmp\/monmap<\/em><\/p><\/blockquote>\n\n\n\n<p>Having those, you may now re-create your MON, re-using its previous ID (in my case, mon3):<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p><em>mkdir \/var\/lib\/ceph\/mon\/ceph-mon3<br>ceph-mon -i mon3 --mkfs --monmap \/tmp\/monmap --keyring \/tmp\/keyring<br>chown -R ceph:ceph \/var\/lib\/ceph\/mon\/ceph-mon3<\/em><\/p><\/blockquote>\n\n\n\n<p>Next, re-create the systemd unit starting the Ceph MON daemon:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p><em>mkdir \/etc\/systemd\/system\/ceph-mon.target.wants\/<br>cd \/etc\/systemd\/system\/ceph-mon.target.wants\/<br>ln -sf \/usr\/lib\/systemd\/system\/ceph-mon\\@.service ceph-mon\\@mon3.service<br>systemctl daemon-reload<br>systemctl start ceph-mon@mon3.service<br>journalctl -u ceph-mon@mon3.service<br>ceph -s<\/em><\/p><\/blockquote>\n\n\n\n<p>At that stage, we should see that the third monitor has re-joined our cluster.<\/p>\n\n\n\n<p> &nbsp; <\/p>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Ceph OSDs<\/strong><\/p>\n\n\n\n<p>The next step would be to re-deploy our OSDs. Note that, at this stage, we should be able to re-import our existing OSD devices, assuming they were not affected by the outage. Here, I would also proceed with fresh drives (logical volumes on my hypervisor).<\/p>\n\n\n\n<p>Let&#8217;s fetch the osd bootstrap key (could be done with ceph auth get, as for the mon. 
key above) and prepare the directories for my OSDs (IDs 1 and 7):<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p><em>scp -p mon1:\/var\/lib\/ceph\/bootstrap-osd\/ceph.keyring \/var\/lib\/ceph\/bootstrap-osd\/<br>mkdir -p \/var\/lib\/ceph\/osd\/ceph-1 \/var\/lib\/ceph\/osd\/ceph-7<br>chown -R ceph:ceph \/var\/lib\/ceph\/bootstrap-osd \/var\/lib\/ceph\/osd<\/em><\/p><\/blockquote>\n\n\n\n<p>Next, I would re-create my first OSD:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p><em>ceph osd out osd.7<br>ceph osd destroy 7 --yes-i-really-mean-it<br>ceph-volume lvm zap \/dev\/vdb<br>ceph-volume lvm prepare --osd-id 7 --data \/dev\/vdb<br>ceph-volume lvm activate --all<br>ceph osd tree<br>ceph -s<\/em><\/p><\/blockquote>\n\n\n\n<p>We should be able to confirm that OSD 7 is back up, and that Ceph is now re-balancing data. Let&#8217;s finish with the second OSD I need to recover, following the same process:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p><em>ceph osd out osd.1<br>ceph osd destroy 1 --yes-i-really-mean-it<br>ceph-volume lvm zap \/dev\/vdc<br>ceph-volume lvm prepare --osd-id 1 --data \/dev\/vdc<br>ceph-volume lvm activate --all<br>ceph osd tree<br>ceph -s<\/em><\/p><\/blockquote>\n\n\n\n<p>At that stage, the most critical part is done. 
Considering I am running on commodity hardware &#8211; VMs on old HP MicroServers with barely 6G of RAM and a couple of CPUs &#8211; I&#8217;ve left Ceph alone for something like 12 hours, to avoid wasting resources on the MDS and MGR daemons, which are not strictly mandatory.<\/p>\n\n\n\n<p> &nbsp; <\/p>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Ceph MGR<\/strong><\/p>\n\n\n\n<p>With my data pretty much all re-balanced &#8211; no more degraded or undersized PGs &#8211; I eventually redeployed the MGR service:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p><em>mkdir \/var\/lib\/ceph\/mgr\/ceph-mon3<br>ceph auth get mgr.mon3 -o \/var\/lib\/ceph\/mgr\/ceph-mon3\/keyring<br>chown -R ceph:ceph \/var\/lib\/ceph\/mgr<br>mkdir \/etc\/systemd\/system\/ceph-mgr.target.wants<br>cd \/etc\/systemd\/system\/ceph-mgr.target.wants<br>ln -sf \/usr\/lib\/systemd\/system\/ceph-mgr\\@.service ceph-mgr\\@mon3.service<br>systemctl daemon-reload<br>systemctl start ceph-mgr@mon3<br>ceph -s<\/em><\/p><\/blockquote>\n\n\n\n<p> &nbsp; <\/p>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Ceph MDS<\/strong><\/p>\n\n\n\n<p>Next, as I&#8217;m using CephFS with my Kubernetes clusters, I would need to recreate my third MDS. The process is pretty similar to that of deploying an MGR:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p><em>mkdir \/var\/lib\/ceph\/mds\/ceph-mon3<br>ceph auth get mds.mon3 -o \/var\/lib\/ceph\/mds\/ceph-mon3\/keyring<br>chown -R ceph:ceph \/var\/lib\/ceph\/mds<br>mkdir \/etc\/systemd\/system\/ceph-mds.target.wants<br>cd \/etc\/systemd\/system\/ceph-mds.target.wants<br>ln -sf \/usr\/lib\/systemd\/system\/ceph-mds\\@.service ceph-mds\\@mon3.service<br>systemctl daemon-reload<br>systemctl start ceph-mds@mon3<br>ceph -s<\/em><\/p><\/blockquote>\n\n\n\n<p> &nbsp; <\/p>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Finishing up<\/strong><\/p>\n\n\n\n<p>In my case, I&#8217;ve also installed and configured NRPE probes.<br>Note that I did not cover the topic of re-deploying RadosGW: in my case, with limited resources on my Ceph nodes, my RGW daemons are deployed as containers, in Kubernetes. 
My workloads that require S3 storage may then use local Kubernetes Services and Secrets with pre-provisioned radosgw user credentials.<\/p>\n\n\n\n<p>Obviously, I could have used <a href=\"https:\/\/github.com\/ceph\/ceph-ansible\" target=\"_blank\" rel=\"noopener\">ceph-ansible<\/a> to redeploy my node, though there&#8217;s more fun in knowing how it works behind the curtain. I would mostly use ceph-ansible when deploying clusters with customers &#8212; mine was last deployed a couple of years ago.<\/p>\n\n\n\n<p>The time it takes to go from an installed CentOS to a fully recovered Ceph node is pretty amazing. No more than one hour re-creating the Ceph daemons, and a little over 12 hours watching objects being recovered and re-balanced &#8212; for over 2TB of data \/ 5TB raw, and again: on old, low-quality hardware compared with what we would deal with IRL.<\/p>\n\n\n\n<p>From its first stable release, in the early 2010s, to the product we have today, Ceph is undeniably a success story in the Open Source ecosystem. I haven&#8217;t seen a corruption I couldn&#8217;t easily fix in years, the maintenance cost is low, and it still runs fairly well on commodity hardware. And all that despite Red Hat&#8217;s usual fuckery.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today we will see the steps to replace a Ceph node, including the MON, MGR, MDS and OSD services. 
While I would usually try to salvage the disks from a failing node, sometimes using dd conv=noerror,sync iflag=fullblock dumping a filesystem from a disk ignoring IO errors, other times using a disk duplication device: today I am [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[12,5,2],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/posts\/990"}],"collection":[{"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=990"}],"version-history":[{"count":17,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/posts\/990\/revisions"}],"predecessor-version":[{"id":1015,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/posts\/990\/revisions\/1015"}],"wp:attachment":[{"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=990"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=990"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=990"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}