Redeploy Kubernetes Nodes with KubeSpray

After suffering a disk outage, I had to reinstall a KVM node from scratch, and the Kubernetes master node it was hosting.

The process for handling such an outage is fairly well documented in the KubeSpray repository, though I ran into a couple of complications, which we will discuss here.

 

Backups are for losers. This is a lab I use for R&D, nothing I can’t redeploy. We can start with step two, provisioning new machines to serve as a replacement master, and a second one that would be an ingress node. We would re-use DNS names and IP addresses formerly used by the machines that were lost, to keep things simple.

We would then edit the inventory that was used to bootstrap our cluster. Make sure the faulty master node is the last one listed, in both kube-master and etcd host groups.

Then, the doc tells you to “Move any broken etcd nodes into the broken_etcd group, make sure the etcd_member_name variable is set”. First remark here: don’t remove nodes from any group. The doc then tells you to apply a playbook against the etcd and kube-master groups, and it would make no sense for those not to include the node we’re recovering. Instead, create the broken_kube-master and broken_etcd host groups, listing only the node we’re redeploying.

I assumed the etcd_member_name variable needed to be set for the node I was recovering, and thought it would be clever to set it in a group_vars/broken_etcd.yaml, to avoid putting variables in the inventory or creating a host_vars directory. Beware: this broke the etcd configuration on all nodes. Somehow, the variable set for the broken_etcd group ended up applying to all members. If that happens, fixing the /etc/etcd.env configuration file should get you back up.
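Before running any playbook, it can help to check which hosts Ansible actually puts in the recovery groups, and which variables each host ends up with. The sketch below wraps the stock ansible-inventory command; the function names are made up for illustration, and the inventory path is simply the one used by the commands in this post.

```shell
# Sketch only: helper names are hypothetical, the inventory path is the one
# used elsewhere in this post. Override INVENTORY to point somewhere else.
INVENTORY="${INVENTORY:-inventory/my/hosts.yml}"

# Print the hosts Ansible resolves for a group (e.g. broken_etcd).
show_group() {
  ansible-inventory -i "$INVENTORY" --graph "$1"
}

# Print every variable Ansible resolves for a single host, so that an
# etcd_member_name inherited from group_vars shows up before anything runs.
show_host_vars() {
  ansible-inventory -i "$INVENTORY" --host "$1"
}
```

Running show_group broken_etcd should list only the node being redeployed, and show_host_vars against one of the healthy masters would reveal an etcd_member_name it was never meant to inherit.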

Once done preparing your inventory, we would proceed with applying the recovery playbook:

ansible-playbook -i inventory/my/hosts.yml -l etcd,kube-master -e etcd_retries=300 recover-control-plane.yml

After a short while, the recovering node was properly added back to the etcd cluster, though it was still unreachable from my other masters. Turns out the ETCD_INITIAL_ADVERTISE_PEER_URLS variable was not set, in the newly installed etcd.env. In system logs (or using netstat), you could see etcd only binding to the loopback interface. Adding the proper URL in there, then restarting etcd was enough, data got replicated pretty quickly.
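A quick way to spot this on a freshly recovered node is to check the environment file before looking at anything else. This is a minimal sketch, assuming /etc/etcd.env holds plain KEY=value lines; the has_peer_urls name is hypothetical.

```shell
# Hypothetical helper: succeeds only when the given env file defines a
# non-empty ETCD_INITIAL_ADVERTISE_PEER_URLS entry.
# Assumes plain KEY=value lines, one per line.
has_peer_urls() {
  grep -q '^ETCD_INITIAL_ADVERTISE_PEER_URLS=.' "$1"
}

# Usage on the recovered node (typically as root):
#   has_peer_urls /etc/etcd.env || echo "peer URLs missing"
#   ss -tlnp | grep etcd      # shows which addresses etcd is listening on
#   systemctl restart etcd    # once the file has been fixed
```

If the ss output only shows 127.0.0.1, the other members will not be able to reach this node on its peer port.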

The playbook crashed a little later. Etcd wasn’t really the issue: the playbook kept running while I was fixing my configurations, then failed trying to re-deploy the Ceph RBD provisioner. I didn’t want those redeployed anyway, so I went back to my group_vars and disabled every third-party integration (certmanager, ingress, ceph/cephfs, …).

At that stage, etcd was back up. Instead of re-applying the recovery playbook, I went back to the usual cluster.yml, which did the rest:

ansible-playbook -i inventory/my/hosts.yml -l etcd,kube-master cluster.yml

In the end, we can see the node was re-created, though the API still lists it as NotReady. Checking logs, calico was failing to start: the file installed in /etc/cni/net.d/10-calico.conflist didn’t look right. Checking my other nodes, I could confirm they all had the same copy, which I installed in place of the one provisioned by KubeSpray. Wait a minute, and the node is back up.

Next, as I also lost an ingress node, and to try something new, I would apply the scale.yml playbook. Note this implies the node would not be deleted from the API, and we expect its old certificates to be re-used:

ansible-playbook -i inventory/my/hosts.yml scale.yml

In the end, we can confirm that, on the node we re-deployed, /etc/ssl/etcd/ssl/ holds a pair of files dating back to the initial node deployment. Also, I’m pleased to see calico properly started up, no need to patch configurations this time.

I was still left with containers in CrashLoopBackOff, unable to contact the Kubernetes API, on the nodes I redeployed. I first restarted the corresponding Calico containers, then the other containers that were still unable to start. Nothing I wouldn’t have seen with OpenShift.
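To find what needs restarting, the pod listing can be filtered on its STATUS column. The helper below is a sketch with a made-up name; it only reads kubectl output on stdin and changes nothing by itself.

```shell
# Hypothetical filter: reads `kubectl get pods -A --no-headers` output on
# stdin and prints namespace/name for each pod whose STATUS column (4th
# field) is CrashLoopBackOff.
crashlooping_pods() {
  awk '$4 == "CrashLoopBackOff" { print $1 "/" $2 }'
}

# Usage:
#   kubectl get pods -A --no-headers | crashlooping_pods
# Deleting a listed pod lets its controller re-create it, which is one way
# to restart it:
#   kubectl -n <namespace> delete pod <name>
```

Starting with the calico pods and only then moving on to whatever is still failing follows the same order as above.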

 

From switching off my faulty KVM host to change its disk, through re-installing it using PXE and re-deploying it using Puppet, to re-deploying the Kubernetes nodes, I lost about 3 hours. Nevertheless, this was a good experience.

KubeSpray might not be perfect, though having worked with OpenShift 3, I don’t mind debugging playbooks as they run – and frankly, this went so much better than the usual OpenShift crash recovery.