Ceph Crash Recovery

Today we will see the steps to replace a Ceph node, including the MON, MGR, MDS and OSD services.

While I would usually try to salvage the disks from a failing node, sometimes using dd conv=noerror,sync iflag=fullblock to dump a filesystem from a disk while ignoring I/O errors, other times using a disk duplication device, today I am unable to salvage my node filesystem: we will go through the steps to re-deploy a new host from scratch.
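For reference, the dd invocation mentioned above could be wrapped as follows. This is only a minimal sketch, assuming GNU dd: the rescue_image function name, the 64K block size and the paths shown in the usage comment are illustrative, not taken from an actual recovery.

```shell
#!/bin/sh
# Sketch: dump a failing disk (or partition) to an image file.
# conv=noerror keeps going after read errors, conv=sync pads short or
# failed reads with zeros so offsets stay aligned, and iflag=fullblock
# (GNU dd) makes dd accumulate full input blocks before padding.
rescue_image() {
    src=$1   # source device, e.g. /dev/sdb (hypothetical)
    dst=$2   # destination image file
    dd if="$src" of="$dst" bs=64K conv=noerror,sync iflag=fullblock
}

# Usage (hypothetical paths):
#   rescue_image /dev/sdb /srv/rescue/node3-sdb.img
```

Unreadable blocks end up as zeros in the image, so a filesystem check on the copy is still to be expected afterwards.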


Redeploy Hypervisor

Starting from new drives, I reinstalled my physical server, using a local PXE. Using puppet, the server is then configured as a KVM hypervisor, and automount gives me access to my KVM models. I would next deploy a CentOS 8 guest, re-using the FQDN and IP address from my previous Ceph node.


Ceph MON

I would start by installing the EPEL and Ceph repositories, matching my Ceph release, Octopus. Update the system, as my CentOS KVM template dates back to 8.1. Reboot. Then, we would install Ceph packages:

dnf install -y ceph-osd python3-ceph-common ceph-mon ceph-mgr ceph-mds ceph-base python3-cephfs

For simplicity, we would disable SELinux and firewalling, matching my other nodes’ configuration, though this may not be recommended in general. Next, we would retrieve the Ceph configuration from a surviving MON node:

scp -rp mon1:/etc/ceph /etc/

Confirm that we can now query the cluster, and fetch the Ceph mon. keyring:

ceph osd tree
ceph auth get mon. -o /tmp/keyring
ceph mon getmap -o /tmp/monmap

Having those, you may now re-create your MON, re-using its previous ID (in my case, mon3)

mkdir /var/lib/ceph/mon/ceph-mon3
ceph-mon -i mon3 --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-mon3

Next, re-create the systemd unit starting Ceph MON daemon:

mkdir -p /etc/systemd/system/
cd /etc/systemd/system/
ln -sf /usr/lib/systemd/system/ceph-mon\@.service ceph-mon\@mon3.service
systemctl daemon-reload
systemctl start ceph-mon@mon3.service
journalctl -u ceph-mon@mon3.service
ceph -s

At that stage, we should be able to see that the third monitor has re-joined our cluster.
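Beyond reading the ceph -s output, the quorum membership could be checked from a script. The following is a rough sketch, assuming the JSON returned by ceph quorum_status exposes a quorum_names array; the mon_in_quorum helper is a name made up for this example.

```shell
#!/bin/sh
# Sketch: succeed when the given monitor name is listed in the quorum.
# Relies on the "quorum_names" array from `ceph quorum_status -f json`.
mon_in_quorum() {
    mon=$1
    ceph quorum_status -f json \
        | tr -d ' \n' \
        | grep -o '"quorum_names":\[[^]]*]' \
        | grep -q "\"${mon}\""
}

# Usage:
#   mon_in_quorum mon3 && echo "mon3 is back in quorum"
```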


Ceph OSDs

The next step would be to re-deploy our OSDs. Note that, at that stage, we should be able to re-import our existing OSD devices, assuming they were not affected by the outage. Here, I would also proceed with fresh drives (logical volumes on my hypervisor).

Let’s fetch the osd bootstrap key (could be done with ceph auth get, as for the mon. key above) and prepare the directories for my OSDs (IDs 1 and 7)

scp -p mon1:/var/lib/ceph/bootstrap-osd/ceph.keyring /var/lib/ceph/bootstrap-osd/
mkdir -p /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-7
chown -R ceph:ceph /var/lib/ceph/bootstrap-osd /var/lib/ceph/osd

Next, I would re-create my first OSD:

ceph osd out osd.7
ceph osd destroy 7
ceph osd destroy 7 --yes-i-really-mean-it
ceph-volume lvm zap /dev/vdb
ceph-volume lvm prepare --osd-id 7 --data /dev/vdb
ceph-volume lvm activate --all
ceph osd tree
ceph -s

We should be able to confirm the OSD 7 is back up, and that Ceph is now re-balancing data. Let’s finish with the second OSD I need to recover, following the same process:

ceph osd out osd.1
ceph osd destroy 1
ceph osd destroy 1 --yes-i-really-mean-it
ceph-volume lvm zap /dev/vdc
ceph-volume lvm prepare --osd-id 1 --data /dev/vdc
ceph-volume lvm activate --all
ceph osd tree
ceph -s

At that stage, the most critical part is done. Considering I am running on commodity hardware, VMs on old HP Micro Servers with barely 6G RAM and a couple CPUs, I’ve left Ceph alone for something like 12 hours, to avoid wasting resources with MDS and MGR daemons that are not strictly mandatory.
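Rather than checking on the cluster by hand during those hours, the wait could be scripted. Here is a minimal sketch polling ceph health until it reports HEALTH_OK; the wait_for_health_ok name and its default retry count and delay are assumptions picked for this example.

```shell
#!/bin/sh
# Sketch: poll `ceph health` until the cluster reports HEALTH_OK.
# $1: maximum number of attempts (default 720)
# $2: delay between attempts, in seconds (default 60)
wait_for_health_ok() {
    tries=${1:-720}
    delay=${2:-60}
    i=0
    while [ "$i" -lt "$tries" ]; do
        status=$(ceph health 2>/dev/null)
        case "$status" in
            HEALTH_OK*) return 0 ;;
        esac
        i=$((i + 1))
        sleep "$delay"
    done
    # Gave up: the cluster never reached HEALTH_OK in time
    return 1
}

# Usage:
#   wait_for_health_ok && echo "recovery complete"
```

With the defaults, this gives up after roughly 12 hours, which matches the recovery time seen here.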


Ceph MGR

My data being pretty much all re-balanced, no more degraded or undersized PGs, I eventually redeployed the MGR service:

mkdir /var/lib/ceph/mgr/ceph-mon3
ceph auth get mgr.mon3 -o /var/lib/ceph/mgr/ceph-mon3/keyring
chown -R ceph:ceph /var/lib/ceph/mgr
ln -sf /usr/lib/systemd/system/ceph-mgr\@.service /etc/systemd/system/ceph-mgr\@mon3.service
systemctl daemon-reload
systemctl start ceph-mgr@mon3
ceph -s


Ceph MDS

Next, as I’m using CephFS with my Kubernetes clusters, I would need to recreate my third MDS. The process is pretty similar to that of deploying a MGR:

mkdir /var/lib/ceph/mds/ceph-mon3
ceph auth get mds.mon3 -o /var/lib/ceph/mds/ceph-mon3/keyring
chown -R ceph:ceph /var/lib/ceph/mds
ln -sf /usr/lib/systemd/system/ceph-mds\@.service /etc/systemd/system/ceph-mds\@mon3.service
systemctl daemon-reload
systemctl start ceph-mds@mon3
ceph -s


Finishing up

In my case, I’ve also installed and configured NRPE probes.
Note that I did not cover the topic of re-deploying RadosGW: in my case, with limited resources on my Ceph nodes, my RGW daemons are deployed as containers, in Kubernetes. My workloads that require s3 storage may then use Kubernetes local Services and Secrets with pre-provisioned radosgw user credentials.

Obviously, I could have used ceph-ansible to redeploy my node, though there’s more fun in knowing how it works behind the curtain. I mostly use ceph-ansible when deploying clusters for customers; mine was last deployed a couple of years ago.

The time going from an installed CentOS to a fully recovered Ceph node is pretty amazing. No more than one hour re-creating Ceph daemons, a little over 12h watching objects being recovered and re-balanced, for over 2 TB of data / 5 TB raw, and again: old and low quality hardware, when compared with what we would deal with IRL.

From its first stable release, in the early 2010s, to the product we have today, Ceph is undeniably a success story in the Open Source ecosystem. I haven’t seen a corruption I couldn’t easily fix in years; maintenance costs are low, and it still runs fairly well on commodity hardware. And all that despite RedHat’s usual fuckery.


Today we would take a short break from Kubernetes, and try out another container orchestration solution: Mesos.

Mesos is based on a research project, first presented in 2009 under the name of Nexus. Its version 1 was announced in 2016 by the Apache Software Foundation. In April 2021, a vote concluded Mesos should be moved to the Apache Attic, suggesting it reached end of life, though that vote was later cancelled, due to increased interest.


A Mesos cluster is based on at least three components: a Zookeeper server (or cluster), a set of Mesos Masters, and the Mesos Slave agent, which runs on all nodes in our cluster. These could be compared to Kubernetes etcd, control plane and kubelet respectively.

Setting it up, we would install the Mesos repository (on CentOS 7), then install the mesos package on all nodes, as well as mesosphere-zookeeper on the Zookeeper nodes.

Make sure /etc/mesos/zk points to your Zookeeper node(s).
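The content of that file is a single zk:// URL listing every Zookeeper node. As a small sketch, the helper below builds it from a list of hosts; the mesos_zk_url name and the host names are made up, while port 2181 and the /mesos path are assumed to be the usual defaults.

```shell
#!/bin/sh
# Sketch: build the URL expected in /etc/mesos/zk from a list of
# Zookeeper hosts, assuming port 2181 and the /mesos znode.
mesos_zk_url() {
    url="zk://"
    sep=""
    for host in "$@"; do
        url="${url}${sep}${host}:2181"
        sep=","
    done
    printf '%s/mesos\n' "$url"
}

# Usage (hypothetical host names):
#   mesos_zk_url zk1.example.com zk2.example.com zk3.example.com >/etc/mesos/zk
```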

Open ports 2181/tcp on your Zookeeper nodes, port 5050/tcp on master nodes, and 5051/tcp on agents.

Start zookeeper, mesos-master and mesos-slave services.

We should be able to connect to our Master console, on port 5050, and confirm our agents have properly registered.







With Mesos, we are likely to first deploy some frameworks, that would then orchestrate our applications deployment.

Frameworks would take care of scheduling workloads on your behalf, using Mesos resources. The first one we should look into is Marathon, which could be compared to Kubernetes Deployment and StatefulSets controller, in that it would ensure an application is “always on”.

We may deploy Marathon on a separate node, or to an existing Mesos instance, installing the marathon package from the same repository we installed Mesos from. Note that CentOS7 default JRE seems to be an issue running Marathon latest releases: you may want to install marathon-1.4 instead.

Marathon listens on port 8080/tcp, which should be opened as well. Having done so, we may connect to the Marathon UI to deploy our applications, or POST JSON objects to its API.


Now, we could look into what makes Mesos interesting, when compared with Kubernetes.

You may know Mesos can be used to deploy Docker containers, which would be done by posting an application definition to the Marathon API.
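Such a definition could look like the following sketch. The application id, image, resources and Marathon host name are all illustrative, and the network fields follow the layout accepted by Marathon releases of that era (newer ones may expect a different structure).

```shell
#!/bin/sh
# Sketch: print a Marathon app definition running nginx with the Docker
# containerizer. Id, image and resource values are illustrative.
marathon_docker_app() {
    cat <<'EOF'
{
  "id": "/nginx-docker",
  "cpus": 0.25,
  "mem": 128,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "nginx:stable",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 80, "hostPort": 0, "protocol": "tcp" }
      ]
    }
  }
}
EOF
}

# Usage, assuming Marathon answers on marathon.example.com:8080:
#   marathon_docker_app | curl -X POST -H 'Content-Type: application/json' \
#       -d @- http://marathon.example.com:8080/v2/apps
```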

Though the above relies on the Docker runtime, which is not mandatory, using Mesos.

Indeed, we could describe the same deployment using the Mesos containerizer instead.
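A sketch of that variant follows, with the same illustrative id and image as one might use for a Docker deployment. It assumes the agents were started with an image provider for Docker images (something like --image_providers=docker along with the filesystem/linux and docker/runtime isolators), which is not covered here.

```shell
#!/bin/sh
# Sketch: print a Marathon app definition running the same image through
# the Mesos containerizer. Values are illustrative.
marathon_mesos_app() {
    cat <<'EOF'
{
  "id": "/nginx-mesos",
  "cpus": 0.25,
  "mem": 128,
  "instances": 1,
  "container": {
    "type": "MESOS",
    "docker": {
      "image": "nginx:stable"
    }
  }
}
EOF
}

# Usage, assuming Marathon answers on marathon.example.com:8080:
#   marathon_mesos_app | curl -X POST -H 'Content-Type: application/json' \
#       -d @- http://marathon.example.com:8080/v2/apps
```

The only difference in the container section is its type: the image is still referenced in a docker block, while no Docker daemon gets involved on the agent.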

Mesos comes with its own containerizer, able to start applications from OCI images.

We could also note that Mesos may be used to manage services based on a binary, or whatever asset we may pull out of an HTTP repository, or that is already present on our nodes’ filesystem. To demonstrate this, we could start some Airsonic application, posting its definition to Marathon.
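As a rough sketch of such a definition: the repository URL below is a placeholder, the memory figure is a guess, and it assumes a Java runtime is already installed on the agents. Marathon fetches the asset into the task sandbox, then runs the command, passing the port it allocated as $PORT0.

```shell
#!/bin/sh
# Sketch: print a Marathon app definition starting Airsonic from a WAR
# pulled over HTTP, without any container image.
# The URL is a placeholder and the resource values are guesses.
marathon_airsonic_app() {
    cat <<'EOF'
{
  "id": "/airsonic",
  "cpus": 1,
  "mem": 1024,
  "instances": 1,
  "cmd": "java -Dserver.port=$PORT0 -jar airsonic.war",
  "fetch": [
    { "uri": "https://repo.example.com/airsonic/airsonic.war", "extract": false }
  ],
  "portDefinitions": [
    { "port": 0, "protocol": "tcp", "name": "http" }
  ]
}
EOF
}

# Usage, assuming Marathon answers on marathon.example.com:8080:
#   marathon_airsonic_app | curl -X POST -H 'Content-Type: application/json' \
#       -d @- http://marathon.example.com:8080/v2/apps
```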

Going further, we should look into other frameworks, such as Chronos – somewhat comparable to Kubernetes Jobs controller.

PID / IPC / network / filesystem / … isolations may be implemented using isolators. Though a lot of features we would usually find in Kubernetes would here be optional.

Nevertheless, deployments are pretty fast considering the few resources I could allocate to that lab. With the vast majority of container orchestration solutions being based on the same projects nowadays, it’s nice to see a contestant with its own original take, even though Mesos probably suffers from limited interest and contributions when compared with Kubernetes.

Kubernetes Cluster Upgrade with Kubespray

Last year, I deployed a Kubernetes cluster using Kubespray, after giving OpenShift 4 a try.

I deployed version 1.18.3, which is reaching EOL, and am now looking into upgrading it.
The Kubespray documentation is pretty straightforward: iterate over their releases, one after the other, re-applying the upgrade playbook, and eventually forcing the Kubernetes version.

$ cd /path/to/kubespray
$ git status
On branch master
$ git pull
$ git tag

My first issue being that I did not use a Kubespray release to deploy my cluster: I just cloned their repository and went with their latest master, which worked perfectly fine, a testament to the stability of their code.

Looking at existing tags in their repository, I tried to figure out which was the closest to the one I had been using when deploying that cluster. My Kubernetes version, 1.18.3, sits somewhere in between Kubespray v2.13 and v2.14.0.

$ git diff 6bc60e021e39b049ec7135bd4cfb4adfce44d1f7..v2.13.3
$ git diff 6bc60e021e39b049ec7135bd4cfb4adfce44d1f7..v2.13.4
$ git diff 6bc60e021e39b049ec7135bd4cfb4adfce44d1f7..v2.14.0
$ git diff 6bc60e021e39b049ec7135bd4cfb4adfce44d1f7..v2.14.1

I decided to start from v2.14.0, whose default Kubernetes version is 1.18.8.

Kubespray upgrades should be applied one after the other.
According to their doc, one shouldn’t skip any tag, though checking their diffs, it looks like we may skip patch releases.

First, we would check the changes in the Kubespray sample inventory, figuring out which variables need to be added, removed or changed in our cluster inventory:

$ git diff 6bc60e021e39b049ec7135bd4cfb4adfce44d1f7..v2.14.0 inventory/sample
$ vi inventory/mycluster/group_vars/all/all.yaml
$ vi inventory/mycluster/group_vars/k8s-cluster/addons.yaml
$ vi inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yaml

Choosing the target Kubernetes version, make sure that it is handled by Kubespray.
Check for the kubelet_checksums and crictl_versions arrays in roles/download/defaults/main.yml.
When in doubt, stick with the one configured in their sample inventory.
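That check could be scripted along these lines. This is a minimal sketch: kubespray_supports_version is a made-up helper name, and it assumes checksum entries are written as an indented "vX.Y.Z: checksum" line in the download role defaults file.

```shell
#!/bin/sh
# Sketch: succeed when the given Kubernetes version has an entry in the
# Kubespray download role defaults (e.g. under kubelet_checksums).
# $1: path to the download role defaults file
# $2: target version, e.g. v1.18.8
kubespray_supports_version() {
    defaults=$1
    version=$2
    # Entries are expected to look like "    v1.18.8: <sha256>"
    grep -qF " ${version}:" "$defaults"
}

# Usage, from the Kubespray checkout:
#   kubespray_supports_version <download role defaults file> v1.18.8 \
#       || echo "version not handled by this Kubespray tag"
```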

Once our inventory is ready, we could take some time to make sure our cluster is in a healthy state.
If there’s any deployment that can be shut down or replica count that can be lowered, do so: any workload that can be temporarily removed would shorten your upgrade time.
We could also check disk usage, clean things up, and make sure there is no trouble updating repositories or pulling images.
In my case, I would also check my Ceph cluster, serving persistent volumes for Kubernetes, to ensure that all services are up and that there’s no risk of some volume getting stuck at some point, waiting for I/O.

Eventually, we may start applying our upgrade:

$ ansible-playbook -i ./inventory/mycluster/hosts.yaml ./upgrade-cluster.yml \
    | tee -a upgrade-$(date +%s)-from-1.18.3-to-1.18.8.log

We would see Kubespray start by checking cluster variables, eventually pre-pulling and caching assets and container images on our nodes.
Once those checks are done, it would upgrade the etcd cluster, all at once. This is done pretty quickly, and the Kubernetes API did not seem to suffer from it.

Next, it would upgrade Kubernetes API services on the first master node: setting it unschedulable if it was not already, draining it, upgrading Kubelet and the container runtime, making sure proper kernel modules are loaded, then starting the new API, scheduler and controller Pods.
After upgrading the first master, parts of the Kubespray apps are redeployed, installing the latest CSI & RBAC configurations. Then, the other masters are upgraded, one after the other.

Once all masters are up-to-date, Kubespray would upgrade the cluster SDN – in my case, Calico.
This goes pretty fast, the playbook applies changes on two nodes at once, I didn’t have much time to check for side effects – all in all, I didn’t see my apps suffer at that stage.

We’re now done with the most critical parts, and left with those that will affect availability for our hosted applications.

Kubespray would then go, one node after the other: cordon, drain, update runtime, restart services, uncordon.
The draining part can take a long time, depending on your nodes sizes and overall usage.

Every 10.0s: kubectl get nodes
NAME       STATUS                     ROLES    AGE    VERSION
compute1   Ready                      worker   314d   v1.18.8
compute2   Ready                      worker   314d   v1.18.8
compute3   Ready                      worker   314d   v1.18.8
compute4   Ready                      worker   314d   v1.18.8
infra1     Ready                      infra    314d   v1.18.9
infra2     Ready                      infra    314d   v1.18.9
infra3     Ready,SchedulingDisabled   infra    314d   v1.18.8
master1    Ready                      master   314d   v1.18.9
master2    Ready                      master   139d   v1.18.9
master3    Ready                      master   314d   v1.18.9

We could see failures – I did have the upgrade playbook crash once, due to one node drain step timing out, which led me to find a PodDisruptionBudget, preventing one Pod from being re-scheduled (KubeVirt). In such case, we may fix the issue, then re-apply the upgrade playbook – which would be a bit faster, though would still go through all steps that already completed.
To avoid these, I then connected to each node during its drain phase, and made sure there were no Pods stuck in a Terminating state, or others left Running while the drain operation should be shutting them down.
Also note that re-applying the upgrade playbook, we could speed things up on nodes that were already processed by setting them unschedulable – in which case, the drain is skipped, container runtime and kubelet would still be restarted, with little to no effect on those workloads.
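Part of those checks may also be done from the API side. The sketch below lists Pods stuck in a Terminating state on a given node; terminating_pods_on is a made-up helper name, and the column position assumes the default kubectl output when listing all namespaces.

```shell
#!/bin/sh
# Sketch: print namespace/name for Pods reported as Terminating on a node.
# With --all-namespaces, the columns are:
#   NAMESPACE NAME READY STATUS RESTARTS AGE
terminating_pods_on() {
    node=$1
    kubectl get pods --all-namespaces --no-headers \
        --field-selector "spec.nodeName=${node}" \
        | awk '$4 == "Terminating" { print $1 "/" $2 }'
}

# Usage:
#   terminating_pods_on compute1
# PodDisruptionBudgets that could block a drain can be listed with:
#   kubectl get pdb --all-namespaces
```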

Once all nodes are up to date, Kubespray would go through its Apps once again, applying the latest metrics server, ingress controller, registry, depending on which ones you’ve enabled. In my case, I disabled most of those from my inventory prior to upgrading, to skip those steps.

In about three hours, I was done with my first upgrade (10x 16G nodes cluster, overloaded). I could start over, from the next tag:

$ git checkout v2.14.1
$ git diff v2.14.0..v2.14.1 inventory/sample
$ vi inventory/mycluster/xxx
$ ansible-playbook -i ./inventory/mycluster/hosts.yaml ./upgrade-cluster.yml \
    | tee -a upgrade-$(date +%s)-from-1.18.8-to-1.18.9.log

This one went faster, a little under 2 hours. No error.
And the next ones:

$ git checkout v2.14.2
$ git diff v2.14.1..v2.14.2 inventory/sample
$ vi inventory/mycluster/xxx
$ ansible-playbook -i ./inventory/mycluster/hosts.yaml ./upgrade-cluster.yml \
    | tee -a upgrade-$(date +%s)-from-1.18.9-to-1.18.10.log
$ git checkout v2.15.0
$ git diff v2.14.2..v2.15.0 inventory/sample
$ vi inventory/mycluster/xxx
$ ansible-playbook -i ./inventory/mycluster/hosts.yaml ./upgrade-cluster.yml \
     | tee -a upgrade-$(date +%s)-from-1.18.10-to-1.19.7.log
$ git checkout v2.15.1
$ git diff v2.15.0..v2.15.1 inventory/sample
$ vi inventory/mycluster/xxx
$ ansible-playbook -i ./inventory/mycluster/hosts.yaml ./upgrade-cluster.yml \
     | tee -a upgrade-$(date +%s)-from-1.19.7-to-1.19.9.log

Having reached the last Kubespray release, we may finish with upgrading and rebooting our nodes. One after the other:

$ kubectl cordon node1
$ kubectl drain --delete-emptydir-data --ignore-daemonsets node1
$ ssh root@node1
# apt-get upgrade
# apt-get dist-upgrade
# reboot
$ kubectl uncordon node1

In my case, having disabled most Kubespray applications, I would also make sure the latest Ingress Controller, Registry, RBD & CephFS Provisioners are up to date:

$ vi roles/download/defaults/main.yml
$ find roles/kubernetes-apps/ingress_controller/ingress_nginx/ -name '*.j2'
$ vi roles/kubernetes-apps/ingress_controller/ingress_nginx/templates/ds-ingress-nginx-controller.yml.j2
$ kubectl edit -n ingress-nginx ds/ingress-nginx

In the end, upgrading that cluster from 1.18.3 to 1.19.9 took me about 10 hours. Though I suspect I could have gone straight to 1.19.9 and Kubespray v2.15.1. This being my first time with those playbooks, I would rather take my time and repeat until I’m confident enough with them.
And while it was not on my mind in the first place, I also took a couple of hours to upgrade another cluster of mine, 11 Raspberry Pi nodes I deployed in January, from 1.19.3 to 1.19.9.

Having played with Kubespray for about a year, I was pretty confident following their docs and releases wouldn’t be an issue. Still, it’s a relief having gone through those upgrades.
Otherwise working with OpenShift 4, it’s kind of amazing to see a Kubernetes cluster upgrading without all the outages you would see with OpenShift: etcd operator and cluster upgrading, the Kubernetes API, the OpenShift API, the SDN, CSI, the OAuth operator, … nodes drain and reboot.
Kubespray upgrades are way smoother, you decide when to upgrade applications and operators. OpenShift 3 and openshift-ansible simplicity, without its unreliability.

Kubernetes & Ceph on Raspberry Pi

Having recently deployed yet another Kubernetes cluster using Kubespray, experimenting with a Raspberry Pi lab, I’ve been looking into an issue I previously had with Raspbian.

The issue being that Raspbian does not ship with the rbd kernel module, which is usually necessary to attach rbd devices out of a Ceph cluster.
One way to get around this would obviously be to rebuild the kernel, though I’m usually reluctant to do so.

Digging further, it appears that as an alternative to the rbd kernel module, we may use the nbd one, which does ship with Raspbian.


Here is how we may proceed:


# apt-get install ceph-common rbd-nbd
# scp -p root@mon1:/etc/ceph/ceph.conf /etc/ceph
# scp -p root@mon1:/etc/ceph/ceph.client.admin.keyring /etc/ceph
# rbd -p kube ls
# rbd-nbd map kube/kubernetes-dynamic-pvc-d213345c-c5e8-11ea-ab48-ae4bf5a40627
# mount /dev/nbd0 /mnt
# umount /mnt
# rbd-nbd unmap /dev/nbd0


Now, that’s a first step. The next one would be to get this working with Kubernetes.
Lately, the CSI (Container Storage Interface) is being promoted up to a point that the Kubernetes controller manager image no longer ships with Ceph binaries: one controller deals with device provisioning, while another attaches and releases Ceph block devices on behalf of our nodes.

Fortunately, while it is not official yet, all of the container images required can be found, searching on GitHub and DockerHub.
I did publish a copy of the configuration files required to set up Ceph rbd provisioning and device mapping on Kubernetes. Note that your nodes must run an arm64 version of the Raspbian image, which is currently in beta, though it has been working pretty well as far as I could see.


This is yet another victory for Kubernetes, against products such as OpenShift: Lightweight, modular, portable, easy to deploy, … Efficient.

Redeploy Kubernetes Nodes with KubeSpray

After suffering a disk outage, I had to reinstall a KVM node from scratch, and the Kubernetes master node it was hosting.

The process to deal with such an outage is fairly well documented, in the KubeSpray repository. Though I’ve had a couple complications we will discuss here.


Backups are for losers. This is a lab I use for R&D, nothing I can’t redeploy. We can start with step two, provisioning new machines to serve as a replacement master, and a second one that would be an ingress node. We would re-use DNS names and IP addresses formerly used by the machines that were lost, to keep things simple.

We would then edit the inventory that was used to bootstrap our cluster. Make sure the faulty master node is the last one listed, in both kube-master and etcd host groups.

Then, the doc would tell you to “Move any broken etcd nodes into the broken_etcd group, make sure the etcd_member_name variable is set“. First remark here: don’t move out nodes of any group. Doc would then tell you to apply a playbook against the etcd and kube-master groups, it would make no sense for those not to include the node we’re recovering. Create the broken_kube-master and broken_etcd host groups, that would only include the name for the node we’re redeploying.

I assumed the etcd_member_name variable needed to be set for the node I was recovering, and thought it would be clever to set it in a group_vars/broken_etcd.yaml, to avoid variables in inventories, or creating a host_vars directory. Beware: this would break etcd configuration on all nodes. Somehow, the broken_etcd variable applies to all members. In such a case, fixing the /etc/etcd.env configuration file should get you back up.

Once done preparing your inventory, we would proceed with applying the recovery playbook:

ansible-playbook -i inventory/my/hosts.yml -l etcd,kube-master -e etcd_retries=300 recover-control-plane.yml

After a short while, the recovering node was properly added back to the etcd cluster, though it was still unreachable from my other masters. Turns out the ETCD_INITIAL_ADVERTISE_PEER_URLS variable was not set, in the newly installed etcd.env. In system logs (or using netstat), you could see etcd only binding to the loopback interface. Adding the proper URL in there, then restarting etcd was enough, data got replicated pretty quickly.
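A quick way to spot that kind of omission is to check the environment file for the variable. The following is a minimal sketch; etcd_env_has is a made-up helper name, and the URL shown in the usage comment only illustrates the expected form, with 2380 being etcd’s default peer port.

```shell
#!/bin/sh
# Sketch: succeed when a variable is set to a non-empty value in an
# etcd environment file.
# $1: path to the environment file, e.g. /etc/etcd.env
# $2: variable name
etcd_env_has() {
    env_file=$1
    var=$2
    grep -q "^${var}=." "$env_file"
}

# Usage:
#   etcd_env_has /etc/etcd.env ETCD_INITIAL_ADVERTISE_PEER_URLS \
#       || echo "peer URL missing, expected something like https://<node-ip>:2380"
```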

Playbook crashed a little later. Etcd wasn’t an issue really: it kept running while I was fixing my configurations. The playbook then failed trying to re-deploy the Ceph RBD provisioner. I don’t want to do these anyway, so I went back to editing my group_vars, disabling any third-party integration (certmanager, ingress, ceph/cephfs, …).

At that stage, etcd is back up, instead of re-applying the recovery playbook, I went back to the usual cluster.yml, which did the rest:

ansible-playbook -i inventory/my/hosts.yml -l etcd,kube-master cluster.yml

In the end, we can see the node was re-created, though API still lists it as NotReady. Checking logs, calico was failing to start: the file installed in /etc/cni/net.d/10-calico.conflist doesn’t look right. Checking on my other nodes, I could confirm they all have the same copy, which I installed over that provisioned by KubeSpray. Wait a minute, and the node is back up.

Next, as I also lost an ingress node, and to try something new, I would apply the scale.yml playbook. Note this implies the node would not be deleted from the API, and we’re expecting its old certificates to be re-used:

ansible-playbook -i inventory/my/hosts.yml scale.yml

In the end, we can confirm, on the node we re-deployed, that /etc/ssl/etcd/ssl/ holds a pair of files dating back to the initial node deployment. Also, I’m pleased to see calico properly started up, no need to patch configurations this time.

I was still left with containers in CrashLoopBackOff, unable to contact the Kubernetes API, on the nodes I redeployed. I first tried to restart the corresponding Calico containers, then those still unable to start. Nothing I wouldn’t have seen with OpenShift.


From switching off my faulty KVM host to change its disk, re-install it using PXE, re-deploy it using puppet, then re-deploying Kubernetes nodes, I lost about 3 hours. Nevertheless, this was a good experience.

KubeSpray might not be perfect, though having worked with OpenShift 3, I don’t mind debugging playbooks as they run – and frankly, this went so much better than the usual OpenShift crash recovery.

Deploying Kubernetes with KubeSpray

I should first admit OpenShift 4 is slowly recovering from its architectural do-over. I’m still missing something that would be production ready, and quite disappointed by the waste of resources, violent upgrades, broken CSI, somewhat unstable RH-CoreOS, a complicated deployment scheme when dealing with bare-metal, … among less critical bugs.

OpenShift 3 is still an interesting platform hosting production workloads, although its being based on Kubernetes 1.11 makes it quite an old version already.

After some experimentation on a Raspberry-Pi lab, I figured I would give Kubernetes a try on x86. Doing so, I would be looking at KubeSpray.


If you’re familiar with OpenShift 3 cluster deployments, you may have been using openshift-ansible already. Kube-spray is a similar solution, focused on Kubernetes, simplifying the process of bootstrapping, scaling and upgrading highly available clusters.

Currently, kube-spray allows for deploying Kubernetes with container runtimes such as docker, cri-o, containerd, SDN based on flannel, weave, calico, … as well as a registry, some nginx based ingress controller, certs manager controller, integrated metrics, or the localvolumes, rbd and cephfs provisioner plugins.

Comparing with OpenShift 4, the main missing components would be the cluster and developer consoles, RBAC integrating with users and groups from some third-party authentication provider. Arguably, the OLM, though I never really liked that one — makes your operators deployment quite abstract, and complicated to troubleshoot, as it involves several namespaces and containers, … The Prometheus Operator, that could still be deployed manually.
I can confirm everything works perfectly deploying on Debian Buster nodes, with containerd and calico. Keeping pretty much all defaults in place and activating all addons.


The sample variables shipping with kube-spray are pretty much on point. We would create an inventory file, such as the following:

        infra.utgb/zone: momos-adm
        infra.utgb/zone: thanatos-adm
        infra.utgb/zone: moros-adm
      node_labels: "true"
        infra.utgb/zone: momos-adm
      node_labels: "true"
        infra.utgb/zone: thanatos-adm
      node_labels: "true"
        infra.utgb/zone: moros-adm
      node_labels: "true"
        infra.utgb/zone: momos-adm
      node_labels: "true"
        infra.utgb/zone: moros-adm
      node_labels: "true"
        infra.utgb/zone: momos-adm
      node_labels: "true"
        infra.utgb/zone: moros-adm
        hosts: {}

Then, we’ll edit the sample group_vars/etcd.yml:

etcd_compaction_retention: "8"
etcd_metrics: basic
etcd_memory_limit: 5GB
etcd_quota_backend_bytes: 2147483648
# ^ WARNING: sample var tells about "2G"
# which results in etcd not starting (deployment_type=host)
# journalctl shows errors such as:
# > invalid value "2G" for ETCD_QUOTA_BACKEND_BYTES: strconv.ParseInt: parsing "2G": invalid syntax
# Also note: 2147483648 bytes is 2G; setting 20G would take 21474836480.
etcd_deployment_type: host
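Since etcd expects a plain number of bytes here, the value can be computed rather than typed by hand. This is a small sketch, with gib_to_bytes being a made-up helper name and sizes counted in binary gigabytes.

```shell
#!/bin/sh
# Sketch: convert a size in GiB to the byte count expected by
# etcd_quota_backend_bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Usage:
#   gib_to_bytes 2    # prints 2147483648
#   gib_to_bytes 20   # prints 21474836480
```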

Next, common variables in group_vars/all/all.yml:

etcd_data_dir: /var/lib/etcd
bin_dir: /usr/local/bin
kubelet_load_modules: true
additional_no_proxy: "*,"
http_proxy: ""
https_proxy: "{{ http_proxy }}"
download_validate_certs: False
cert_management: script
download_container: true
deploy_container_engine: true
  port: 6443
loadbalancer_apiserver_localhost: false
loadbalancer_apiserver_port: 6443

We would also want to customize the variables in group_vars/k8s-cluster/k8s-cluster.yml:

kube_config_dir: /etc/kubernetes
kube_script_dir: "{{ bin_dir }}/kubernetes-scripts"
kube_manifest_dir: "{{ kube_config_dir }}/manifests"
kube_cert_dir: "{{ kube_config_dir }}/ssl"
kube_token_dir: "{{ kube_config_dir }}/tokens"
kube_users_dir: "{{ kube_config_dir }}/users"
kube_api_anonymous_auth: true
kube_version: v1.18.3
kube_image_repo: ""
local_release_dir: "/tmp/releases"
retry_stagger: 5
kube_cert_group: kube-cert
kube_log_level: 2
credentials_dir: "{{ inventory_dir }}/credentials"
kube_api_pwd: "{{ lookup('password', credentials_dir + '/kube_user.creds length=15 chars=ascii_letters,digits') }}"
    pass: "{{ kube_api_pwd }}"
    role: admin
    - system:masters
kube_oidc_auth: false
kube_basic_auth: true
kube_token_auth: true
kube_network_plugin: calico
kube_network_plugin_multus: false
kube_network_node_prefix: 24
kube_apiserver_ip: "{{ kube_service_addresses|ipaddr('net')|ipaddr(1)|ipaddr('address') }}"
kube_apiserver_port: 6443
kube_apiserver_insecure_port: 0
kube_proxy_mode: ipvs
# using metallb, set to true
kube_proxy_strict_arp: false
kube_proxy_nodeport_addresses: []
kube_encrypt_secret_data: false
cluster_name: cluster.local
ndots: 2
kubeconfig_localhost: true
kubectl_localhost: true
dns_mode: coredns
enable_nodelocaldns: true
nodelocaldns_health_port: 9254
enable_coredns_k8s_external: false
coredns_k8s_external_zone: k8s_external.local
enable_coredns_k8s_endpoint_pod_names: false
system_reserved: true
system_memory_reserved: 512M
system_cpu_reserved: 500m
system_master_memory_reserved: 256M
system_master_cpu_reserved: 250m
deploy_netchecker: false
skydns_server: "{{ kube_service_addresses|ipaddr('net')|ipaddr(3)|ipaddr('address') }}"
skydns_server_secondary: "{{ kube_service_addresses|ipaddr('net')|ipaddr(4)|ipaddr('address') }}"
dns_domain: "{{ cluster_name }}"
kubelet_deployment_type: host
helm_deployment_type: host
kubeadm_control_plane: false
kubeadm_certificate_key: "{{ lookup('password', credentials_dir + '/kubeadm_certificate_key.creds length=64 chars=hexdigits') | lower }}"
k8s_image_pull_policy: IfNotPresent
kubernetes_audit: false
dynamic_kubelet_configuration: false
default_kubelet_config_dir: "{{ kube_config_dir }}/dynamic_kubelet_dir"
dynamic_kubelet_configuration_dir: "{{ kubelet_config_dir | default(default_kubelet_config_dir) }}"
- Node
podsecuritypolicy_enabled: true
container_manager: containerd
resolvconf_mode: none
etcd_deployment_type: host

Finally, we may enable additional components in group_vars/k8s-cluster/addons.yml:

dashboard_enabled: true
helm_enabled: false

registry_enabled: false
registry_namespace: kube-system
registry_storage_class: rwx-storage
registry_disk_size: 500Gi

metrics_server_enabled: true
metrics_server_kubelet_insecure_tls: true
metrics_server_metric_resolution: 60s
metrics_server_kubelet_preferred_address_types: InternalIP

cephfs_provisioner_enabled: true
cephfs_provisioner_namespace: cephfs-provisioner
cephfs_provisioner_cluster: ceph
cephfs_provisioner_monitors: ",,"
cephfs_provisioner_admin_id: admin
cephfs_provisioner_secret: key returned by 'ceph auth get client.admin'
cephfs_provisioner_storage_class: rwx-storage
cephfs_provisioner_reclaim_policy: Delete
cephfs_provisioner_claim_root: /volumes
cephfs_provisioner_deterministic_names: true

rbd_provisioner_enabled: true
rbd_provisioner_namespace: rbd-provisioner
rbd_provisioner_replicas: 2
rbd_provisioner_monitors: ",,"
rbd_provisioner_pool: kube
rbd_provisioner_admin_id: admin
rbd_provisioner_secret_name: ceph-secret-admin
rbd_provisioner_secret: key returned by 'ceph auth get client.admin'
rbd_provisioner_user_id: kube
rbd_provisioner_user_secret_name: ceph-secret-user
rbd_provisioner_user_secret: key returned by 'ceph auth get client.kube'
rbd_provisioner_user_secret_namespace: "{{ rbd_provisioner_namespace }}"
rbd_provisioner_fs_type: ext4
rbd_provisioner_image_format: "2"
rbd_provisioner_image_features: layering
rbd_provisioner_storage_class: rwo-storage
rbd_provisioner_reclaim_policy: Delete

ingress_nginx_enabled: true
ingress_nginx_host_network: true
ingress_publish_status_address: ""
ingress_nginx_nodeselector: "true"
ingress_nginx_namespace: ingress-nginx
ingress_nginx_insecure_port: 80
ingress_nginx_secure_port: 443
  map-hash-bucket-size: "512"

cert_manager_enabled: true
cert_manager_namespace: cert-manager

We now have pretty much everything ready. Last, we would deploy a pair of HAProxy nodes, proxying requests to the Kubernetes API. To do so, I would use a pair of VMs running keepalived and haproxy. On both, install the necessary packages and configuration:

apt-get update ; apt-get install keepalived haproxy hatop
cat << EOF>/etc/keepalived/keepalived.conf
global_defs {
  notification_email {
 notification_email_from keepalive@$(hostname -f)
  smtp_connect_timeout 30

vrrp_instance VI_1 {
  state MASTER
  interface ens3
  virtual_router_id 101
  priority 10
  advert_int 101
  authentication {
    auth_type PASS
    auth_pass your_secret
  virtual_ipaddress {
echo net.ipv4.conf.all.forwarding=1 >>/etc/sysctl.conf
sysctl -w net.ipv4.conf.all.forwarding=1
systemctl restart keepalived && systemctl enable keepalived
#hint: use distinct priorities on nodes
cat << EOF>/etc/haproxy/haproxy.cfg
global
  log /dev/log local0
  log /dev/log local1 notice
  chroot /var/lib/haproxy
  stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
  stats timeout 30s
  user haproxy
  group haproxy
  ca-base /etc/ssl/certs
  crt-base /etc/ssl/private
  ssl-default-bind-options no-sslv3

defaults
  log global
  option dontlognull
  timeout connect 5000
  timeout client 50000
  timeout server 50000
  errorfile 400 /etc/haproxy/errors/400.http
  errorfile 403 /etc/haproxy/errors/403.http
  errorfile 408 /etc/haproxy/errors/408.http
  errorfile 500 /etc/haproxy/errors/500.http
  errorfile 502 /etc/haproxy/errors/502.http
  errorfile 503 /etc/haproxy/errors/503.http
  errorfile 504 /etc/haproxy/errors/504.http

listen kubernetes-apiserver-https
  mode tcp
  option log-health-checks
  server master1 check check-ssl verify none inter 10s
  server master2 check check-ssl verify none inter 10s
  server master3 check check-ssl verify none inter 10s
  balance roundrobin
EOF
systemctl restart haproxy && systemctl enable haproxy
cat << EOF>/etc/profile.d/
alias hatop='hatop -s /run/haproxy/admin.sock'
EOF

We may now deploy our cluster:

ansible-playbook -i path/to/inventory cluster.yml

For a 10-node cluster, it shouldn’t take more than an hour.
It is quite nice to see that you can have a reliable Kubernetes deployment with fewer than 60 infra Pods.

I’m also noticing that while the CSI provisioner is being used to create Ceph RBD and CephFS volumes, the host is still in charge of mounting those volumes, which is, in a way, a workaround to the CSI attacher plugins.
Although, on that note, I’ve heard that those issues with blocked volumes during node failures are on their way to being solved, involving a fix to the CSI spec.
Sooner or later, we should be able to use the full CSI stack.

All in all, Kubespray is quite a satisfying solution.
Having struggled quite a lot with openshift-ansible, and not quite satisfied yet with their latest installer, Kubespray definitely feels like a reliable piece of software: the code is well organized, it goes straight to the point, …
Besides, I need a break from CentOS. I’m amazed I did not try it earlier.

Migrating OpenShift 3 Container Runtime

While reaching its end of life, OpenShift 3 remains widely used, and in some cases still more reliable than its successor, OpenShift 4.

OpenShift was historically built on top of Docker, and introduced support for Cri-O, an alternative container runtime. Cri-o integration into OpenShift reached GA with its release 3.9, mid 2018 — based on Kubernetes 1.9 & Cri-o 1.9. Although it has not been without a few hiccups.

As of today, there are still a few bugs involving RPC overflows when lots of containers are running on a Cri-O node, which could cause operations addressing all containers to fail (eg: drains). There are also some SDN corruptions that I suspect to be directly related to Cri-O, and a pending RFE to implement SELinux audit logging, similar to what already exists for Docker. And the fact OpenShift 4 drops Docker support, while ideologically commendable, is quite a bold move right now, considering the youth of Cri-O.


Lately, a customer of mine contacted me regarding a cluster I had helped them deploy. Mid 2019, an architect recommended they go with OpenShift 3.11, Cri-O, and GlusterFS CNS storage, aka OCS, OpenShift Container Storage. We set it up, and the cluster had been running for almost a year when the customer opened a case with their support, complaining about an issue with GlusterFS containers behaving unexpectedly.

After a few weeks of troubleshooting, support got back to the customer, arguing their setup was not supported, pointing us to a KB item none of us was aware of so far: while OpenShift 3.11 is fully supported with both Cri-O and GlusterFS CNS storage, their combination is not. Only Docker may be used with GlusterFS.

When realizing this, we had to come up with a plan, migrating container runtime from Cri-O to Docker, on any OpenShift node hosting GlusterFS, so support would keep investigating the original issue. Lacking any documentation covering such a migration, I’ve been deploying a lab, reproducing my customer’s cluster.


We will simplify it to an 11 nodes cluster: 3 masters, 3 gluster, 3 ingress, 2 computes. The GlusterFS nodes would also be hosting Prometheus and Hawkular. The Ingress nodes would host the Docker registry and OpenShift routers. We would also deploy a Git server and a few dummy Pods on the compute nodes, hosting some sources and generating activity on GlusterFS backed persistent volumes.

Having reproduced customer’s setup as close as I could, I would then repeat the following process, re-deploying all my GlusterFS nodes. First, let’s pick a node and drain it:

$ oc adm cordon gluster1.demo
$ oc adm drain gluster1.demo --ignore-daemonsets --delete-local-data

Next, we will connect to that node, stop OpenShift services, container runtimes and dnsmasq, purge some packages, … This will not clean up everything, though it would be good enough for us:

# systemctl stop atomic-openshift-node
# systemctl stop crio
# systemctl stop docker
# systemctl disable atomic-openshift-node
# systemctl disable crio
# systemctl disable docker
# grep BOOTSTRAP_CONFIG /etc/sysconfig/atomic-openshift-node
# cp -f /etc/origin/node/resolv.conf /etc/
# systemctl stop dnsmasq
# systemctl disable dnsmasq
# yum -y remove criu docker atomic-openshift-excluder atomic-openshift-docker-excluder cri-tools \
    atomic-openshift-hyperkube atomic-openshift-node docker-client cri-o atomic-openshift-clients \
# rm -fr /etc/origin /etc/dnsmasq.d/* /etc/sysconfig/atomic-openshift-node.rpmsave
# reboot

Once the node has rebooted, we may connect back, confirm DNS resolution still works, that container runtimes are gone, … Then we will delete the node from the API:

$ oc delete node gluster1.demo

Next, we would edit our Ansible inventory, reconfiguring that node to only use Docker. In the inventory file, we would add openshift_use_crio=False to that node’s variables, overriding the default defined in our group_vars/OSEv3.yaml.

We would also change the openshift_node_group_name variable, to remove the Cri-O specifics from that node’s kubelet configuration. Note that in some cases, this could involve editing some custom openshift_node_groups definition. For most common deployments, we may only switch the node group name from a crio variant to its docker equivalent (eg: from node-config-infra-crio to node-config-infra).

Finally, still editing the Ansible inventory, we would move our migrating node definition out of the nodes group and into the new_nodes one. Doing so, if you never had to scale that cluster before, be careful that this group should inherit your custom OSEv3 settings: maybe set it as a child of the OSEv3 host group, though make sure it’s not a member of the nodes one. At that stage, it is also recommended to have pinned both OpenShift and GlusterFS versions, down to their patch number. In our case, we’re using OCP 3.11.161 and OCS 3.11.4.

Make sure the node groups configuration is up to date:

$ oc delete -n openshift-node configmap custom-node-group-gfs1 #not necessary if using default node groups
$ ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-master/openshift_node_group.yml

Then, we may proceed as if adding a new node to our cluster:

$ ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-node/scaleup.yml

As soon as the node has joined back our cluster, the GlusterFS container we were missing should start, using the exact same local volumes and configuration, only now it uses Docker.

Once that GlusterFS Pod is marked back healthy, rsh into any GlusterFS container and query for your volumes health:

$ oc rsh -n glusterfs-namespace ds/glusterfs-clustername
sh-4.2# gluster volume list | while read vol; do
gluster volume heal "$vol" info;
done
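
To make that check scriptable, the heal output can be reduced to a single number. The helper below is only a sketch: `count_heal_entries` is a name made up for this post, not a Gluster command, and it assumes each brick reports a `Number of entries:` line in the output of `gluster volume heal <vol> info`.

```shell
# Sum the "Number of entries:" counters read on stdin.
# Non-numeric values (a brick reporting "-" while offline) are ignored.
count_heal_entries() {
  awk -F': *' '
    /^Number of entries:/ { if ($2 ~ /^[0-9]+$/) total += $2 }
    END { print total + 0 }
  '
}
```

From inside a GlusterFS container, `gluster volume heal "$vol" info | count_heal_entries` should print 0 once a volume has nothing left to heal. A brick that is offline does not count as healthy here, so keep reading the full output as well.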

Internal healing mechanisms may not fix all issues: be sure your cluster is healthy before migrating another node. Meanwhile, we would edit the Ansible inventory again, moving our node out of the new_nodes group and back into its original location.

Repeat with all nodes you need to migrate. Eventually, the openshift_use_crio definition could be moved into some host group settings, avoiding multiple definitions in nodes variables.

To further confirm we were not leaving the cluster in some inconsistent state, I’ve later upgraded that lab, to OCP 3.11.200 and OCS 3.11.5, with only one outstanding note: the atomic-openshift-excluder package was missing, on the nodes I did migrate. While it is installed during cluster deployment, it appears this is not the case during cluster scale outs. Could be a bug with openshift-ansible roles or playbooks: in doubt, make sure to install that package manually afterwards.


Overall, everything went great. While undocumented, this process is nothing extraordinary.

As for migrating to Docker-backed GlusterFS containers, I did reproduce the issue the customer was complaining about, as well as another one, related to GlusterFS arbiter bricks space exhaustion.

Thank science, OCS4 is now based on Rook, and Ceph.


Today we’ll take a quick look at KubeVirt, a Kubernetes native virtualization solution.

While OpenShift and Kubernetes have been all about containers, as of 2018, we’ve started hearing about some weird idea: shipping virtual machines into containers.

Today, KubeVirt is fairly well integrated into OpenShift, which has its own Operator.

If like me, you’re running OpenShift on KVM guests, you’ll first have to make sure nested virtualization was enabled. With an Intel processor, we would look for the following:

$ cat /sys/module/kvm_intel/parameters/nested

Or using AMD:

$ cat /sys/module/kvm_amd/parameters/nested

Unless the above returns with `Y` or `1`, we need to enable nested
virtualization. First, shut down all guests. Then, reload your KVM module:

# modprobe -r kvm_intel
# modprobe kvm_intel nested=1
# cat /sys/module/kvm_intel/parameters/nested
# cat <<EOF >/etc/modprobe.d/kvm.conf
options kvm_intel nested=1
EOF

With AMD, use instead:

# modprobe -r kvm_amd
# modprobe kvm_amd nested=1
# cat /sys/module/kvm_amd/parameters/nested
# cat <<EOF >/etc/modprobe.d/kvm.conf
options kvm_amd nested=1
EOF
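
The Intel and AMD checks above only differ by the module name, so they can be wrapped in one small helper. This is a sketch under assumptions: `nested_enabled` is a name invented here, and it simply treats `Y` or `1` as enabled, as described earlier.

```shell
# Succeed when the given "nested" parameter file reports nested
# virtualization as enabled ("Y" or "1"), fail otherwise.
nested_enabled() {
  param_file="$1"
  [ -r "$param_file" ] || return 1
  case "$(cat "$param_file")" in
    Y|1) return 0 ;;
    *) return 1 ;;
  esac
}
```

On the hypervisor, `nested_enabled /sys/module/kvm_intel/parameters/nested || echo "nested virtualization is off"` would tell us whether the module needs reloading. Use the `kvm_amd` path on AMD hosts.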

Reboot your guests, and confirm you can now find a `/dev/kvm` device:

$ ssh core@compute1.friends
Red Hat Enterprise Linux CoreOS 42.81.20191113.0
$ grep vmx /proc/cpuinfo
flags : xxx
$ ls /dev/kvm

Confirm OpenShift node-capability-detector did discover those devices:

$ oc describe node
cpu: 7500m 110 110 110

Now, from the OperatorHub console, we would install the KubeVirt operator. While writing these lines, there are still some bugs, so prefer using a lab cluster for this.

Next, we’ll migrate a test KVM instance, from a regular hypervisor to OpenShift. Here, the first thing we would want to do is to provision a DataVolume.

DataVolumes are built on top of PersistentVolumeClaims, they’re meant to help dealing with persistent volumes, implementing data provisioning.

There are two ways to go about this. Either we host our disks on a web server, in which case we may use the following DataVolume definition:

kind: DataVolume
  name: bluemind-demo
  namespace: wsweet-demo
    - ReadWriteOnce
      storage: 20Gi

Or we could use the virtctl client uploading an image from our system into OpenShift:

$ virtctl image-upload dv bluemind-demo --wait-secs=600 --size=8Gi --insecure --block-volume --image-path=/var/lib/libvirt/images/bm40-template.raw
DataVolume wsweet-demo/bluemind-demo created
Waiting for PVC bluemind-demo upload pod to be ready...
Pod now ready
Uploading data to

The process of uploading a volume would start some temporary Pod, which would use a pair of PVC: one that would receive the final image, the other serving as a temporary storage while upload is running.

Once our image was uploaded, we would be able to create a VirtualMachine object:

kind: VirtualMachine
  name: bluemind-demo
  namespace: wsweet-demo
  running: false
        name: bluemind-demo
          - disk:
            bus: virtio
          name: rootfs
          - name: default
            masquerade: {}
              memory: 8Gi
              cpu: "1"
      - name: default
        pod: {}
terminationGracePeriodSeconds: 600
      - dataVolume:
          name: bluemind-demo
        name: rootfs

$ oc get vm
bluemind-demo 2s false
$ virtctl start bluemind-demo
$ oc describe vm bluemind-demo
$ oc get vmi
bluemind-demo 3s Scheduling
$ oc get pods
virt-launcher-bluemind-demo-8kcxz 0/1 ContainerCreating 0 38s

Once that Pod is running, we should be able to attach our guest VNC console:

$ virtctl vnc bluemind-demo

Finish configuring your system: you may have to rename your network interfaces, reset IP addresses, or fix DNS resolution integrating with OpenShift. Here, we could use cloud-init, or script our own contextualization, installing the OpenShift Service CA, …

OpenShift 4 – Baremetal Deployment

Once again, quick post regarding OpenShift, today experimenting with the new installer, and OpenShift 4.

First, let’s remind ourselves that OKD 4 has not yet been released. I would be using my RedHat account credentials pulling images. I usually refuse to touch anything that is not strictly open source (and freely distributed), though I would make an exception here, as I’ve been waiting for OpenShift 4 for almost a year now. Back when my first OpenShift PR got refused, due to their focus being on OpenShift 4, … Now I’m visiting customers for OpenShift 4, I need my own lab to experiment with.

Prepare Hardware

Dealing with a baremetal deployment, we would need to prepare a subnet with its DHCP and PXE servers, a pair of LoadBalancers, and several instances for OpenShift itself.
The following assumes a VLAN was created, and provides isc-dhcp-server, tftpd-hpa, bind/nsd and haproxy configuration snippets.

OpenShift nodes would include a bootstrap node (only required during deployment, would be shut down afterwards), three master nodes, and as many worker nodes as we can allocate.
Bootstrap and master nodes should ship with at least 4 vCPU and 16G RAM, while workers could go with 2 vCPU and 8G RAM. Docs mention provisioning those nodes with at least 120G of disk storage, though this does not seem to be mandatory.
Those nodes would be running on top of KVM hypervisors.

Download Assets

We would start downloading a few assets out of RedHat cloud portal.

We would find links to RedHat CoreOS PXE sources – a kernel, an initramfs, and a pair of compressed filesystems that would be used installing CoreOS to our nodes. We would install those to our PXE server later.

We would also fetch a pull secret, that would allow us downloading images out of RedHat and Quay registries.

Finally, we would retrieve the latest oc client, as well as the openshift-install binaries.


Next, we would prepare DNS records for our OpenShift cluster and nodes.

Contrary to OpenShift 3, we would not be able to use customized names for the cluster API or its applications.

We would first create a zone for cluster host names,

bootstrap A
master1 A
master2 A
master3 A
infra1 A
infra2 A
infra3 A
compute1 A
compute2 A
compute3 A
compute4 A
compute5 A
haproxy1 A
haproxy2 A

Next, we would create a zone for the cluster itself,

api A
api A
api-int A
api-int A
*.apps A
*.apps A
etcd-0 A
etcd-1 A
etcd-2 A
_etcd-server-ssl._tcp 86400 IN SRV 0 10 2380
_etcd-server-ssl._tcp 86400 IN SRV 0 10 2380
_etcd-server-ssl._tcp 86400 IN SRV 0 10 2380

And corresponding reverse records, in

10 PTR
11 PTR
12 PTR
13 PTR
14 PTR
15 PTR
20 PTR
21 PTR
22 PTR
23 PTR
24 PTR
150 PTR
151 PTR

Don’t forget to reload your zones before going further.
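
The three etcd SRV records follow a fixed pattern, each pointing at port 2380 on one of the `etcd-N` names. As a rough sketch, they could be generated rather than typed. `etcd_srv_records` is a hypothetical helper, and `intra.example.com` in the usage note is a placeholder cluster domain, not the one used in this lab.

```shell
# Print the three etcd SRV records for a cluster zone.
# $1 is the cluster domain (cluster name + base domain), without trailing dot.
etcd_srv_records() {
  domain="$1"
  for i in 0 1 2; do
    printf '_etcd-server-ssl._tcp 86400 IN SRV 0 10 2380 etcd-%s.%s.\n' "$i" "$domain"
  done
}
```

Running `etcd_srv_records intra.example.com` prints three lines that can be pasted into the cluster zone, next to the matching `etcd-0`, `etcd-1` and `etcd-2` A records.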


Next, we would configure our DHCP server. First, we would setup static leases for our OpenShift nodes:

host bootstrap-eth0 {
    hardware ethernet 52:54:00:e1:48:6a;
host master0-eth0 {
    hardware ethernet 52:54:00:be:c0:a4;
host master1-eth0 {
    hardware ethernet 52:54:00:79:f3:0f;
host master2-eth0 {
    hardware ethernet 52:54:00:69:74:8c;
host infra1-eth0 {
    hardware ethernet 52:54:00:d3:40:dc;
host infra2-eth0 {
    hardware ethernet 52:54:00:20:f0:af;
host infra3-eth0 {
    hardware ethernet 52:54:00:81:83:25;
host compute1-eth0 {
    hardware ethernet 52:54:00:48:77:48;
host compute2-eth0 {
    hardware ethernet 52:54:00:88:94:94;
host compute3-eth0 {
    hardware ethernet 52:54:00:ff:37:14;
host compute4-eth0 {
    hardware ethernet 52:54:00:c7:46:2d;
host compute5-eth0 {
    hardware ethernet 52:54:00:e1:60:5b;

Next, we would setup a subnet for OpenShift nodes, enabling PXE boot options:

subnet netmask
    option routers;
    option domain-name "";
    option domain-name-servers,;
    filename "pxelinux.0";

Don’t forget to restart your DHCP server.


Now, we would generate some configurations to be served to PXE clients.

First, we would create a configuration file, mandatory for baremetal deployments, install-config.yaml:

apiVersion: v1
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  name: intra
networking:
  clusterNetwork:
  - cidr:
    hostPrefix: 23
  networkType: OpenShiftSDN
platform:
  none: {}
pullSecret: <>
sshKey: 'ssh-rsa <some-public-key-of-yours>'

If you haven’t already, extract the openshift-install binary from the archive downloaded out of RedHat cloud portal.

mkdir install-directory
cp -p install-config.yaml install-directory/
./openshift-install create manifests --dir=./install-directory
sed -i 's|mastersSchedulable:.*|mastersSchedulable: false|' \
./openshift-install create ignition-configs --dir=./install-directory/
scp -p install-directory/*.ign root@pxe-server:/srv/tftpboot/ocp4/

Note that the install-directory/auth subfolder includes a kubeconfig file, that can be used with the oc and kubectl clients, querying our cluster API, as well as kubeadmin default password logging into the cluster console.


Next, we would configure our PXE server booting RedHat CoreOS nodes.

wget -O /srv/tftpboot/ocp4/kernel \
wget -O /srv/tftpboot/ocp4/initrd \
wget -O /srv/tftpboot/ocp4/metalbios.raw.gz \
cat <<EOF >/srv/tftproot/boot-screens/ocp4.cfg
menu title OCP4 RH-CoreOS Systems
  menu title OCP4 RH-CoreOS Systems
    menu label OCP4 RH-CoreOS Systems
    menu exit
  label -
    menu label 4.2.0 x86_64 - bootstrap
    kernel installers/ocp4-rhcos-4.2.0/x86_64/linux
    append initrd=installers/ocp4-rhcos-4.2.0/x86_64/initrd-raw ip=dhcp rd.neednet=1 coreos.inst=yes coreos.inst.install_dev=vda coreos.inst.image_url= coreos.inst.ignition_url=
  label -
    menu label 4.2.0 x86_64 - master
    kernel installers/ocp4-rhcos-4.2.0/x86_64/linux
    append initrd=installers/ocp4-rhcos-4.2.0/x86_64/initrd-raw ip=dhcp rd.neednet=1 coreos.inst=yes coreos.inst.install_dev=vda coreos.inst.image_url= coreos.inst.ignition_url=
  label -
    menu label 4.2.0 x86_64 - worker
    kernel installers/ocp4-rhcos-4.2.0/x86_64/linux
    append initrd=installers/ocp4-rhcos-4.2.0/x86_64/initrd-raw ip=dhcp rd.neednet=1 coreos.inst=yes coreos.inst.install_dev=vda coreos.inst.image_url= coreos.inst.ignition_url=
menu end
EOF

Note that our PXE server also includes its HTTP server, hosting ignition configs and CoreOS installation image URL. In theory, all you need here is an HTTP server, not necessarily related to your PXE server.
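
The three menu entries only differ by the ignition file they point to, so the `append` line can be derived from the node role. The following is a sketch, assuming the image and ignition files are served from the same HTTP location. `pxe_append` is a helper name made up here, and `http://pxe.example.com/ocp4` in the usage note is a placeholder URL. The file names come from the steps above: `metalbios.raw.gz`, and the `bootstrap.ign`, `master.ign` and `worker.ign` files generated by openshift-install.

```shell
# Print the pxelinux "append" line for a given role.
# $1: role (bootstrap, master or worker), $2: base URL of the HTTP server.
pxe_append() {
  role="$1"
  base_url="${2%/}"   # drop a trailing slash, if any
  printf 'append initrd=installers/ocp4-rhcos-4.2.0/x86_64/initrd-raw ip=dhcp rd.neednet=1 coreos.inst=yes coreos.inst.install_dev=vda coreos.inst.image_url=%s/metalbios.raw.gz coreos.inst.ignition_url=%s/%s.ign\n' \
    "$base_url" "$base_url" "$role"
}
```

For instance, `pxe_append master http://pxe.example.com/ocp4` prints the line to use for master nodes.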

Load Balancers

Before we can deploy OpenShift, we would setup its LoadBalancers. Here, we would use HAProxy, with the following configuration:

global
  maxconn 20000
  log /dev/log local0 info
  chroot /var/lib/haproxy
  pidfile /var/run/
  user haproxy
  group haproxy
  stats socket /var/lib/haproxy/stats

defaults
  mode http
  log global
  option httplog
  option dontlognull
  option forwardfor except
  option redispatch
  retries 3
  timeout http-request 10s
  timeout queue 1m
  timeout connect 10s
  timeout client 300s
  timeout server 300s
  timeout http-keep-alive 10s
  timeout check 10s
  maxconn 20000

listen stats
  bind :9000
  mode http
  stats enable
  stats uri /

frontend k8s-api
  bind *:6443
  default_backend k8s-api
  mode tcp
  option tcplog

backend k8s-api
  balance source
  mode tcp
  server bootstrap check
  server master0 check
  server master1 check
  server master2 check

frontend machine-config-server
  bind *:22623
  default_backend machine-config-server
  mode tcp
  option tcplog

backend machine-config-server
  balance source
  mode tcp
  server bootstrap check
  server master0 check
  server master1 check
  server master2 check

frontend apps-tls
  bind *:443
  default_backend apps-tls
  mode tcp
  option tcplog

backend apps-tls
  balance source
  mode tcp
  server router0 check
  server router1 check
  server router2 check

frontend apps-clear
  bind *:80
  default_backend apps-clear
  mode tcp
  option tcplog

backend apps-clear
  balance source
  mode tcp
  server router0 check
  server router1 check
  server router2 check

Don’t forget to start and enable HAProxy service.
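
All four backends share the same shape: source balancing, TCP mode, and one `server` line per node. As a sketch, such a stanza could be generated from a list of names and addresses. `haproxy_backend` is a hypothetical helper, and the `192.0.2.x` addresses in the usage note are documentation placeholders, not the ones from this lab.

```shell
# Print a HAProxy TCP backend stanza.
# Usage: haproxy_backend NAME PORT SERVER=ADDRESS [SERVER=ADDRESS...]
haproxy_backend() {
  name="$1"
  port="$2"
  shift 2
  printf 'backend %s\n  balance source\n  mode tcp\n' "$name"
  for entry in "$@"; do
    # Split "server=address" on the first "="
    printf '  server %s %s:%s check\n' "${entry%%=*}" "${entry#*=}" "$port"
  done
}
```

For example, `haproxy_backend k8s-api 6443 master0=192.0.2.10 master1=192.0.2.11` prints a `backend k8s-api` block with two `server` lines on port 6443.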

Boot Instances

Now we should have everything we need. First boot the bootstrap node using PXE, wait for it to reboot, then boot the three master nodes using PXE.

We would be able to SSH to each node, as the core user, using the SSH key passed to openshift-install earlier. Keep an eye on system logs.

Meanwhile, we could use openshift-install tracking for OpenShift API bootstrap completion:

./openshift-install --dir=./install-directory wait-for bootstrap-complete \
    --log-level info

Eventually, that command would exit, and should confirm our cluster API is now reachable. At that stage, the cluster is not yet done deploying, though we’re getting close.

Next, we would boot our infra nodes in PXE. Keep an eye on certificate signing requests, as we would need to approve those new nodes while joining the cluster:

oc get csr
oc adm certificate approve csr-xxx
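
With several nodes joining at once, picking pending requests by hand gets tedious. The filter below is a minimal sketch: `pending_csrs` is a name invented here, and it assumes the last column of `oc get csr --no-headers` holds the request condition, such as `Pending` or `Approved,Issued`.

```shell
# Read `oc get csr --no-headers` output on stdin and print the
# names of the requests whose condition is still Pending.
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}
```

It could then be chained as `oc get csr --no-headers | pending_csrs | xargs -r oc adm certificate approve` (`-r` is a GNU xargs option). Review the list before approving anything on a cluster you do not fully control.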

Eventually, we should be able to confirm the cluster operators are done deploying.

The only one that would stay in a degraded state would be the image registry operator. Here, we would need to define OpenShift integrated registry storage configuration:

oc edit

To keep it simple, we would stick to an emptyDir storage (volatile), which is not usually recommended.

oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.2.0 True False False 1h36m
cloud-credential 4.2.0 True False False 2h
cluster-autoscaler 4.2.0 True False False 1h56m
console 4.2.0 True False False 1h37m
dns 4.2.0 True False False 2h
image-registry 4.2.0 True False False 49m
ingress 4.2.0 True False False 1h42m
insights 4.2.0 True False False 2h
kube-apiserver 4.2.0 True False False 1h59m
kube-controller-manager 4.2.0 True False False 1h58m
kube-scheduler 4.2.0 True False False 1h59m
machine-api 4.2.0 True False False 2h
machine-config 4.2.0 True False False 2h
marketplace 4.2.0 True False False 1h56m
monitoring 4.2.0 True False False 1h40m
network 4.2.0 True False False 2h
node-tuning 4.2.0 True False False 1h56m
openshift-apiserver 4.2.0 True False False 1h57m
openshift-controller-manager 4.2.0 True False False 1h59m
openshift-samples 4.2.0 True False False 1h55m
operator-lifecycle-manager 4.2.0 True False False 2h
operator-lifecycle-manager-catalog 4.2.0 True False False 2h
operator-lifecycle-manager-packageserver 4.2.0 True False False 1h58m
service-ca 4.2.0 True False False 2h
service-catalog-apiserver 4.2.0 True False False 1h56m
service-catalog-controller-manager 4.2.0 True False False 1h57m
storage 4.2.0 True False False 1h56m
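
Rather than reading that list by eye, we could filter it for operators that are not settled yet. This is a sketch under assumptions: `unhealthy_operators` is a made-up helper name, and it relies on the column order shown by `oc get co` (name, version, available, progressing, degraded, since). An operator that does not report a version yet would shift the columns and show up as unhealthy, which is arguably fine.

```shell
# Read `oc get co --no-headers` output on stdin and print operators
# that are either not Available or Degraded.
unhealthy_operators() {
  awk '$3 != "True" || $5 != "False" { print $1 }'
}
```

Running `oc get co --no-headers | unhealthy_operators` should print nothing once the deployment is complete.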

Eventually, we may boot workers using PXE, until all nodes joined our cluster. We can also terminate the bootstrap node, that is no longer needed.

LDAP Authentication

Finally, we would setup LDAP authentication. By default, OpenShift4 ships with a single kubeadmin user, that could be used during initial cluster configuration.

oc --config ./kubeconfig create secret generic ldap-secret \
    --from-literal=bindPassword=<secret> -n openshift-config
oc --config ./kubeconfig create configmap ldap-ca \
    --from-file=ca.crt=/path/to/ldap-ca-chain.crt -n openshift-config

Having created a Secret with our OpenShift LDAP service account bind password, and a ConfigMap serving the CA chain used to sign our OpenLDAP TLS certificate, we would then import the following OAuth configuration:

apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: LDAP
    mappingMethod: claim
    type: LDAP
    ldap:
      attributes:
        id:
        - dn
        email:
        - mail
        name:
        - sn
        preferredUsername:
        - uid
      bindDN: "cn=openshift,ou=services,dc=example,dc=com"
      bindPassword:
        name: ldap-secret
      ca:
        name: ldap-ca
      insecure: false
      url: "ldaps://,dc=example,dc=com?uid?sub?(&(objectClass=inetOrgPerson)(!(pwdAccountLockedTime=*)))"

Having applied that configuration, we would see Pods from the openshift-authentication namespace restarting. We would then be able to log in using an LDAP account.

OpenShift4 Dashboard

Infra Nodes

Last detail: after deployment, an OpenShift 4 cluster would include master and worker nodes, while OpenShift 3 used to ship with master, infra and compute nodes.

The worker nodes in OpenShift 4 are meant to replace both infra and compute nodes, which could make sense running smaller setups, though I would argue it is not very practical when scaling out. Having a small set of nodes designated to host OpenShift ingress controllers is a good thing, as we only need to configure those IPs as backends for our applications loadbalancers. If we only relied on worker nodes, every time we add new members to our cluster, we would also need to reconfigure our loadbalancer.

Hence, we would create a group of Infra machines, starting with creating a MachineConfigPool, using the following configuration:

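apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: infra
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""
  paused: false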

Having applied that configuration, we would then dump MachineConfig objects applying to worker nodes:

DUMP=$(oc get machineconfig | grep -v rendered | \
  awk '/worker/{print $1}' | tr '\n' ' ')

oc get machineconfig -o yaml $DUMP >machineconfig-infra.yaml

We would then edit machineconfig-infra.yaml content, removing “generated-by” annotations, creationTimestamps, generation, ownerReferences, resourceVersions, selfLink and uid metadata. Replace any remaining mention of “worker” with “infra”. Then apply the resulting objects:

oc apply -f machineconfig-infra.yaml
oc get mc
00-infra 2.2.0 1m
01-infra-container-runtime 2.2.0 1m
01-infra-kubelet 2.2.0 1m
99-infra-ad9f8790-f270-11e9-a34e-525400e1605b-registries 2.2.0 1m
99-infra-ssh 2.2.0 1m

At that stage, the MachineConfig Operator should be rendering a last MachineConfig object, including an exhaustive list of configurations for our infra nodes. Once oc get mc includes that rendered configuration, we would make sure the MachineConfig Operator is done with our MachineConfigPool and start re-labeling nodes accordingly:

oc get mcp
infra rendered-infra-0506920a222781a19fff88a4196deef4 True False False
master rendered-master-747943425e64364488e51d15e5281265 True False False
worker rendered-worker-5e70256103cc4d0ce0162430de7233a1 True False False
oc label node
node/ labeled
oc label node
node/ labeled

From there, our node would be set unschedulable, drained, and rebooted. Our customized MachineConfig should have changed the role label applied when our node boots, which we may confirm once it is done restarting:

oc get nodes
Ready worker 47m v1.14.6+c07e432da
Ready worker 45m v1.14.6+c07e432da
Ready worker 34m v1.14.6+c07e432da
Ready worker 33m v1.14.6+c07e432da
Ready worker 31m v1.14.6+c07e432da
Ready infra 2h v1.14.6+c07e432da
Ready worker 2h v1.14.6+c07e432da
Ready worker 2h v1.14.6+c07e432da
Ready master 2h v1.14.6+c07e432da
Ready master 2h v1.14.6+c07e432da
Ready master 2h v1.14.6+c07e432da

Once our node is back, we would proceed with the next infra node.

We would eventually reconfigure our Ingress Controller deploying OpenShift Routers back to our infra nodes:

oc edit -n openshift-ingress-operator ingresscontroller default
spec:
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""
  replicas: 3

We would then keep track of routers Pods as they’re being re-deployed:

oc get pods -n openshift-ingress -o wide
router-default-86cdb97784-4d72k 1/1 Running 0 14m
router-default-86cdb97784-8f5vm 1/1 Running 0 14m
router-default-86cdb97784-bvvdc 1/1 Running 0 105s

Ceph RBD Storage

Later on, we may want to configure OpenShift interfacing with an existing Ceph cluster, setting up persistent volumes.

While OpenShift 3 used to ship with rbd binaries in the api controller image, while allowing for their installation on OpenShift nodes, this is no longer the case with OpenShift 4. Instead, we would rely on CSI (Container Storage Interface), which is meant to be a more generic interface.

Then, we would need to deploy the Ceph CSI interface to OpenShift:

git clone
oc new-project ceph-csi
for sa in rbd-csi-provisioner rbd-csi-nodeplugin; do
    oc create sa $sa
    oc adm policy add-scc-to-user hostaccess system:serviceaccount:ceph-csi:$sa
    oc adm policy add-scc-to-user privileged system:serviceaccount:ceph-csi:$sa
done
cat ceph-csi/deploy/rbd/kubernetes/v1.14+/csi-*yaml | sed 's|namespace: default|namespace: ceph-csi|g' | oc apply -n ceph-csi -f-
cat <<EOF >config.json
[
  {
    "clusterID": "my-ceph-cluster-id",
    "monitors": [ "", "", "" ]
  }
]
EOF
oc delete cm -n ceph-csi ceph-csi-config
oc create cm -n ceph-csi ceph-csi-config --from-file=config.json=./config.json
cat << EOF >secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-rbd-secret
stringData:
  userID: my-ceph-user-id
  userKey: my-user-key
EOF
oc apply -n default -f secret.yaml
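cat << EOF >storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-storage
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: my-ceph-cluster-id
  pool: kube
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: default
  csi.storage.k8s.io/fstype: xfs
reclaimPolicy: Delete
mountOptions:
- discard
EOF
oc apply -f storageclass.yaml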

At that stage, we would have deployed a DaemonSet of csi-rbdplugin Pods, tasked with attaching and detaching volumes during Pods scheduling and terminations, as well as a Deployment of csi-rbdplugin-provisioner Pods, creating and purging volumes out of Ceph, while managing OpenShift Persistent Volumes.

At that stage, we may create a first Persistent Volume and redeploy OpenShift integrated registry on top of it:

cat <<EOF >registry-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: image-registry
  namespace: openshift-image-registry
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
EOF
oc apply -f registry-pvc.yaml
oc edit
      claim: image-registry-storage
oc get pods -n openshift-image-registry -w


First thing I would regret is the disappearance of rbd binaries from controllers images. As a result, the Ceph provisioner we used to configure with OpenShift3 no longer works. Apparently, CSI provisioners would be recommended instead, though that implementation is kind of slower, and involves quite a lot of Pods.

After deployment, roughly 12G RAM and 4 CPUs are allocated to cluster operators and OpenShift internals.

Another concern may be that all those operators are privileged actors in our cluster. While we usually had to compromise a node to attack a cluster, we now have a lot of operators that might be reached through the API, arguably expanding the OpenShift attack surface.

The dashboard shows a total CPU capacity of “100%”, which is quite useless.

OpenShift 4.2 is based on Kubernetes 1.14. Among other novelties, as compared with OpenShift 3, we could mention Istio reaching GA, or Tekton pipelines.

Docker Images Vulnerability Scan

While several solutions exist for scanning Docker images, I’ve been looking for one that I could deploy and use on OpenShift, integrated into my existing CI chain.

The most obvious answer, working with open source, would be OpenSCAP. However, I’m still largely working with Debian, while OpenSCAP would only check against CentOS databases.

Another popular contender on the market is Twistlock, but I’m not interested in solutions I can’t deploy myself without requesting “a demo” or talking to people in general.

Eventually, I ended up deploying Clair, an open source product from CoreOS that exposes an API.
It queries popular vulnerability databases to populate its own SQL database, and can then analyze Docker image layers posted to its API.

We could deploy Clair to OpenShift, alongside its Postgres database, using that Template.

The main issue I’ve had with Clair, so far, is that its client, clairctl, relies on Docker socket access, which is not something you would grant to any deployment in OpenShift.
And since I wanted to scan my images as part of Jenkins pipelines, I would have my Jenkins master creating scan agents. Allowing Jenkins to create containers with host filesystem access is, in itself, a security issue, as any user who could create a Job would then be able to schedule agents with full access to my OpenShift nodes.

Introducing Klar: a Go-based project I found on GitHub, which can scan images against a Clair service without any special privileges, beyond pulling the Docker image out of your registry and posting its layers to Clair.
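
As a rough sketch of what such a scan looks like: Klar is driven by environment variables and takes the image name as its only argument. The Clair URL below is a placeholder and the scan_image wrapper is hypothetical. As far as I know, Klar exits with 0 when the number of vulnerabilities at or above CLAIR_OUTPUT stays within CLAIR_THRESHOLD, with 1 when it exceeds it, and with 2 when the image could not be analyzed.

```shell
# Sketch: scan one image with Klar against a Clair service.
scan_image() {
    image="$1"
    clair_url="${2:-http://clair:6060}"   # placeholder Clair address
    # Report High and above, tolerate none of them
    CLAIR_ADDR="$clair_url" CLAIR_OUTPUT=High CLAIR_THRESHOLD=0 klar "$image"
    rc=$?
    case "$rc" in
        0) echo "PASS: $image" ;;
        1) echo "FAIL: $image has vulnerabilities above threshold" ;;
        *) echo "ERROR: could not analyze $image" ;;
    esac
    return "$rc"
}
# Usage: scan_image registry.example.com/demo:latest http://clair:6060
```

Returning Klar's own exit code lets a pipeline step fail the build on anything but a clean scan.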

We would build a Jenkins agent re-using OpenShift base image, shipping with Klar.

Having built our Jenkins agent image, we can write another BuildConfig, defining a Parameterized Pipeline.

Jenkins CoreOS Clair Scan
