OpenShift & CephFS

If you’re not yet familiar with it, OpenShift is a container orchestration solution based on Kubernetes. Among other things, it integrates with several storage providers, such as Ceph.

Although GlusterFS is probably the best choice in terms of OpenShift integration, we could argue Ceph is a better pick overall. And while this post doesn’t aim to offer an exhaustive comparison between the two, we could mention GlusterFS split-brains requiring manual recovery, poor block device performance, poor performance when dealing with lots (hundreds) of volumes, and the lack of a kernel-space client for file volumes.

The most common way to integrate Ceph with OpenShift is to register a StorageClass managing RADOS Block Devices, as described in the OpenShift documentation:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  name: ceph-storage
parameters:
  adminId: kube
  adminSecretName: ceph-secret-kube
  adminSecretNamespace: default
  monitors: 10.42.253.110:6789,10.42.253.111:6789,10.42.253.112:6789
  pool: kube
  userId: kube
  userSecretName: ceph-secret-kube
  userSecretNamespace: default
provisioner: kubernetes.io/rbd
reclaimPolicy: Retain

We would also need to create a Secret holding our Ceph client key. First, we would create our client, granting it the proper permissions:

$> ceph auth get-or-create client.kube mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=kube' -o ceph.client.kube.keyring

Next, we would base64-encode our key:

$> awk '/^[ \t]*key/{printf "%s", $3}' ceph.client.kube.keyring | base64
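
Before touching the cluster, the encoding step can be tried against a throwaway keyring. The following is only a minimal sketch: the encode_keyring_key function, the sample file and the sample key are all made up for illustration. It uses printf so that no trailing newline ends up in the encoded value, and strips any line wrapping base64 may add.

```shell
#!/bin/sh
# Hypothetical helper: print the base64-encoded key found in a Ceph keyring file.
encode_keyring_key() {
    # printf (rather than print) avoids encoding a trailing newline;
    # tr removes the line wrapping some base64 implementations add.
    awk '/^[ \t]*key/{printf "%s", $3}' "$1" | base64 | tr -d '\n'
}

# Throwaway keyring holding a made-up key, mimicking the layout ceph writes.
sample=$(mktemp)
printf '[client.kube]\n\tkey = QUJDREVGR0hJSktMTU5PUFFSU1RVVldYWVo=\n' > "$sample"

# Print the value that would go into the Secret's data.key field.
encode_keyring_key "$sample"
echo
rm -f "$sample"
```

Decoding the printed value with base64 -d should give back the key exactly as it appears in the keyring.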

And register our Secret, including our encoded secret:

cat <<EOF | oc apply -n default -f-

apiVersion: v1
data:
  key: <base64-encoded-string>
kind: Secret
metadata:
  name: ceph-secret-kube
type: kubernetes.io/rbd

EOF


The previous configurations would then allow us to dynamically provision block devices when deploying new applications to OpenShift.
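
As an illustration of what such a request could look like, here is a small sketch printing a PersistentVolumeClaim against the ceph-storage class defined above. The render_rbd_pvc helper and the test-rbd claim name are hypothetical, not part of any OpenShift tooling; since ceph-storage is annotated as the default class, the storageClassName line is only there to be explicit.

```shell
#!/bin/sh
# Hypothetical helper: print a PVC manifest requesting RBD-backed storage.
# Arguments: claim name, requested size (for example 1Gi).
render_rbd_pvc() {
    cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  accessModes: [ ReadWriteOnce ]
  resources:
    requests:
      storage: $2
  storageClassName: ceph-storage
EOF
}

# Print the manifest for a made-up claim named test-rbd.
render_rbd_pvc test-rbd 1Gi
```

Piping that output to oc apply -f- in the target project would register the claim, the same way we registered our Secret.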

And while block devices are nice to have when dealing with stateful workloads such as databases, up until now, GlusterFS’s main advantage over Ceph was its ability to provide ReadWriteMany volumes, which can be mounted read-write from several Pods at once, as opposed to ReadWriteOnce volumes, which can only be mounted read-write by a single node, or ReadOnlyMany volumes, which can be shared but without write capabilities.

On the other hand, in addition to RADOS Block Devices, Ceph offers an optional CephFS share, which is similar to NFS or GlusterFS in that several clients can concurrently write to the same directory. And while CephFS is barely mentioned in the OpenShift documentation, Kubernetes officially supports it. Today, we would try to figure out how to make that work with OpenShift.

CephFS has been considered stable since Ceph 12 (Luminous), released a couple of years ago. Since then, I’ve been looking for a practical use case. Here it is.

We would mostly rely on the configurations offered in the kubernetes-incubator external-storage GitHub repository.

First, let’s create a namespace hosting the CephFS provisioner:

$> oc new-project cephfs

Then, in that namespace, we would register a Secret. Note that the CephFS provisioner offered by Kubernetes requires near-admin privileges over your Ceph cluster. For each Persistent Volume registered through the OpenShift API, the provisioner would create a dynamic user with limited privileges over the sub-directory hosting our data. Here, we would just pass it our admin key:

apiVersion: v1
kind: Secret
data:
  key: <base64-encoded-admin-key>
metadata:
  name: ceph-secret-admin

Then, we would create a ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cephfs-provisioner
rules:
- apiGroups: [""]
  resources: ["persistentvolumes"]
  verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "get", "delete"]
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "update"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update", "patch"]
- apiGroups: [""]
  resources: ["services"]
  resourceNames: ["kube-dns", "coredns"]
  verbs: ["list", "get"]

A Role:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cephfs-provisioner
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "get", "delete"]
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]

A ServiceAccount:

$> oc create sa cephfs-provisioner

Which we would then bind to the previously defined ClusterRole and Role:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cephfs-provisioner
subjects:
- kind: ServiceAccount
  name: cephfs-provisioner
  namespace: cephfs
roleRef:
  kind: ClusterRole
  name: cephfs-provisioner
  apiGroup: rbac.authorization.k8s.io

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cephfs-provisioner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cephfs-provisioner
subjects:
- kind: ServiceAccount
  name: cephfs-provisioner

Next, we would allow our ServiceAccount to use the anyuid SecurityContextConstraints:

$> oc adm policy add-scc-to-user anyuid -z cephfs-provisioner

Then, we would create an ImageStream:

$> oc create is cephfs-provisioner

And a BuildConfig patching the cephfs-provisioner image, granting write privileges to the owning group, so that OpenShift’s dynamic users may use our shares:

apiVersion: v1
kind: BuildConfig
metadata:
  name: cephfs-provisioner
spec:
  output:
    to:
      kind: ImageStreamTag
      name: cephfs-provisioner:latest
  source:
    dockerfile: |
      FROM quay.io/external_storage/cephfs-provisioner:latest

      USER root

      RUN sed -i 's|0o755|0o775|g' /usr/lib/python2.7/site-packages/ceph_volume_client.py
    type: Dockerfile
  strategy:
    type: Docker
  triggers:
  - type: ConfigChange

Next, we would create a StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs
provisioner: ceph.com/cephfs
parameters:
  adminId: admin
  adminSecretName: ceph-secret-admin
  adminSecretNamespace: cephfs
  claimRoot: /kube-volumes
  monitors: 10.42.253.110:6789,10.42.253.111:6789,10.42.253.112:6789

And a DeploymentConfig, deploying the CephFS provisioner:

apiVersion: v1
kind: DeploymentConfig
metadata:
  name: cephfs-provisioner
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: cephfs-provisioner
    spec:
      containers:
      - args: [ "-id=cephfs-provisioner-1" ]
        command: [ "/usr/local/bin/cephfs-provisioner" ]
        env:
        - name: PROVISIONER_NAME
          value: ceph.com/cephfs
        - name: PROVISIONER_SECRET_NAMESPACE
          value: cephfs
        image: ' '
        name: cephfs-provisioner
      serviceAccount: cephfs-provisioner
  triggers:
  - imageChangeParams:
      automatic: true
      containerNames: [ cephfs-provisioner ]
      from:
        kind: ImageStreamTag
        name: cephfs-provisioner:latest
    type: ImageChange
  - type: ConfigChange

And we should finally be able to create PersistentVolumeClaims, requesting CephFS-backed storage.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-cephfs
spec:
  accessModes: [ ReadWriteMany ]
  resources:
    requests:
      storage: 1Gi
  storageClassName: cephfs

Having registered the previous object, confirm our volume was properly provisioned:

$> oc get pvc
NAME          STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-cephfs   Bound    pvc-xxx   1G         RWX            cephfs         5h
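
Provisioning may take a few seconds, so rather than re-running that command by hand, we could poll the claim until it reports the Bound phase. The wait_pvc_bound function below is a hypothetical helper, a sketch assuming oc is logged in and pointed at the project holding the claim; the default attempt count and delay are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: poll a PVC until it reports the Bound phase.
# Arguments: claim name, max attempts (default 30), delay in seconds (default 2).
wait_pvc_bound() {
    name=$1
    attempts=${2:-30}
    delay=${3:-2}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # .status.phase holds Pending, Bound or Lost for a claim.
        phase=$(oc get pvc "$name" -o jsonpath='{.status.phase}' 2>/dev/null)
        if [ "$phase" = "Bound" ]; then
            echo "$name is Bound"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "$name still not Bound after $attempts attempts" >&2
    return 1
}
```

Calling wait_pvc_bound test-cephfs would then return successfully once our volume is provisioned, or fail after roughly a minute with the defaults.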

Then, we would create a Pod mounting that volume:

apiVersion: v1
kind: Pod
metadata:
  name: pvc-test-cephfs
spec:
  containers:
  - image: docker.io/centos/mongodb-34-centos7:latest
    name: cephfs-rwx
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETUID
        - SETGID
      privileged: false
    volumeMounts:
    - mountPath: /mnt/cephfs
      name: cephfs
  securityContext:
    seLinuxOptions:
      level: s0:c23,c2
  volumes:
  - name: cephfs
    persistentVolumeClaim:
      claimName: test-cephfs

Once that Pod has started, we should be able to enter it and write to our volume:

$ mount | grep cephfs
ceph-fuse on /mnt/cephfs type fuse.ceph-fuse (rw,nodev,relatime,user_id=0,group_id=0,allow_other)
$ date >/mnt/cephfs/toto
$ cat /mnt/cephfs/toto
Wed May 15 19:06:20 UTC 2019

At that point, we should note a non-negligible drawback: the CephFS kernel client doesn’t seem to allow reading from or writing to shares from OpenShift Pods. Strangely enough, using a shell on the OpenShift node hosting that Pod, I can successfully write files and read them back. A few months ago, this was not the case: today, it would seem OpenShift is the main culprit, and the next thing to fix.

Today, as a workaround, you would have to install ceph-fuse on all OpenShift nodes, at which point any CephFS share would be mounted using ceph-fuse instead of the Ceph kernel client.
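
To tell which client actually backs a given share, we could look at the filesystem type reported by mount. The sketch below is a hypothetical helper reading mount output on its standard input; it assumes the FUSE client shows up as fuse.ceph-fuse, as in the output shown earlier, and that the kernel client shows up with the ceph filesystem type.

```shell
#!/bin/sh
# Hypothetical helper: read `mount` output on stdin and report which CephFS
# client backs the given mount point (fuse, kernel, or not-cephfs).
cephfs_client_type() {
    # In `mount` output, field 3 is the mount point and field 5 the fs type.
    awk -v mp="$1" '
        $3 == mp && $5 == "fuse.ceph-fuse" { print "fuse"; found = 1 }
        $3 == mp && $5 == "ceph"           { print "kernel"; found = 1 }
        END { if (!found) print "not-cephfs" }
    '
}

# Example using the line our test Pod reported.
echo 'ceph-fuse on /mnt/cephfs type fuse.ceph-fuse (rw,nodev,relatime,user_id=0,group_id=0,allow_other)' \
    | cephfs_client_type /mnt/cephfs
```

From within a Pod or on a node, mount | cephfs_client_type /mnt/cephfs would give the same kind of answer for a live share.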

Bearing in mind that CephFS’s main competitor, GlusterFS, also uses a FUSE-based client, while not providing any kernel implementation, we can start inferring that Gluster is living its last days as the most popular solution offering file-based storage in OpenShift.