
Results for category "Ceph"


Ceph Crash Recovery

Today we will see the steps to replace a Ceph node, including the MON, MGR, MDS and OSD services.

While I would usually try to salvage the disks from a failing node, sometimes by dumping the filesystem with dd conv=noerror,sync iflag=fullblock to skip past I/O errors, other times using a disk duplication device, this time I am unable to salvage my node's filesystem: we will go through the steps to re-deploy a new host from scratch.
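For the record, such a dd rescue copy would look something like the following (device names are hypothetical, and a larger block size trades speed against how much data gets skipped around bad sectors):

dd if=/dev/sdX of=/dev/sdY bs=4096 conv=noerror,sync iflag=fullblock status=progress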

 

Redeploy Hypervisor

Starting from new drives, I reinstalled my physical server, using a local PXE. Using puppet, the server is then configured as a KVM hypervisor, and automount gives me access to my KVM templates. I would next deploy a CentOS 8 guest, re-using the FQDN and IP address of my previous Ceph node.

 

Ceph MON

I would start by installing the EPEL and Ceph repositories, matching my Ceph release, Octopus. Update the system, as my CentOS KVM template dates back to 8.1, and reboot. Then, we would install the Ceph packages:

dnf install -y ceph-osd python3-ceph-common ceph-mon ceph-mgr ceph-mds ceph-base python3-cephfs

For simplicity, we would disable SELinux and firewalling, matching my other nodes' configuration, though this may not be recommended in general. Next, we would retrieve the Ceph configuration from a surviving MON node:

scp -rp mon1:/etc/ceph /etc/

Confirm that we can now query the cluster, and fetch the Ceph mon. keyring:

ceph osd tree
ceph auth get mon. -o /tmp/keyring
ceph mon getmap -o /tmp/monmap

Having those, you may now re-create your MON, re-using its previous ID (in my case, mon3):

mkdir /var/lib/ceph/mon/ceph-mon3
ceph-mon -i mon3 --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-mon3

Next, re-create the systemd unit starting Ceph MON daemon:

mkdir /etc/systemd/system/ceph-mon.target.wants/
cd /etc/systemd/system/ceph-mon.target.wants/
ln -sf /usr/lib/systemd/system/ceph-mon\@.service ceph-mon\@mon3.service
systemctl daemon-reload
systemctl start ceph-mon@mon3.service
journalctl -u ceph-mon@mon3.service
ceph -s

At that stage, we should be able to see that the third monitor has re-joined our cluster.

 

Ceph OSDs

The next step would be to re-deploy our OSDs. Note that, at that stage, we should be able to re-import our existing OSD devices, assuming they were not affected by the outage. Here, I would instead proceed with fresh drives (logical volumes on my hypervisor).
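Had the original devices survived, re-importing them would roughly boil down to scanning for and re-activating the existing LVM-based OSDs, something like:

ceph-volume lvm list
ceph-volume lvm activate --all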

Let's fetch the OSD bootstrap key (which could be done with ceph auth get, as for the mon. key above) and prepare the directories for my OSDs (IDs 1 and 7):

scp -p mon1:/var/lib/ceph/bootstrap-osd/ceph.keyring /var/lib/ceph/bootstrap-osd/
mkdir -p /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-7
chown -R ceph:ceph /var/lib/ceph/bootstrap-osd /var/lib/ceph/osd

Next, I would re-create my first OSD:

ceph osd out osd.7
ceph osd destroy 7
ceph osd destroy 7 --yes-i-really-mean-it
ceph-volume lvm zap /dev/vdb
ceph-volume lvm prepare --osd-id 7 --data /dev/vdb
ceph-volume lvm activate –all
ceph osd tree
ceph -s

We should be able to confirm the OSD 7 is back up, and that Ceph is now re-balancing data. Let’s finish with the second OSD I need to recover, following the same process:

ceph osd out osd.1
ceph osd destroy 1
ceph osd destroy 1 --yes-i-really-mean-it
ceph-volume lvm zap /dev/vdc
ceph-volume lvm prepare --osd-id 1 --data /dev/vdc
ceph-volume lvm activate –all
ceph osd tree
ceph -s

At that stage, the most critical part is done. Considering I am running on commodity hardware (VMs on old HP MicroServers with barely 6 GB of RAM and a couple of CPUs each), I left Ceph alone for something like 12 hours, to avoid wasting resources on the MDS and MGR daemons, which are not strictly mandatory.
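In the meantime, recovery progress can be followed with the usual status commands:

ceph -s
ceph health detail
ceph osd pool stats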

 

Ceph MGR

With my data pretty much all re-balanced, and no more degraded or undersized PGs, I eventually redeployed the MGR service:

mkdir /var/lib/ceph/mgr/ceph-mon3
ceph auth get mgr.mon3 -o /var/lib/ceph/mgr/ceph-mon3/keyring
chown -R ceph:ceph /var/lib/ceph/mgr
mkdir /etc/systemd/system/ceph-mgr.target.wants/
cd /etc/systemd/system/ceph-mgr.target.wants/
ln -sf /usr/lib/systemd/system/ceph-mgr\@.service ceph-mgr\@mon3.service
systemctl daemon-reload
systemctl start ceph-mgr@mon3
ceph -s

 

Ceph MDS

Next, as I’m using CephFS with my Kubernetes clusters, I would need to recreate my third MDS. The process is pretty similar to that of deploying a MGR:

mkdir /var/lib/ceph/mds/ceph-mon3
ceph auth get mds.mon3 -o /var/lib/ceph/mds/ceph-mon3/keyring
chown -R ceph:ceph /var/lib/ceph/mds
mkdir /etc/systemd/system/ceph-mds.target.wants/
cd /etc/systemd/system/ceph-mds.target.wants/
ln -sf /usr/lib/systemd/system/ceph-mds\@.service ceph-mds\@mon3.service
systemctl daemon-reload
systemctl start ceph-mds@mon3
ceph -s

 

Finishing up

In my case, I've also installed and configured NRPE probes.
Note that I did not cover re-deploying RadosGW: with limited resources on my Ceph nodes, my RGW daemons are deployed as containers, in Kubernetes. Workloads that require S3 storage can then use local Kubernetes Services and Secrets with pre-provisioned RadosGW user credentials.
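For illustration (the user, secret and namespace names below are made up), provisioning a RadosGW user and handing its credentials to a workload could look like:

radosgw-admin user create --uid=myapp --display-name="My Application"
kubectl -n myapp create secret generic s3-credentials \
    --from-literal=AWS_ACCESS_KEY_ID=<access-key> \
    --from-literal=AWS_SECRET_ACCESS_KEY=<secret-key>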

Obviously, I could have used ceph-ansible to redeploy my node, though there's more fun in knowing how it works behind the curtain. I mostly use ceph-ansible when deploying clusters for customers; mine was last deployed a couple of years ago.

The time it takes to go from a freshly installed CentOS to a fully recovered Ceph node is pretty amazing. No more than one hour re-creating the Ceph daemons, and a little over 12 hours watching objects being recovered and re-balanced, for over 2 TB of data / 5 TB raw, and again: old, low-quality hardware compared with what we would deal with IRL.

From its first stable release, in the early 2010s, to the product we have today, Ceph is undeniably a success story in the open-source ecosystem. I haven't seen a corruption I couldn't easily fix in years, the maintenance cost is low, and it still runs fairly well on commodity hardware. And all that despite Red Hat's usual fuckery.

Kubernetes & Ceph on Raspberry Pi

Having recently deployed yet another Kubernetes cluster using Kubespray, experimenting with a Raspberry Pi lab, I've been looking into an issue I was having with Raspbian.

The issue is that Raspbian does not ship with the rbd kernel module, which is usually necessary to attach RBD devices from a Ceph cluster.
One way to get around this would obviously be to rebuild the kernel, though I'm usually reluctant to do so.

Digging further, it appears that as an alternative to the rbd kernel module, we may use the nbd one, which does ship with Raspbian.
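To double-check what a given kernel ships, something like the following may be used (on Raspbian, the rbd command should fail while the nbd one succeeds):

# modinfo rbd
# modinfo nbd
# modprobe nbd && lsmod | grep nbd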

 

Here is how we may proceed:

 

# apt-get install ceph-common rbd-nbd
# scp -p root@mon1:/etc/ceph/ceph.conf /etc/ceph
# scp -p root@mon1:/etc/ceph/ceph.client.admin.keyring /etc/ceph
# rbd -p kube ls
# rbd-nbd map kube/kubernetes-dynamic-pvc-d213345c-c5e8-11ea-ab48-ae4bf5a40627
# mount /dev/nbd0 /mnt
[...]
# umount /mnt
# rbd-nbd unmap /dev/nbd0

 

Now, that's a first step. The next one would be to get this working with Kubernetes.
Lately, the CSI (Container Storage Interface) has been promoted to the point that the Kubernetes scheduler image no longer ships with Ceph binaries: one controller deals with device provisioning, while another attaches and releases Ceph block devices on behalf of our nodes.

Fortunately, while it is not official yet, all of the required container images can be found by searching GitHub and DockerHub.
I did publish a copy of the configuration files required to set up Ceph RBD provisioning and device mapping on Kubernetes. Note that your nodes must run an arm64 version of the Raspbian image, which is currently in beta, though it has been working pretty well as far as I can see.
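The exact manifests are in the repository mentioned above; as a rough sketch only, a ceph-csi RBD StorageClass may look something like this (cluster ID, pool and secret names are placeholders):

cat <<EOF | kubectl apply -f-
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <ceph-fsid>
  pool: kube
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
reclaimPolicy: Delete
EOF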

 

This is yet another victory for Kubernetes, against products such as OpenShift: Lightweight, modular, portable, easy to deploy, … Efficient.

OpenShift & CephFS

If you’re not yet familiar with it, OpenShift is a container orchestration solution based on Kubernetes. Among others, it integrates with several storage providers such as Ceph.

Although GlusterFS is probably the best choice in terms of OpenShift integration, we could argue Ceph is a better pick overall. And while this post doesn't aim at offering an exhaustive comparison between the two, we could mention GlusterFS split-brains requiring manual recoveries, poor block device performance, poor performance when dealing with lots (hundreds) of volumes, the lack of a kernel-land client for file volumes, ...

The most common way to integrate Ceph with OpenShift is to register a StorageClass managing Rados Block Devices, as described in the OpenShift documentation.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  name: ceph-storage
parameters:
  adminId: kube
  adminSecretName: ceph-secret-kube
  adminSecretNamespace: default
  monitors: 10.42.253.110:6789,10.42.253.111:6789,10.42.253.112:6789
  pool: kube
  userId: kube
  userSecretName: ceph-secret-kube
  userSecretNamespace: default
provisioner: kubernetes.io/rbd
reclaimPolicy: Retain

We would also need to create a Secret, holding our Ceph client key. First, we would create our client, granting it with proper permissions:

$> ceph auth get-or-create client.kube mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=kube' -o ceph.client.kube.keyring

Next, we would base64-encode our key:

$> awk '/^[ \t]*key/{print $3}' ceph.client.kube.keyring | base64

And register our Secret, including our encoded secret:

cat <<EOF | oc apply -n default -f-

apiVersion: v1
data:
  key: <base64-encoded-string>
kind: Secret
metadata:
  name: ceph-secret-kube
type: kubernetes.io/rbd

EOF


The previous configuration would then allow us to dynamically provision block devices when deploying new applications to OpenShift.
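For instance, a hypothetical claim against that StorageClass could look like:

cat <<EOF | oc apply -n default -f-
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-rbd
spec:
  accessModes: [ ReadWriteOnce ]
  resources:
    requests:
      storage: 8Gi
  storageClassName: ceph-storage
EOF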

And while block devices are a nice thing to have when dealing with stateful workloads such as databases, up until now GlusterFS's main advantage over Ceph was its ability to provide ReadWriteMany volumes, which can be mounted from several Pods at once, as opposed to ReadWriteOnce or ReadOnlyMany volumes, which may only be accessed by one deployment, unless mounted without write capabilities.

On the other hand, in addition to Rados Block Devices, Ceph offers an optional CephFS filesystem, which is similar to NFS or GlusterFS in that several clients can concurrently write to the same folder. And while CephFS isn't much mentioned in the OpenShift documentation, Kubernetes officially supports it. Today, we would try and figure out how to make that work with OpenShift.
CephFS has been considered stable since Ceph 12 (Luminous), released a couple of years ago. Since then, I've been looking for a practical use case. Here it is.

We would mostly rely on the configurations offered in kubernetes-incubator external-storage’s GitHub repository.

First, let’s create a namespace hosting CephFS provisioner:

$> oc new-project cephfs

Then, in that namespace, we would register a Secret. Note that the CephFS provisioner offered by Kubernetes requires near-admin privileges over your Ceph cluster. For each Persistent Volume registered through the OpenShift API, the provisioner creates a dynamic user with limited privileges over the sub-directory hosting our data. Here, we would just pass it our admin key:

apiVersion: v1
kind: Secret
data:
  key: <base64-encoded-admin-key>
metadata:
  name: ceph-secret-admin
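The base64-encoded admin key may be obtained, and the Secret registered, with something like the following (the manifest file name is hypothetical):

$> ceph auth get-key client.admin | base64
$> oc apply -n cephfs -f ceph-secret-admin.yaml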

Then, we would create a ClusterRole

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cephfs-provisioner
rules:
- apiGroups: [""]
  resources: ["persistentvolumes"]
  verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "get", "delete"]
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "update"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update", "patch"]
- apiGroups: [""]
  resources: ["services"]
  resourceNames: ["kube-dns", "coredns"]
  verbs: ["list", "get"]

A Role

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cephfs-provisioner
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "get", "delete"]
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]

A ServiceAccount

$> oc create sa cephfs-provisioner

We would then associate it with the previously-defined ClusterRole and Role:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cephfs-provisioner
subjects:
- kind: ServiceAccount
  name: cephfs-provisioner
  namespace: cephfs
roleRef:
  kind: ClusterRole
  name: cephfs-provisioner
  apiGroup: rbac.authorization.k8s.io

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cephfs-provisioner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cephfs-provisioner
subjects:
- kind: ServiceAccount
  name: cephfs-provisioner

Next, we would allow our ServiceAccount to use the anyuid SecurityContextConstraint:

$> oc adm policy add-scc-to-user anyuid -z cephfs-provisioner

Then, we would create an ImageStream:

$> oc create is cephfs-provisioner

A BuildConfig, patching the cephfs-provisioner image to grant write privileges to the owning group, so that OpenShift dynamic users may use our shares:

apiVersion: v1
kind: BuildConfig
metadata:
  name: cephfs-provisioner
spec:
  output:
    to:
      kind: ImageStreamTag
      name: cephfs-provisioner:latest
  source:
    dockerfile: |
      FROM quay.io/external_storage/cephfs-provisioner:latest

      USER root

      RUN sed -i 's|0o755|0o775|g' /usr/lib/python2.7/site-packages/ceph_volume_client.py
    type: Dockerfile
  strategy:
    type: Docker
  triggers:
  - type: ConfigChange
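The ConfigChange trigger should start a first build on its own; if it doesn't, the build may be started and followed manually:

$> oc start-build cephfs-provisioner --follow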

Next, we would create a StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs
provisioner: ceph.com/cephfs
parameters:
  adminId: admin
  adminSecretName: ceph-secret-admin
  adminSecretNamespace: cephfs
  claimRoot: /kube-volumes
  monitors: 10.42.253.110:6789,10.42.253.111:6789,10.42.253.112:6789

And a DeploymentConfig, deploying the CephFS provisioner:

apiVersion: v1
kind: DeploymentConfig
metadata:
  name: cephfs-provisioner
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: cephfs-provisioner
    spec:
      containers:
      - args: [ "-id=cephfs-provisioner-1" ]
        command: [ "/usr/local/bin/cephfs-provisioner" ]
        env:
        - name: PROVISIONER_NAME
          value: ceph.com/cephfs
        - name: PROVISIONER_SECRET_NAMESPACE
          value: cephfs
        image: ' '
        name: cephfs-provisioner
      serviceAccount: cephfs-provisioner
  triggers:
  - imageChangeParams:
      automatic: true
      containerNames: [ cephfs-provisioner ]
      from:
        kind: ImageStreamTag
        name: cephfs-provisioner:latest
    type: ImageChange
  - type: ConfigChange

And we should finally be able to create PersistentVolumeClaims, requesting CephFS-backed storage.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-cephfs
spec:
  accessModes: [ ReadWriteMany ]
  resources:
    requests:
      storage: 1Gi
  storageClassName: cephfs

Having registered the previous object, confirm our volume was properly provisioned:

$> oc get pvc
NAME STATUS VOLUME CAPA ACCESS MODES STORAGECLASS AGE
test-cephfs Bound pvc-xxx 1G RWX cephfs 5h

Then, we would create a Pod mounting that volume:

apiVersion: v1
kind: Pod
metadata:
  name: pvc-test-cephfs
spec:
  containers:
  - image: docker.io/centos/mongodb-34-centos7:latest
    name: cephfs-rwx
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETUID
        - SETGID
      privileged: false
    volumeMounts:
    - mountPath: /mnt/cephfs
      name: cephfs
  securityContext:
    seLinuxOptions:
      level: s0:c23,c2
  volumes:
  - name: cephfs
    persistentVolumeClaim:
      claimName: test-cephfs

Once that Pod has started, we should be able to enter it and write to our volume:

$ mount | grep cephfs
ceph-fuse on /mnt/cephfs type fuse.ceph-fuse (rw,nodev,relatime,user_id=0,group_id=0,allow_other)
$ date >/mnt/cephfs/toto
$ cat /mnt/cephfs/toto
Wed May 15 19:06:20 UTC 2019

At that point, we should note a non-negligible drawback: the CephFS kernel client doesn't seem to allow reading from or writing to shares from OpenShift Pods. Strangely enough, using a shell on the OpenShift node hosting that Pod, I can successfully write files and read them back. A few months ago, this was not the case: today, it would seem OpenShift is the main culprit, and the next thing to fix.

Today, as a workaround, you would have to install ceph-fuse on all OpenShift nodes. At that point, any CephFS share would be mounted using ceph-fuse instead of the ceph kernel client.
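Assuming an Ansible inventory listing the OpenShift nodes (inventory and group names below are hypothetical), that package could be pushed everywhere with a one-liner such as:

$> ansible nodes -i hosts -b -m yum -a 'name=ceph-fuse state=present'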

Bearing in mind that CephFS's main competitor, GlusterFS, also uses a FUSE-based client, while not providing any kernel implementation, we can start inferring that Gluster is living its last days as the most popular solution offering file-based storage in OpenShift.

Ceph Luminous – 12

In the last few days, Ceph published Luminous 12.1.1 packages to their repositories, a release candidate of their future LTS. Having had bad experiences with their previous RC, I gave it a fresh look, dropping ceph-deploy and writing my own Ansible roles instead.

Small typo listing the RGW process when displaying Ceph health, if you can notice it

Noticeable changes in Luminous include CephFS being, allegedly, stable. I didn't test it myself yet, although I've been hearing about that feature being unstable since my first days testing Ceph, years ago.

RadosGateway multisite sync status

Another considerable change, now considered stable, is BlueStore, a replacement for Ceph FileStore (which relies on ext4, xfs or btrfs partitions). The main difference is that Ceph object storage processes no longer mount a large filesystem to store their data. Beware that recovery scripts reconstructing block devices by scanning for PG content in OSD filesystems will no longer work; it is yet unclear how a disaster recovery would go, pulling data out of an offline cluster. No surprises otherwise, so far so good.
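For reference, and assuming a 12.2.x or later release where ceph-volume is available, creating a BlueStore OSD would look something like this (the device name is hypothetical):

ceph-volume lvm create --bluestore --data /dev/sdX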

Also advertised: the RBD-mirror daemon (introduced lately) is now considered stable when running in HA. From what I've seen, it is yet unclear how to configure HA by starting several mirrors in a single cluster; I very much doubt this would work out of the box. We'll probably have to wait a little longer for the Ceph documentation to reflect the latest changes on that matter.

As I had two Ceph clusters, I could confirm the RadosGW multisite configuration works perfectly. That's not a new feature, still it's the first time I actually set it up. Buckets are eventually replicated to the remote cluster. The documentation is way more exhaustive, and it works as advertised: I'll stick to this until we learn more about RBD mirroring.
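Replication health can be checked from either zone with radosgw-admin, for instance:

radosgw-admin sync status
radosgw-admin period get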

Querying for the MON commands

Ceph RestAPI Gateway

Freshly introduced: the Ceph RestAPI Gateway. Again, we're missing some docs yet. On paper, this service should allow you to query your cluster as you would with the Ceph CLI tools, via a RESTful API. Having set one up, it isn't much complicated; I would recommend not using the built-in webserver, and instead fronting it with nginx and uwsgi. The basics on that matter can be found on GitHub.

Ceph health turns to warning, watch for un-scrubbed PGs

Even though Ceph Luminous shouldn't reach LTS before the 12.2.0 release, as of today I can confirm the Debian Stretch packages are working relatively well on a 3-MON, 3-MGR, 3-OSD, 2-RGW setup with an haproxy balancer, serving S3-like buckets. Note there is some weirdness regarding PG scrubbing though; you may need to add a cron job ... And if you consider running Ceph on commodity hardware, beware that their latest releases may be broken.
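As a rough sketch of such a job (deep-scrubbing everything at once; a real-world version would probably spread requests out over time):

ceph pg dump pgs_brief 2>/dev/null | awk '/^[0-9]+\./ {print $1}' | \
    while read pg; do ceph pg deep-scrub "$pg"; done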

Ceph Dashboard

edit: LTS released as of late August. SSE4.2 support is still mandatory when deploying your Luminous cluster, although a fix recently reached their master branch.

As of late September, the Ceph 12.2.1 release can actually be installed on older, commodity hardware.
Meanwhile, a few screenshots of the Ceph Dashboard were posted to the Ceph website, advertising that new feature.

Reweighting Ceph

Those who ran earlier versions of Ceph may be familiar with this process, as objects were not necessarily evenly distributed across the storage nodes of your cluster. Nowadays, and since somewhere around Firefly and Hammer, the default placement algorithm is way more effective on that matter.

Still, after a year running what started as a five-host / three-MON / 18-OSD cluster, and grew to eight hosts and 29 OSDs (two of them out, pending replacement), it finally happened: one of my disks' usage went up to over 90%:

Ceph disks usage, before reweight

/dev/sdd1 442G 202G 241G 46% /var/lib/ceph/osd/ceph-24
/dev/sdb1 3.7T 2.7T 982G 74% /var/lib/ceph/osd/ceph-22
/dev/sdc1 1.9T 1.7T 174G 91% /var/lib/ceph/osd/ceph-23
/dev/sde1 3.7T 2.7T 1002G 74% /var/lib/ceph/osd/ceph-7
/dev/sdc1 3.7T 2.5T 1.2T 69% /var/lib/ceph/osd/ceph-5
/dev/sdb1 472G 67M 472G 1% /var/lib/ceph/osd/ceph-4
/dev/sdd1 3.7T 73M 3.7T 1% /var/lib/ceph/osd/ceph-6
/dev/sdc1 1.9T 1.2T 718G 62% /var/lib/ceph/osd/ceph-20
/dev/sdb1 2.8T 2.0T 778G 73% /var/lib/ceph/osd/ceph-19
/dev/sda1 442G 183G 260G 42% /var/lib/ceph/osd/ceph-18
/dev/sdd1 2.8T 2.0T 749G 74% /var/lib/ceph/osd/ceph-21
/dev/sdc1 927G 493G 434G 54% /var/lib/ceph/osd/ceph-17
/dev/sda1 1.9T 1.2T 717G 62% /var/lib/ceph/osd/ceph-15
/dev/sdb1 927G 519G 408G 56% /var/lib/ceph/osd/ceph-16
/dev/sda1 461G 324G 137G 71% /var/lib/ceph/osd/ceph-8
/dev/sdb1 3.7T 2.8T 953G 75% /var/lib/ceph/osd/ceph-9
/dev/sdc1 3.7T 2.2T 1.5T 61% /var/lib/ceph/osd/ceph-10
/dev/sdd1 2.8T 1.7T 1.1T 62% /var/lib/ceph/osd/ceph-11
/dev/sdd1 3.7T 2.1T 1.6T 57% /var/lib/ceph/osd/ceph-3
/dev/sdb1 3.7T 1.9T 1.9T 51% /var/lib/ceph/osd/ceph-1
/dev/sda1 472G 306G 166G 65% /var/lib/ceph/osd/ceph-0
/dev/sdc1 3.7T 2.5T 1.2T 68% /var/lib/ceph/osd/ceph-2
/dev/sda1 461G 219G 242G 48% /var/lib/ceph/osd/ceph-25
/dev/sdb1 2.8T 1.7T 1.1T 61% /var/lib/ceph/osd/ceph-26
/dev/sdc1 3.7T 2.5T 1.2T 68% /var/lib/ceph/osd/ceph-27
/dev/sdd1 2.8T 1.5T 1.3T 55% /var/lib/ceph/osd/ceph-28
/dev/sdc1 927G 696G 231G 76% /var/lib/ceph/osd/ceph-14
/dev/sda1 1.9T 1.1T 798G 58% /var/lib/ceph/osd/ceph-12
/dev/sdb1 2.8T 2.0T 804G 72% /var/lib/ceph/osd/ceph-13

It was time to do something:

ceph:~# ceph osd reweight-by-utilization
SUCCESSFUL reweight-by-utilization: average 0.653007, overload 0.783608. reweighted: osd.23 [1.000000 -> 0.720123]
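Note that recent releases also provide a dry-run variant, handy for previewing which OSDs would be reweighted:

ceph:~# ceph osd test-reweight-by-utilization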

About two hours later, all fixed:

Ceph disks usage, after reweight

/dev/sdd1 442G 202G 241G 46% /var/lib/ceph/osd/ceph-24
/dev/sdb1 3.7T 2.8T 904G 76% /var/lib/ceph/osd/ceph-22
/dev/sdc1 1.9T 1.2T 638G 66% /var/lib/ceph/osd/ceph-23
/dev/sde1 3.7T 2.7T 976G 74% /var/lib/ceph/osd/ceph-7
/dev/sdc1 3.7T 2.5T 1.2T 69% /var/lib/ceph/osd/ceph-5
/dev/sdb1 472G 69M 472G 1% /var/lib/ceph/osd/ceph-4
/dev/sdd1 3.7T 75M 3.7T 1% /var/lib/ceph/osd/ceph-6
/dev/sdc1 1.9T 1.2T 666G 65% /var/lib/ceph/osd/ceph-20
/dev/sdb1 2.8T 2.0T 830G 71% /var/lib/ceph/osd/ceph-19
/dev/sda1 442G 183G 260G 42% /var/lib/ceph/osd/ceph-18
/dev/sdd1 2.8T 2.1T 696G 76% /var/lib/ceph/osd/ceph-21
/dev/sdc1 927G 518G 409G 56% /var/lib/ceph/osd/ceph-17
/dev/sda1 1.9T 1.2T 717G 62% /var/lib/ceph/osd/ceph-15
/dev/sdb1 927G 519G 408G 56% /var/lib/ceph/osd/ceph-16
/dev/sda1 461G 324G 137G 71% /var/lib/ceph/osd/ceph-8
/dev/sdb1 3.7T 2.8T 928G 76% /var/lib/ceph/osd/ceph-9
/dev/sdc1 3.7T 2.3T 1.4T 62% /var/lib/ceph/osd/ceph-10
/dev/sdd1 2.8T 1.7T 1.1T 62% /var/lib/ceph/osd/ceph-11
/dev/sdd1 3.7T 2.2T 1.5T 60% /var/lib/ceph/osd/ceph-3
/dev/sdb1 3.7T 1.9T 1.9T 51% /var/lib/ceph/osd/ceph-1
/dev/sda1 472G 306G 166G 65% /var/lib/ceph/osd/ceph-0
/dev/sdc1 3.7T 2.5T 1.2T 69% /var/lib/ceph/osd/ceph-2
/dev/sda1 461G 219G 242G 48% /var/lib/ceph/osd/ceph-25
/dev/sdb1 2.8T 1.7T 1.1T 62% /var/lib/ceph/osd/ceph-26
/dev/sdc1 3.7T 2.5T 1.2T 69% /var/lib/ceph/osd/ceph-27
/dev/sdd1 2.8T 1.6T 1.3T 56% /var/lib/ceph/osd/ceph-28
/dev/sdc1 927G 696G 231G 76% /var/lib/ceph/osd/ceph-14
/dev/sda1 1.9T 1.1T 798G 58% /var/lib/ceph/osd/ceph-12
/dev/sdb1 2.8T 2.0T 804G 72% /var/lib/ceph/osd/ceph-13

one year of ceph

I can't speak for IO-intensive cases, although as far as I've seen, the process of reweighting an OSD or repairing damaged placement groups blends pretty well with my usual workload.
Then again, Ceph provides ways to prioritize operations (such as backfill or recovery), allowing you to fine-tune your cluster using commands such as:

# ceph tell osd.* injectargs '--osd-max-backfills 1'
# ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
# ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
# ceph tell osd.* injectargs '--osd-client-op-priority 63'
# ceph tell osd.* injectargs '--osd-recovery-max-active 1'
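Values injected this way can be confirmed through the admin socket, on the host running the corresponding OSD, for instance:

# ceph daemon osd.0 config get osd_max_backfills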

While on the subject, one last screenshot to celebrate one year running Ceph and OpenNebula, illustrating how much crap I can hoard.

2016 upgrade

Quick post sharing a few pictures I took this February, as I finally replaced my plastic shelf with a steel rack.

I took that opportunity to add a UPS, a third PDU, and my 7th & 8th Ceph hosts.
Thanks to Ceph & OpenNebula, I haven’t had to shut down any of my services.

Bonjour Serveurs (suite)

Scaling out with Ceph

A few months ago, I installed a Ceph cluster hosting disk images for my OpenNebula cloud.
This cluster is based on 5 ProLiant N54L, each with a 60G SSD for the main filesystems, some with one 512G SSD OSD, and all with 3 disk drives from 1 to 4T. SSDs are grouped in one pool, HDDs in another.

OpenNebula Datastores View, having 5 Ceph OSD hosts

Now that most of my services are in this cluster, I'm left with very little free space.
The good news is that there is no significant impact on performance, as I was experiencing with ZFS.
The bad news is that I urgently need to add some storage space.

Last Sunday, I ordered my sixth N54L on eBay (from my "official" refurbisher, BargainHardware) and a few disks.
After receiving everything, I installed the latest Ubuntu LTS (Trusty) from my PXE, installed puppet, prepared everything, ... In about an hour, I was ready to add my disks.

I use a custom CRUSH map, with osd crush update on start set to false in my ceph.conf.
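That setting lives in the [osd] section; the relevant ceph.conf snippet looks like this:

[osd]
osd crush update on start = false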
This was the first time I tested this, and I was pleased to see I can run ceph-deploy to prepare my OSDs without them being automatically added to the default CRUSH root, which matters especially when running two pools.
From my ceph-deploy host (some Xen PV I use to host ceph-dash, munin and nagios probes related to Ceph, but with no OSD nor MON actually running), I ran the following:

# ceph-deploy install erebe
# ceph-deploy disk list erebe
# ceph-deploy disk zap erebe:sda
# ceph-deploy disk zap erebe:sdb
# ceph-deploy disk zap erebe:sdc
# ceph-deploy disk zap erebe:sdd
# ceph-deploy osd prepare erebe:sda
# ceph-deploy osd prepare erebe:sdb
# ceph-deploy osd prepare erebe:sdc
# ceph-deploy osd prepare erebe:sdd

At that point, the 4 new OSDs were up and running according to ceph status, though no data was assigned to them yet.
Next step was to update my crushmap, including these new OSDs in the proper root.

# ceph osd getcrushmap -o compiled_crush
# crushtool -d compiled_crush -o plain_crush
# vi plain_crush
# crushtool -c plain_crush -o new-crush
# ceph osd setcrushmap -i new-crush

For the record, the content of my current crush map is the following:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host nyx-hdd {
id -2 # do not change unnecessarily
# weight 10.890
alg straw
hash 0 # rjenkins1
item osd.1 weight 3.630
item osd.2 weight 3.630
item osd.3 weight 3.630
}
host eos-hdd {
id -3 # do not change unnecessarily
# weight 11.430
alg straw
hash 0 # rjenkins1
item osd.5 weight 3.630
item osd.6 weight 3.900
item osd.7 weight 3.900
}
host hemara-hdd {
id -4 # do not change unnecessarily
# weight 9.980
alg straw
hash 0 # rjenkins1
item osd.9 weight 3.630
item osd.10 weight 3.630
item osd.11 weight 2.720
}
host selene-hdd {
id -5 # do not change unnecessarily
# weight 5.430
alg straw
hash 0 # rjenkins1
item osd.12 weight 1.810
item osd.13 weight 2.720
item osd.14 weight 0.900
}
host helios-hdd {
id -6 # do not change unnecessarily
# weight 3.050
alg straw
hash 0 # rjenkins1
item osd.15 weight 1.600
item osd.16 weight 0.700
item osd.17 weight 0.750
}
host erebe-hdd {
id -7 # do not change unnecessarily
# weight 7.250
alg straw
hash 0 # rjenkins1
item osd.19 weight 2.720
item osd.20 weight 1.810
item osd.21 weight 2.720
}
root hdd {
id -1 # do not change unnecessarily
# weight 40.780
alg straw
hash 0 # rjenkins1
item nyx-hdd weight 10.890
item eos-hdd weight 11.430
item hemara-hdd weight 9.980
item selene-hdd weight 5.430
item helios-hdd weight 3.050
item erebe-hdd weight 7.250
}
host nyx-ssd {
id -42 # do not change unnecessarily
# weight 0.460
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.460
}
host eos-ssd {
id -43 # do not change unnecessarily
# weight 0.460
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.460
}
host hemara-ssd {
id -44 # do not change unnecessarily
# weight 0.450
alg straw
hash 0 # rjenkins1
item osd.8 weight 0.450
}
host erebe-ssd {
id -45 # do not change unnecessarily
# weight 0.450
alg straw
hash 0 # rjenkins1
item osd.18 weight 0.450
}
root ssd {
id -41 # do not change unnecessarily
# weight 3.000
alg straw
hash 0 # rjenkins1
item nyx-ssd weight 1.000
item eos-ssd weight 1.000
item hemara-ssd weight 1.000
item erebe-ssd weight 1.000
}
# rules
rule hdd {
ruleset 0
type replicated
min_size 1
max_size 10
step take hdd
step chooseleaf firstn 0 type host
step emit
}
rule ssd {
ruleset 1
type replicated
min_size 1
max_size 10
step take ssd
step chooseleaf firstn 0 type host
step emit
}
# end crush map

Applying the new crush map started a 20-hour process, moving placement groups around.

OpenNebula Datastores View, having 6 Ceph OSD hosts

# ceph-diskspace
/dev/sdc1 3.7T 2.0T 1.7T 55% /var/lib/ceph/osd/ceph-6
/dev/sda1 472G 330G 143G 70% /var/lib/ceph/osd/ceph-4
/dev/sdb1 3.7T 2.4T 1.4T 64% /var/lib/ceph/osd/ceph-5
/dev/sdd1 3.7T 2.4T 1.4T 65% /var/lib/ceph/osd/ceph-7
/dev/sda1 442G 329G 114G 75% /var/lib/ceph/osd/ceph-18
/dev/sdb1 2.8T 2.1T 668G 77% /var/lib/ceph/osd/ceph-19
/dev/sdc1 1.9T 1.3T 593G 69% /var/lib/ceph/osd/ceph-20
/dev/sdd1 2.8T 2.0T 808G 72% /var/lib/ceph/osd/ceph-21
/dev/sdc1 927G 562G 365G 61% /var/lib/ceph/osd/ceph-17
/dev/sdb1 927G 564G 363G 61% /var/lib/ceph/osd/ceph-16
/dev/sda1 1.9T 1.2T 630G 67% /var/lib/ceph/osd/ceph-15
/dev/sdb1 3.7T 2.8T 935G 75% /var/lib/ceph/osd/ceph-9
/dev/sdd1 2.8T 1.4T 1.4T 50% /var/lib/ceph/osd/ceph-11
/dev/sda1 461G 274G 187G 60% /var/lib/ceph/osd/ceph-8
/dev/sdc1 3.7T 2.2T 1.5T 60% /var/lib/ceph/osd/ceph-10
/dev/sdc1 3.7T 1.9T 1.8T 52% /var/lib/ceph/osd/ceph-1
/dev/sde1 3.7T 2.0T 1.7T 54% /var/lib/ceph/osd/ceph-3
/dev/sdd1 3.7T 2.3T 1.5T 62% /var/lib/ceph/osd/ceph-2
/dev/sdb1 472G 308G 165G 66% /var/lib/ceph/osd/ceph-0
/dev/sdb1 1.9T 1.2T 673G 64% /var/lib/ceph/osd/ceph-12
/dev/sdd1 927G 580G 347G 63% /var/lib/ceph/osd/ceph-14
/dev/sdc1 2.8T 2.0T 813G 71% /var/lib/ceph/osd/ceph-13

I'm really satisfied by the way Ceph is constantly improving their product.
Having discussed with several interviewers in the last few weeks, I still have to explain why Ceph RBD is not to be confused with CephFS, and that even if the latter may not be production-ready, RADOS storage may be just what you're looking for to distribute your storage.

Don’t trust the Tahr

Beware that since the latest Ubuntu kernel upgrades (14.04.02), you may lose network connectivity when rebooting your servers!

I had the problem four days ago, rebooting one of my OpenNebula hosts. Still unreachable after 5 minutes, I logged in physically, to see that all my "p1pX" and "p4pX" interfaces had disappeared.
Checking udev rules, there is now a file fixing interface mappings. On a server I have not rebooted yet, this file doesn't exist.

The story could have ended here. But with Ubuntu, updates are a daily struggle: today, one of my Ceph OSD hosts (hosting 4 disks) spontaneously stopped working.
Meaning: the host was still there, and I was able to open a shell using SSH. Checking processes, all Ceph OSD daemons were stopped. Starting them showed no error, while the processes remained absent. Checking dmesg, I had several lines of SSL-related segfaults.
As expected, rebooting fixed everything, from Ceph to my network interface names.
It's in these days I most enjoy freelancing: I can address my system and network outages in time, way before it's too late.

While I was starting to accept Ubuntu as safe enough to run production services, renaming interfaces on a production system is unacceptable. I'm curious to know how Canonical dealt with that while providing BootStack and OpenStack-based services.

Note there is still a way to prevent your interfaces from being renamed:

# ln -s /dev/null /etc/udev/rules.d/75-persistent-net-generator.rules

OwnCloud & Pydio

You may have heard of OwnCloud at least, if you're not using it already.

Thanks to a very fine web frontend, and several standalone clients allowing you to use your shares as a network file system, OwnCloud is user friendly, and can be trusted hosting hundreds of accounts, if not thousands.
The solution was installed at Smile by Thibaut (59pilgrim). I didn't mind that much; back then, I was using Pydio, and was pretty satisfied already. We had around 700 users, not all of them active, yet I could see the whole thing was pretty reliable.

Pydio is a good candidate to compare with OwnCloud. Both offer pretty much the same services. OwnCloud has lots of apps to do everything, Pydio has plugins. Both are PHP-based open-source projects, with fairly active communities.
Small advantage to OwnCloud though, with its native S3 connector. And arguably, a better Linux client and web experience.

Recently, disappointed by Pydio (something about having \n in file names, preventing files from being uploaded to my Pydio server), I gave OwnCloud a shot.
I haven't lost hope in Pydio yet, but OwnCloud is definitely easier to deal with: I could even recommend it to a novice Linux enthusiast.

Ceph Disk Failure

Last week, one of my five Ceph hosts was unreachable.
Investigating, I noticed the OSD daemons were still running. Only the daemons relying on the root file system were either crashed (the local Ceph MON daemon) or unable to process requests (the SSH daemon was still answering, cleanly closing the connection).

After rebooting the system and looking at the logs, I could see a lot of I/O errors. I left the console logged in as root, waiting for the next occurrence.
Having no spare 60GB SSD, I ordered one.

Two days later, the same problem occurred. From the console, I was unable to run anything (mostly segfaults and ENOENT).
Again, I was able to reboot. This time, I dropped a couple LVMs, unmounted the swap partition, and resized my VG to make sure I had a fair amount of unallocated space on my faulty disk.

The problem persisted, while average uptime was significantly getting lower.
I progressively disabled local OSSEC daemon, puppet, a few crontabs, collectd, munin, … only keeping ceph, nagios and ssh running. The problem kept happening, every 12 to 48 hours.

This morning, the server wasn’t even able to boot.
Checking the BIOS, my root SSD wasn’t detected.
Attaching it to some USB dock, I had to wait a couple of minutes before the disk was actually detected by my laptop (Ubuntu 14.04.02) and my desktop (Debian 7.8).
I caught a break when receiving my new disk at 11 AM.
Running dd from the faulty disk to the new one took around 50 minutes (20MB/s, I can’t believe it!).
Syncing the OSDs (1x 512G SSD, 2x 4T & 1x 3T HDD) after 8 hours of downtime took around half an hour. Knowing I run a fairly busy mail server, some NNTP index, ..., this is another tangible improvement brought by Hammer over Firefly.

I’m now preparing to send the faulty disk to my re-seller, for replacement. At least, I would have one handy, for the next failure.

Moral of the story: cheap is unreasonable. Better be lucky.