Results for category "Ceph"

11 Articles

OpenShift & CephFS

If you’re not yet familiar with it, OpenShift is a container orchestration solution based on Kubernetes. Among other things, it integrates with several storage providers, such as Ceph.

Although GlusterFS is probably the best choice in terms of OpenShift integration, we could argue Ceph is a better pick overall. And while this post doesn’t aim to offer an exhaustive comparison between the two, we could mention GlusterFS split-brains requiring manual recovery, poor block device performance, poor performance when dealing with lots (hundreds) of volumes, the lack of a kernel-space client for file volumes, …

The most common way to integrate Ceph with OpenShift is to register a StorageClass managing RADOS Block Devices, as described in the OpenShift documentation.

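apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: ceph-storage
parameters:
  adminId: kube
  adminSecretName: ceph-secret-kube
  adminSecretNamespace: default
  monitors: <ceph-mon-host>:6789
  pool: kube
  userId: kube
  userSecretName: ceph-secret-kube
  userSecretNamespace: default
provisioner: kubernetes.io/rbd
reclaimPolicy: Retain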

We would also need to create a Secret holding our Ceph client key. First, we would create our client, granting it the proper permissions:

$> ceph auth get-or-create client.kube mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=kube' -o ceph.client.kube.keyring

Next, we would base64-encode our key:

$> awk '/^[ \t]*key/{print $3}' ceph.client.kube.keyring | base64
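As a side note on the pipeline above: awk’s print appends a newline, which then ends up inside the base64-encoded value. Below is a minimal sketch of a newline-free variant, assuming the keyring holds the usual "key = <value>" line. The function name keyring_key_b64 is made up for this example and is not part of any Ceph or OpenShift tooling.

```shell
# keyring_key_b64 KEYRING_FILE
# Prints the base64-encoded client key found in a Ceph keyring file.
# printf "%s" (instead of print) keeps the trailing newline out of the encoded value.
keyring_key_b64() {
    awk '/^[ \t]*key/ { printf "%s", $3 }' "$1" | base64
}

# Usage (hypothetical path):
#   keyring_key_b64 ceph.client.kube.keyring
```

Whether the trailing newline actually matters depends on how the consumer handles the decoded secret, so treat this as a precaution rather than a required fix.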

And register our Secret, including our encoded secret:

cat <<EOF | oc apply -n default -f-
apiVersion: v1
data:
  key: <base64-encoded-string>
kind: Secret
metadata:
  name: ceph-secret-kube
type: kubernetes.io/rbd
EOF


The previous configuration would then allow us to dynamically provision block devices when deploying new applications to OpenShift.

And while block devices are nice to have when dealing with stateful workloads such as databases, up until now, GlusterFS’s main advantage over Ceph was its ability to provide ReadWriteMany volumes, which can be mounted read-write from several Pods at once, as opposed to ReadWriteOnce volumes, which may only be mounted read-write by a single node, or ReadOnlyMany volumes, which can be shared but only without write capabilities.

On the other hand, in addition to RADOS Block Devices, Ceph offers an optional CephFS share, similar to NFS or GlusterFS in that several clients can concurrently write to the same folder. And while CephFS isn’t mentioned much in the OpenShift documentation, Kubernetes officially supports it. Today, we will try to figure out how to make that work with OpenShift.
CephFS has been considered stable since Ceph 12 (Luminous), released a couple of years ago. Since then, I’ve been looking for a practical use case. Here it is.

We would mostly rely on the configurations offered in kubernetes-incubator external-storage’s GitHub repository.

First, let’s create a namespace hosting CephFS provisioner:

$> oc new-project cephfs

Then, in that namespace, we would register a Secret. Note that the CephFS provisioner offered by Kubernetes requires near-admin privileges over your Ceph cluster. For each Persistent Volume registered through the OpenShift API, the provisioner would create a dynamic user with limited privileges over the sub-directory hosting our data. Here, we would just pass it our admin key:

apiVersion: v1
kind: Secret
data:
  key: <base64-encoded-admin-key>
metadata:
  name: ceph-secret-admin

Then, we would create a ClusterRole

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cephfs-provisioner
rules:
- apiGroups: [""]
  resources: ["persistentvolumes"]
  verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "get", "delete"]
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "update"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update", "patch"]
- apiGroups: [""]
  resources: ["services"]
  resourceNames: ["kube-dns", "coredns"]
  verbs: ["list", "get"]

A Role

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cephfs-provisioner
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "get", "delete"]
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]

A ServiceAccount

$> oc create sa cephfs-provisioner

That we would associate with previously-defined ClusterRole and Role:

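apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cephfs-provisioner
subjects:
- kind: ServiceAccount
  name: cephfs-provisioner
  namespace: cephfs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cephfs-provisioner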

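apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cephfs-provisioner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cephfs-provisioner
subjects:
- kind: ServiceAccount
  name: cephfs-provisioner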

Next, we would allow our ServiceAccount to use the anyuid SecurityContextConstraint:

$> oc adm policy add-scc-to-user anyuid -z cephfs-provisioner

Then, we would create an ImageStream:

$> oc create is cephfs-provisioner

And a BuildConfig patching the cephfs-provisioner image, granting write privileges to the owning group, so that OpenShift dynamic users may use our shares:

apiVersion: v1
kind: BuildConfig
metadata:
  name: cephfs-provisioner
spec:
  output:
    to:
      kind: ImageStreamTag
      name: cephfs-provisioner:latest
  source:
    dockerfile: |

      USER root

      RUN sed -i 's|0o755|0o775|g' /usr/lib/python2.7/site-packages/
    type: Dockerfile
  strategy:
    type: Docker
  triggers:
  - type: ConfigChange

Next, we would create a StorageClass:

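apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs
parameters:
  adminId: admin
  adminSecretName: ceph-secret-admin
  adminSecretNamespace: cephfs
  claimRoot: /kube-volumes
  monitors: <ceph-mon-host>:6789
provisioner: cephfs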

And a DeploymentConfig, deploying the CephFS provisioner:

apiVersion: v1
kind: DeploymentConfig
metadata:
  name: cephfs-provisioner
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: cephfs-provisioner
    spec:
      containers:
      - args: [ "-id=cephfs-provisioner-1" ]
        command: [ "/usr/local/bin/cephfs-provisioner" ]
        env:
        - name: PROVISIONER_NAME
          value: cephfs
        image: ' '
        name: cephfs-provisioner
      serviceAccount: cephfs-provisioner
  triggers:
  - imageChangeParams:
      automatic: true
      containerNames: [ cephfs-provisioner ]
      from:
        kind: ImageStreamTag
        name: cephfs-provisioner:latest
    type: ImageChange
  - type: ConfigChange

And we should finally be able to create PersistentVolumeClaims, requesting CephFS-backed storage.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-cephfs
spec:
  accessModes: [ ReadWriteMany ]
  resources:
    requests:
      storage: 1Gi
  storageClassName: cephfs

Having registered the previous object, confirm our volume was properly provisioned:

$> oc get pvc
test-cephfs Bound pvc-xxx 1G RWX cephfs 5h

Then, we would create a Pod mounting that volume:

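apiVersion: v1
kind: Pod
metadata:
  name: pvc-test-cephfs
spec:
  containers:
  - image:
    name: cephfs-rwx
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETUID
        - SETGID
      privileged: false
    volumeMounts:
    - mountPath: /mnt/cephfs
      name: cephfs
  securityContext:
    seLinuxOptions:
      level: s0:c23,c2
  volumes:
  - name: cephfs
    persistentVolumeClaim:
      claimName: test-cephfs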

Once that Pod has started, we should be able to enter it and write to our volume:

$ mount | grep cephfs
ceph-fuse on /mnt/cephfs type fuse.ceph-fuse (rw,nodev,relatime,user_id=0,group_id=0,allow_other)
$ date >/mnt/cephfs/toto
$ cat /mnt/cephfs/toto
Wed May 15 19:06:20 UTC 2019

At that point, we should note a non-negligible drawback: the CephFS kernel client doesn’t seem to allow reading from or writing to shares from OpenShift Pods. Strangely enough, using a shell on the OpenShift node hosting that Pod, I can successfully write files and read them back. A few months ago, this was not the case: today, it would seem OpenShift is the main culprit, and the next thing to fix.

Today, as a workaround, you would have to install ceph-fuse on all OpenShift nodes. At which point, any CephFS share would be mounted using ceph-fuse, instead of the Ceph kernel client.

Bearing in mind that CephFS’s main competitor, GlusterFS, also uses a FUSE-based client, while not providing any kernel implementation, we can start inferring that Gluster is living its last days as the most popular solution offering file-based storage in OpenShift.

Ceph Luminous – 12

In the last few days, Ceph published Luminous 12.1.1 packages to their repositories, a release candidate of their future LTS. Having had bad experiences with their previous RC, I gave it a fresh look, dropping ceph-deploy and writing my own Ansible roles instead.

typo listing RGW process

Small typo displaying Ceph Health, if you can notice it

Noticeable changes in Luminous include CephFS being, allegedly, stable. I haven’t tested it myself yet, although I’ve been hearing about that feature being unstable since my first days testing Ceph, years ago.

RadosGateway multisite sync status


Another considerable change that showed up, and is now considered stable, is BlueStore, a replacement for the Ceph FileStore backend (which relies on ext4, xfs or btrfs partitions). The main change is that Ceph OSD processes no longer mount a large filesystem to store their data. Beware that recovery scripts reconstructing block devices by scanning for PG content in OSD filesystems will no longer work: it is as yet unclear how disaster recovery would work when recovering data from an offline cluster. No surprises otherwise, so far so good.

Also advertised: the RBD-mirror daemon (introduced recently) is now considered stable when running in HA. From what I’ve seen, it is as yet unclear how to configure HA when starting several mirrors in a single cluster, and I very much doubt this would work out of the box. We’ll probably have to wait a little longer for the Ceph documentation to reflect the latest changes on that matter.

As I had 2 Ceph clusters, I could confirm that the RadosGW Multisite configuration works perfectly. That’s not a new feature, still it’s the first time I actually set this up. Buckets are eventually replicated to the remote cluster. The documentation is way more exhaustive and it works as advertised: I’ll stick to this, until we learn more about RBD mirroring.

Querying for the MON commands

Ceph RestAPI Gateway

Freshly introduced: the Ceph RestAPI Gateway. Again, we’re still missing some docs. On paper, this service should allow you to query your cluster as you would with the Ceph CLI tools, via a RESTful API. Having set one up, it isn’t very complicated. I would recommend not using the built-in webserver, and using nginx and uwsgi instead. The basics on that matter can be found on GitHub.

Ceph health turns to warning, watch for un-scrubbed PGs


Even though Ceph Luminous shouldn’t reach LTS before its 12.2.0 release, as of today, I can confirm the Debian Stretch packages are working relatively well on a 3-MON, 3-MGR, 3-OSD, 2-RGW setup with an haproxy balancer, serving S3-like buckets. Note, though, that there is some weirdness regarding PG scrubbing: you may need to add a cron job … And if you are considering running Ceph on commodity hardware, be aware that the latest releases may be broken.


Ceph Dashboard

edit: LTS released as of late August. SSE4.2 support is still mandatory when deploying your Luminous cluster, although a fix recently reached their master branch, …

As of late September, the Ceph 12.2.1 release can actually be installed on older, commodity hardware.
Meanwhile, a few screenshots of the Ceph Dashboard were posted to the Ceph website, advertising that new feature.

Reweighting Ceph

Those familiar with earlier versions of Ceph may know that process, as objects were not necessarily evenly distributed across the storage nodes of your cluster. Nowadays, and since somewhere around Firefly and Hammer, the default placement algorithm is way more effective on that matter.

Still, after a year running what started as a five hosts/three MONs/18 OSDs cluster, and grew up to eight hosts and 29 OSDs, two of them out, pending replacement, it finally happened: the usage of one of my disks went over 90%:

Ceph disks usage, before reweight


/dev/sdd1 442G 202G 241G 46% /var/lib/ceph/osd/ceph-24
/dev/sdb1 3.7T 2.7T 982G 74% /var/lib/ceph/osd/ceph-22
/dev/sdc1 1.9T 1.7T 174G 91% /var/lib/ceph/osd/ceph-23
/dev/sde1 3.7T 2.7T 1002G 74% /var/lib/ceph/osd/ceph-7
/dev/sdc1 3.7T 2.5T 1.2T 69% /var/lib/ceph/osd/ceph-5
/dev/sdb1 472G 67M 472G 1% /var/lib/ceph/osd/ceph-4
/dev/sdd1 3.7T 73M 3.7T 1% /var/lib/ceph/osd/ceph-6
/dev/sdc1 1.9T 1.2T 718G 62% /var/lib/ceph/osd/ceph-20
/dev/sdb1 2.8T 2.0T 778G 73% /var/lib/ceph/osd/ceph-19
/dev/sda1 442G 183G 260G 42% /var/lib/ceph/osd/ceph-18
/dev/sdd1 2.8T 2.0T 749G 74% /var/lib/ceph/osd/ceph-21
/dev/sdc1 927G 493G 434G 54% /var/lib/ceph/osd/ceph-17
/dev/sda1 1.9T 1.2T 717G 62% /var/lib/ceph/osd/ceph-15
/dev/sdb1 927G 519G 408G 56% /var/lib/ceph/osd/ceph-16
/dev/sda1 461G 324G 137G 71% /var/lib/ceph/osd/ceph-8
/dev/sdb1 3.7T 2.8T 953G 75% /var/lib/ceph/osd/ceph-9
/dev/sdc1 3.7T 2.2T 1.5T 61% /var/lib/ceph/osd/ceph-10
/dev/sdd1 2.8T 1.7T 1.1T 62% /var/lib/ceph/osd/ceph-11
/dev/sdd1 3.7T 2.1T 1.6T 57% /var/lib/ceph/osd/ceph-3
/dev/sdb1 3.7T 1.9T 1.9T 51% /var/lib/ceph/osd/ceph-1
/dev/sda1 472G 306G 166G 65% /var/lib/ceph/osd/ceph-0
/dev/sdc1 3.7T 2.5T 1.2T 68% /var/lib/ceph/osd/ceph-2
/dev/sda1 461G 219G 242G 48% /var/lib/ceph/osd/ceph-25
/dev/sdb1 2.8T 1.7T 1.1T 61% /var/lib/ceph/osd/ceph-26
/dev/sdc1 3.7T 2.5T 1.2T 68% /var/lib/ceph/osd/ceph-27
/dev/sdd1 2.8T 1.5T 1.3T 55% /var/lib/ceph/osd/ceph-28
/dev/sdc1 927G 696G 231G 76% /var/lib/ceph/osd/ceph-14
/dev/sda1 1.9T 1.1T 798G 58% /var/lib/ceph/osd/ceph-12
/dev/sdb1 2.8T 2.0T 804G 72% /var/lib/ceph/osd/ceph-13

It was time to do something:

ceph:~# ceph osd reweight-by-utilization
SUCCESSFUL reweight-by-utilization: average 0.653007, overload 0.783608. reweighted: osd.23 [1.000000 -> 0.720123]
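
The numbers in that output can be cross-checked by hand. As far as I understand it, reweight-by-utilization flags OSDs whose utilization exceeds the average by a threshold (120% when none is given), and scales their weight by average / utilization. The sketch below only redoes that arithmetic; the function names are hypothetical, and this is not Ceph’s actual code path.

```shell
# overload_threshold AVERAGE [THRESHOLD_PERCENT]
# Utilization above which an OSD is considered overloaded (threshold defaults to 120).
overload_threshold() {
    awk -v avg="$1" -v pct="${2:-120}" 'BEGIN { printf "%.6f\n", avg * pct / 100 }'
}

# suggested_weight CURRENT_WEIGHT AVERAGE OSD_UTILIZATION
# New reweight value for an overloaded OSD, scaled by average / utilization.
suggested_weight() {
    awk -v w="$1" -v avg="$2" -v util="$3" 'BEGIN { printf "%.6f\n", w * avg / util }'
}

# Usage, with the average reported above:
#   overload_threshold 0.653007
#   suggested_weight 1 0.653007 0.91
```

With the average reported above, overload_threshold returns the same 0.783608 overload value, and an OSD sitting at about 91% ends up with a weight close to 0.72, in line with what osd.23 was assigned.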

About two hours later, all fixed:

Ceph disks usage, after reweight


/dev/sdd1 442G 202G 241G 46% /var/lib/ceph/osd/ceph-24
/dev/sdb1 3.7T 2.8T 904G 76% /var/lib/ceph/osd/ceph-22
/dev/sdc1 1.9T 1.2T 638G 66% /var/lib/ceph/osd/ceph-23
/dev/sde1 3.7T 2.7T 976G 74% /var/lib/ceph/osd/ceph-7
/dev/sdc1 3.7T 2.5T 1.2T 69% /var/lib/ceph/osd/ceph-5
/dev/sdb1 472G 69M 472G 1% /var/lib/ceph/osd/ceph-4
/dev/sdd1 3.7T 75M 3.7T 1% /var/lib/ceph/osd/ceph-6
/dev/sdc1 1.9T 1.2T 666G 65% /var/lib/ceph/osd/ceph-20
/dev/sdb1 2.8T 2.0T 830G 71% /var/lib/ceph/osd/ceph-19
/dev/sda1 442G 183G 260G 42% /var/lib/ceph/osd/ceph-18
/dev/sdd1 2.8T 2.1T 696G 76% /var/lib/ceph/osd/ceph-21
/dev/sdc1 927G 518G 409G 56% /var/lib/ceph/osd/ceph-17
/dev/sda1 1.9T 1.2T 717G 62% /var/lib/ceph/osd/ceph-15
/dev/sdb1 927G 519G 408G 56% /var/lib/ceph/osd/ceph-16
/dev/sda1 461G 324G 137G 71% /var/lib/ceph/osd/ceph-8
/dev/sdb1 3.7T 2.8T 928G 76% /var/lib/ceph/osd/ceph-9
/dev/sdc1 3.7T 2.3T 1.4T 62% /var/lib/ceph/osd/ceph-10
/dev/sdd1 2.8T 1.7T 1.1T 62% /var/lib/ceph/osd/ceph-11
/dev/sdd1 3.7T 2.2T 1.5T 60% /var/lib/ceph/osd/ceph-3
/dev/sdb1 3.7T 1.9T 1.9T 51% /var/lib/ceph/osd/ceph-1
/dev/sda1 472G 306G 166G 65% /var/lib/ceph/osd/ceph-0
/dev/sdc1 3.7T 2.5T 1.2T 69% /var/lib/ceph/osd/ceph-2
/dev/sda1 461G 219G 242G 48% /var/lib/ceph/osd/ceph-25
/dev/sdb1 2.8T 1.7T 1.1T 62% /var/lib/ceph/osd/ceph-26
/dev/sdc1 3.7T 2.5T 1.2T 69% /var/lib/ceph/osd/ceph-27
/dev/sdd1 2.8T 1.6T 1.3T 56% /var/lib/ceph/osd/ceph-28
/dev/sdc1 927G 696G 231G 76% /var/lib/ceph/osd/ceph-14
/dev/sda1 1.9T 1.1T 798G 58% /var/lib/ceph/osd/ceph-12
/dev/sdb1 2.8T 2.0T 804G 72% /var/lib/ceph/osd/ceph-13

one year of ceph


I can’t speak for IO-intensive cases, although as far as I’ve seen, the process of reweighting an OSD or repairing damaged placement groups blends pretty well with my usual workload.
Then again, Ceph provides ways to prioritize operations (such as backfill or recovery), allowing you to fine-tune your cluster, using commands such as:

# ceph tell osd.* injectargs '--osd-max-backfills 1'
# ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
# ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
# ceph tell osd.* injectargs '--osd-client-op-priority 63'
# ceph tell osd.* injectargs '--osd-recovery-max-active 1'

While on the subject, last screenshot to celebrate one year running Ceph and OpenNebula, illustrating how much crap I can hoard.

2016 upgrade

Quick post sharing a few pictures I took this February, as I finally replaced my plastic shelf with a steel rack.

I took that opportunity to add a UPS, a third PDU, and my 7th & 8th Ceph hosts.
Thanks to Ceph & OpenNebula, I haven’t had to shut down any of my services.

Bonjour Serveurs (suite)

Scaling out with Ceph

A few months ago, I installed a Ceph cluster hosting disk images, for my OpenNebula cloud.
This cluster is based on 5 ProLiant N54L, each with a 60G SSD for the main filesystems, some with one 512G SSD OSD, all with 3 disk drives from 1 to 4T. SSDs are grouped in one pool, HDDs in another.

OpenNebula Datastores View - before

OpenNebula Datastores View, having 5 Ceph OSD hosts

Now that most of my services are in this cluster, I’m left with very little free space.
The good news is there is no significant impact on performance, unlike what I was experiencing with ZFS.
The bad news is that I urgently need to add some storage space.

Last Sunday, I ordered my sixth N54L on eBay (from my “official” refurbish-er, BargainHardware) and a few disks.
After receiving everything, I installed the latest Ubuntu LTS (Trusty) from my PXE, installed puppet, prepared everything, … In about an hour, I was ready to add my disks.

I use a custom crush map, with “osd crush update on start” set to false in my ceph.conf.
This was the first time I tested this, and I was pleased to see I could run ceph-deploy to prepare my OSDs without automatically adding them to the default CRUSH root, which matters especially when having two pools.
From my ceph-deploy host (a Xen PV I use for hosting ceph-dash, munin and nagios probes related to ceph, but with no OSD nor MON actually running), I ran the following:

# ceph-deploy install erebe
# ceph-deploy disk list erebe
# ceph-deploy disk zap erebe:sda
# ceph-deploy disk zap erebe:sdb
# ceph-deploy disk zap erebe:sdc
# ceph-deploy disk zap erebe:sdd
# ceph-deploy osd prepare erebe:sda
# ceph-deploy osd prepare erebe:sdb
# ceph-deploy osd prepare erebe:sdc
# ceph-deploy osd prepare erebe:sdd

At that point, the 4 new OSDs were up and running according to ceph status, though no data was assigned to them.
The next step was to update my crushmap, including these new OSDs in the proper root.

# ceph osd getcrushmap -o compiled_crush
# crushtool -d compiled_crush -o plain_crush
# vi plain_crush
# crushtool -c plain_crush -o new-crush
# ceph osd setcrushmap -i new-crush

For the record, the content of my current crush map is the following:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host nyx-hdd {
    id -2 # do not change unnecessarily
    # weight 10.890
    alg straw
    hash 0 # rjenkins1
    item osd.1 weight 3.630
    item osd.2 weight 3.630
    item osd.3 weight 3.630
}
host eos-hdd {
    id -3 # do not change unnecessarily
    # weight 11.430
    alg straw
    hash 0 # rjenkins1
    item osd.5 weight 3.630
    item osd.6 weight 3.900
    item osd.7 weight 3.900
}
host hemara-hdd {
    id -4 # do not change unnecessarily
    # weight 9.980
    alg straw
    hash 0 # rjenkins1
    item osd.9 weight 3.630
    item osd.10 weight 3.630
    item osd.11 weight 2.720
}
host selene-hdd {
    id -5 # do not change unnecessarily
    # weight 5.430
    alg straw
    hash 0 # rjenkins1
    item osd.12 weight 1.810
    item osd.13 weight 2.720
    item osd.14 weight 0.900
}
host helios-hdd {
    id -6 # do not change unnecessarily
    # weight 3.050
    alg straw
    hash 0 # rjenkins1
    item osd.15 weight 1.600
    item osd.16 weight 0.700
    item osd.17 weight 0.750
}
host erebe-hdd {
    id -7 # do not change unnecessarily
    # weight 7.250
    alg straw
    hash 0 # rjenkins1
    item osd.19 weight 2.720
    item osd.20 weight 1.810
    item osd.21 weight 2.720
}
root hdd {
    id -1 # do not change unnecessarily
    # weight 40.780
    alg straw
    hash 0 # rjenkins1
    item nyx-hdd weight 10.890
    item eos-hdd weight 11.430
    item hemara-hdd weight 9.980
    item selene-hdd weight 5.430
    item helios-hdd weight 3.050
    item erebe-hdd weight 7.250
}
host nyx-ssd {
    id -42 # do not change unnecessarily
    # weight 0.460
    alg straw
    hash 0 # rjenkins1
    item osd.0 weight 0.460
}
host eos-ssd {
    id -43 # do not change unnecessarily
    # weight 0.460
    alg straw
    hash 0 # rjenkins1
    item osd.4 weight 0.460
}
host hemara-ssd {
    id -44 # do not change unnecessarily
    # weight 0.450
    alg straw
    hash 0 # rjenkins1
    item osd.8 weight 0.450
}
host erebe-ssd {
    id -45 # do not change unnecessarily
    # weight 0.450
    alg straw
    hash 0 # rjenkins1
    item osd.18 weight 0.450
}
root ssd {
    id -41 # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0 # rjenkins1
    item nyx-ssd weight 1.000
    item eos-ssd weight 1.000
    item hemara-ssd weight 1.000
    item erebe-ssd weight 1.000
}
# rules
rule hdd {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take ssd
    step chooseleaf firstn 0 type host
    step emit
}
# end crush map
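
When editing such a map by hand, it is easy to let the "# weight" comments drift away from the item weights. Here is a minimal sketch that sums the item weights of each bucket in a decompiled crush map read from standard input. The function name crush_bucket_weights is hypothetical, and it assumes the "<type> <name> {" and "item <name> weight <value>" line shapes shown above.

```shell
# crush_bucket_weights < plain_crush
# Prints "<bucket> <sum of its item weights>" for every bucket, in file order.
crush_bucket_weights() {
    awk '
        # A bucket opens with "<type> <name> {", e.g. "host nyx-hdd {"; rules are skipped.
        $3 == "{" && $1 != "rule" { name = $2; order[++n] = name; next }
        # Item lines read "item <name> weight <value>".
        $1 == "item" && $3 == "weight" { sum[name] += $4 }
        END { for (i = 1; i <= n; i++) printf "%s %.3f\n", order[i], sum[order[i]] }
    '
}

# Usage:
#   crush_bucket_weights < plain_crush
```

Applied to the map above, the items of the hdd root add up to 48.030 and those of the ssd root to 4.000, while the comments still read 40.780 and 3.000, presumably the values from before erebe was added. These are only comments, so nothing breaks.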

Applying the new crush map started a 20-hour process, moving placement groups.

OpenNebula Datastores View - after

OpenNebula Datastores View, having 6 Ceph OSD hosts

# ceph-diskspace
/dev/sdc1 3.7T 2.0T 1.7T 55% /var/lib/ceph/osd/ceph-6
/dev/sda1 472G 330G 143G 70% /var/lib/ceph/osd/ceph-4
/dev/sdb1 3.7T 2.4T 1.4T 64% /var/lib/ceph/osd/ceph-5
/dev/sdd1 3.7T 2.4T 1.4T 65% /var/lib/ceph/osd/ceph-7
/dev/sda1 442G 329G 114G 75% /var/lib/ceph/osd/ceph-18
/dev/sdb1 2.8T 2.1T 668G 77% /var/lib/ceph/osd/ceph-19
/dev/sdc1 1.9T 1.3T 593G 69% /var/lib/ceph/osd/ceph-20
/dev/sdd1 2.8T 2.0T 808G 72% /var/lib/ceph/osd/ceph-21
/dev/sdc1 927G 562G 365G 61% /var/lib/ceph/osd/ceph-17
/dev/sdb1 927G 564G 363G 61% /var/lib/ceph/osd/ceph-16
/dev/sda1 1.9T 1.2T 630G 67% /var/lib/ceph/osd/ceph-15
/dev/sdb1 3.7T 2.8T 935G 75% /var/lib/ceph/osd/ceph-9
/dev/sdd1 2.8T 1.4T 1.4T 50% /var/lib/ceph/osd/ceph-11
/dev/sda1 461G 274G 187G 60% /var/lib/ceph/osd/ceph-8
/dev/sdc1 3.7T 2.2T 1.5T 60% /var/lib/ceph/osd/ceph-10
/dev/sdc1 3.7T 1.9T 1.8T 52% /var/lib/ceph/osd/ceph-1
/dev/sde1 3.7T 2.0T 1.7T 54% /var/lib/ceph/osd/ceph-3
/dev/sdd1 3.7T 2.3T 1.5T 62% /var/lib/ceph/osd/ceph-2
/dev/sdb1 472G 308G 165G 66% /var/lib/ceph/osd/ceph-0
/dev/sdb1 1.9T 1.2T 673G 64% /var/lib/ceph/osd/ceph-12
/dev/sdd1 927G 580G 347G 63% /var/lib/ceph/osd/ceph-14
/dev/sdc1 2.8T 2.0T 813G 71% /var/lib/ceph/osd/ceph-13

I’m really satisfied by the way Ceph is constantly improving their product.
Having talked with several interviewers in the last few weeks, I still have to explain why Ceph RBD is not to be confused with CephFS, and that while the latter may not be production ready, RADOS storage may be just the thing you are looking for to distribute your storage.

Don’t trust the Tahr

Beware that since the latest Ubuntu kernel upgrades (14.04.02), you may lose network connectivity when rebooting your servers!

I hit the problem four days ago, rebooting one of my OpenNebula hosts. As it was still unreachable after 5 minutes, I logged in physically, to see that all my “p1pX” and “p4pX” interfaces had disappeared.
Checking udev rules, there is now a file fixing the interface mapping. On a server I have not rebooted yet, this file doesn’t exist.

The story could have ended here. But with Ubuntu, updates are a daily struggle: today, one of my Ceph OSD hosts (hosting 4 disks) spontaneously stopped working.
Meaning: the host was still there, and I was able to open a shell using SSH. Checking processes, all ceph-osd daemons were stopped. Starting them showed no error, yet the processes were still absent. Checking dmesg, I had several lines of SSL-related segfaults.
As expected, rebooting fixed everything, from Ceph to my network interface names.
It’s on days like these that I most enjoy freelancing: I can address my system and network outages in time, way before it’s too late.

While I was starting to accept Ubuntu as safe enough to run production services, renaming interfaces on a production system is unacceptable. I’m curious to know how Canonical dealt with that while providing BootStack and OpenStack-based services.

Note there is still a way to prevent your interfaces from being renamed:

# ln -s /dev/null /etc/udev/rules.d/75-persistent-net-generator.rules

OwnCloud & Pydio

You may have heard of OwnCloud at least, if you’re not using one already. it with more than a couple users.

Thanks to a very fine web frontend, and several standalone clients allowing you to use your shares as a network file system, OwnCloud is user-friendly, and can be trusted to host hundreds of accounts, if not thousands.
The solution was installed at Smile by Thibaut (59pilgrim). I didn’t mind that much: back then, I was using Pydio, and was pretty satisfied already. We had around 700 users, not all being active, yet I could see the whole thing was pretty reliable.

Pydio is a good candidate to compare with OwnCloud. Both offer pretty much the same services. OwnCloud has lots of apps to do everything, Pydio has plugins. Both are PHP-based open source projects, with fairly active communities.
Small advantage to OwnCloud though, with its native S3 connector. And arguably, a better Linux client and web experience.

Recently, disappointed by Pydio (something about having \n in file names, preventing files from being uploaded to my Pydio server), I gave OwnCloud a shot.
I haven’t lost hope in Pydio yet, but OwnCloud is definitely easier to deal with: I could even recommend it to a novice Linux enthusiast.

Ceph Disk Failure

Last week, one of my five Ceph hosts became unreachable.
Investigating, I noticed the OSD daemons were still running. Only the daemons using the root file system were either crashed (the local Ceph MON daemon) or unable to process requests (the SSH daemon was still answering, then cleanly closing the connection).

After rebooting the system and looking at logs, I could see a lot of I/O errors. I left the console logged in to root, waiting for the next occurrence.
Having no spare 60GB SSD, I ordered one.

Two days later, the same problem occurred. From the console, I was unable to run anything (mostly segfaults and ENOENT).
Again, I was able to reboot. This time, I dropped a couple LVMs, unmounted the swap partition, and resized my VG to make sure I had a fair amount of unallocated space on my faulty disk.

The problem persisted, while the average uptime kept getting significantly lower.
I progressively disabled the local OSSEC daemon, puppet, a few crontabs, collectd, munin, … only keeping ceph, nagios and ssh running. The problem kept happening, every 12 to 48 hours.

This morning, the server wasn’t even able to boot.
Checking the BIOS, my root SSD wasn’t detected.
Attaching it to a USB dock, I had to wait a couple of minutes before the disk was actually detected by my laptop (Ubuntu 14.04.02) and my desktop (Debian 7.8).
I caught a break when receiving my new disk at 11 AM.
Running dd from the faulty disk to the new one took around 50 minutes (20MB/s, I can’t believe it!).
Syncing (1x512G SSD, 2x4T & 1x3T HDD) after 8 hours of downtime took around half an hour. Knowing I run a fairly busy mail server, some nntp index, …, this is a new tangible improvement brought by Hammer over Firefly.

I’m now preparing to send the faulty disk to my reseller, for replacement. At least, I will have one handy for the next failure.

Moral: cheap is unreasonable. Better be lucky.


Having recently finished re-installing my cloud environments, I am now focusing on setting my supervision and monitoring services back up.

Last week, a friend of mine (Pierre-Edouard Gosseaume) told me about his experience with ceph-dash, a dashboard for Ceph I hadn’t heard of back then.
Like most Ceph users, I’ve heard of Calamari: an orgy of languages, frameworks and technologies that I ended up building by myself, and deploying on a test cluster I used to operate at Smile.
Calamari is sort of a fiasco. The whole stack gets fucked up by the underlying component: SaltStack.
SaltStack is yet another configuration deployment solution, such as Puppet, Ansible or Rundeck.
Using Calamari, the calamari-server instance uses SaltStack to communicate with its clients. As far as I could see, the SaltStack service randomly stops running on clients, until none of them is responding to our server queries. A minute-based cron is required to keep your queries somewhat consistent. It’s a mess: I’ve never installed Calamari on a prod cluster, and would recommend waiting at least for some pre-packaged release.

So, back to ceph-dash.
My first impression was mixed, at best. Being distributed on GitHub by some “Crapworks”, I had my doubts.
On second thought, you can see they have a domain, Deutsche Qualität: maybe Germans grant some hidden meaning to the crap thing, all right.

Again, there’s no package shipped. As with Calamari, the ceph-dash makefile allows you to build deb packages.
Unlike Calamari, ceph-dash is a very lightweight tool, based mostly on Python, inducing low overhead, and able to run entirely outside of your Ceph cluster.
Even if the documentation tells you to install ceph-dash onto a MON host of your cluster, you may as well install it on a dedicated VM, as long as you have installed the right librados, have a correct /etc/ceph/ceph.conf, and can use a valid keyring to access the cluster.

ceph-dash ships with a small script running your service for test purposes. It also ships with the necessary configuration for Apache integration, easily convertible to Nginx.
Going from zero to dashboard takes about 5 minutes. Which, again, is vastly different from my experience with Calamari.
The major novelty being, it actually works.


A long story binds me to Ceph (partially documented on my wiki).
Mostly at Smile, but also at home, I tested versions from 0.47 to what is now 0.87.
By Christmas 2014, I bought myself 5 ProLiant MicroServers, trying to allocate dedicated physical resources to manage virtual file systems.
Each node contains a 60G SSD, used for the root filesystem as well as OSD journals, one 512GB SSD for “fast” filesystems, and 3 disks from 1 to 4T, filling up the remaining slots.

Almost a month later, while at work, one of my nodes stopped answering. Immediately, the cluster tried to recover degraded placement groups, by duplicating the remaining replica to some free space.
At some point, the recovery process filled up a disk until reaching its limit. The cluster was now short on space.
When I realized the problem, I left work early, rushed home and rebooted the failing server.
Having recovered the missing disks, the cluster remained in a degraded state, because of the disk filled earlier. The daemon managing this disk was unable to start, because the disk was too full. So I ended up dropping its content and reformatting the disk, hoping I would still have another replica of whatever I was destroying.
On the bright side, the cluster started re-balancing itself, I could finally restart my VMs, … In retrospect, I don’t see how else I could have gotten it back up.
Meanwhile, I actually did lose data in the process: one placement group remained degraded.

The cluster being unusable, and yet storing somewhat relevant data for my personal use, I ended up creating a new pool (implying: new placement groups), re-creating my cloud storage space from scratch.

After two weeks on the Ceph IRC, I found one person with the same problem, and no one with an actual solution.
I vaguely heard of the possibility to `cat` files from OSD partitions, to reconstruct an image file matching the one presented by Ceph. Pending further investigations, …

And here we are, a few months later.
The situation is getting worse every day.
I’m finally investigating.

Basically, you have to look at the rbd info output to retrieve a string contained in the names of all the files holding your data:

moros:~# rbd -p disaster info one-70
rbd image 'one-70':
    size 195 GB in 50001 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.66d52ae8944a
    format: 2
    features: layering

From there, you’ll use a script to list all files you actually need to reconstruct your disk.

In my case, the script creates a directory $disk_name, and stores a file per physical host, listing corresponding blocks.
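
A minimal sketch of what such a listing step could look like, to be run on each OSD host. It assumes FileStore OSDs mounted under /var/lib/ceph/osd and the usual on-disk naming, where the object name shows up in the file name with its underscore escaped as "\u" (hence the backslash mentioned further down). The function name list_rbd_objects is hypothetical, and this is not the script referred to above.

```shell
# list_rbd_objects BLOCK_NAME_PREFIX [OSD_ROOT]
# Lists the files holding objects of one RBD image under the local OSD data directories.
list_rbd_objects() {
    # Keep only the identifier after "rbd_data." (66d52ae8944a in the example above).
    image_id=${1#rbd_data.}
    # Assumed default location of the OSD mount points.
    osd_root=${2:-/var/lib/ceph/osd}
    # Matching on "udata.<id>." avoids having to quote the escaped underscore.
    find "$osd_root" -type f -name "*udata.${image_id}.*" | sort
}

# Usage, writing one listing per host:
#   list_rbd_objects rbd_data.66d52ae8944a > "one-70/$(hostname)"
```

With 4096 kB objects, the listings gathered from all hosts should hold at most one entry per object and per replica; objects that were never written have no file at all.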

Then, you’ll use a script to store all these files into a single directory, for further processing.
Note that the file names all contain a backslash, therefore our script needs to connect to the OSD host and, from there, run some scp command to send the file to its intended destination. Such a mess requires exchanging SSH keys, … be warned.
Another way to do this may be to share your OSD roots, most likely using NFS, and mount them on the host reconstructing the data.

Finally, we can follow Sebastien Han’s procedure, relying on rbd_restore.

You should end up with a disk image file.
If your disks all contain partitions, which is my case, … then run fdisk -l on the obtained file.
Get the offset where the partition you would like to recover starts, and the block size. Run dd if=your_image bs=block_size skip=start_offset of=$disk_from_partition.
Run fsck on the image obtained.
If you see complaints about bad geometry: block count xx exceeds size of device (yy blocks), then fire up resize2fs.
Assuming xfs, if your dmesg tells about attempt to access beyond end of device, look for a want=XXX, limit=YYY line to deduce the amount of space missing from your disk. Then, dd if=/dev/zero of=restored_image seek=${actual size + length to add in MB} obs=1M count=0 should extend your image with zeroes, allowing you to mount your disk.
An exhaustive log is available there.
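
The partition extraction step above can be wrapped in a small helper. This is a minimal sketch, assuming the start sector, sector count and sector size are read from the fdisk -l output; the function name extract_partition and its argument order are made up for this example.

```shell
# extract_partition IMAGE START_SECTOR SECTOR_COUNT OUTPUT [SECTOR_SIZE]
# Copies one partition out of a full disk image, using the figures given by fdisk -l.
extract_partition() {
    image=$1
    start=$2
    sectors=$3
    out=$4
    # fdisk usually reports 512-byte sectors; pass another value if yours differs.
    sector_size=${5:-512}
    dd if="$image" of="$out" bs="$sector_size" skip="$start" count="$sectors" 2>/dev/null
}

# Usage (hypothetical figures):
#   extract_partition one-70.img 2048 409600000 one-70-part1.img
```

Passing the sector count keeps dd from copying whatever follows the partition, which matters when the image holds several of them.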