Munin is a popular monitoring solution, written mostly in Perl.

As with most supervision and monitoring solutions, munin relies on an agent installed on the hosts you want to track metrics from, and a server gathering those metrics and serving them via CGI, generating graphs from the collected data.
As with most such solutions, munin uses loadable probes (plugins) to extract metrics.

The server side consists of a crontab: iterating over all your hosts, it lists and fetches all available probes, storing the results into RRD files.
Beyond a thousand hosts, you may need to move munin's working root to some tmpfs storage, with another crontab regularly rsyncing the tmpfs content back to a persistent directory. In any case, hosting this filesystem on an SSD is a good idea.

The main advantage of munin over its alternatives, I think, is its simplicity.
Its main competitor here is Collectd, whose default web front end is particularly ugly and impractical. Alternate front ends such as SickMuse are nicer-looking, but still lack that simplicity, and some require excessively redundant configuration before presenting anything comparable to munin in terms of relevance and efficiency.

It is also very easy to implement your own probes. From scripts to binaries, with a simple way to set contextual variables – evaluated when the plugin name matches the context expression – and requiring a single declaration on the server to collect all metrics served by a node: I don't know of any supervision solution easier to deploy, or to incorporate into your orchestration infrastructure.
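As an illustration – this is my own sketch, not one of the stock plugins – a complete probe can be a dozen lines of shell:

```shell
#!/bin/sh
# Minimal munin probe sketch: graphs the 1-minute load average.
# Called with "config", a probe describes its graph; called bare, it
# prints current values.

case "$1" in
config)
    cat <<EOF
graph_title Load average (1 min)
graph_vlabel load
load1.label load1
EOF
    ;;
*)
    # First field of /proc/loadavg is the 1-minute load average (Linux).
    printf 'load1.value %s\n' "$(cut -d' ' -f1 /proc/loadavg)"
    ;;
esac
```

Dropped into the plugins directory and made executable, it shows up in the node's plugin list like any other probe.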

Finally, debugging probes is conceptually easy: all you need is telnet.

muninone:~# telnet 4949
Connected to
Escape character is '^]'.
# munin node at
list
cpu df load memory ntp_kernel_err ntp_kernel_pll_freq ntp_kernel_pll_off open_files processes users vmstat
fetch cpu
user.value 3337308
nice.value 39168
system.value 13389003
interrupt.value 6679408
idle.value 701005749
Connection closed by foreign host.

Here are a bunch of plugins I use that you won't find in default installations, most of them copied from public repositories such as GitHub.

Let me introduce the few probes I wrote, starting with pool_. You may use it to wrap any munin probe. Its main advantage is that it consists of a single cat: it answers quickly, and keeps you from generating discontinuous graphs.
I used it in conjunction with the esx_ probe (found somewhere and patched, mostly not mine), which may take several seconds to complete. When a probe takes too long to answer, munin times out and discards the data. Using a crontab to poll and write your metrics to temporary files, independently from munin's usual retrieval process, and then using the pool_ probe to present these temporary files, is a way to deal with such probes requiring unusual processing time.
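A sketch of the idea – paths and the esx_ name being illustrative, not pool_'s actual code:

```shell
#!/bin/sh
# pool_-style wrapper sketch: a cron job runs the slow probe and caches
# its output; munin's probe merely cats the cache. Illustrative crontab:
#   */5 * * * * esx_probe > /var/run/munin/esx.cache.tmp \
#               && mv /var/run/munin/esx.cache.tmp /var/run/munin/esx.cache

CACHE="${CACHE:-/var/run/munin/esx.cache}"   # hypothetical cache path

case "$1" in
config) [ -f "$CACHE.config" ] && cat "$CACHE.config" ;;
*)      [ -f "$CACHE" ] && cat "$CACHE" ;;
esac
```

The mv after the cron job matters: it makes the cache update atomic, so munin never cats a half-written file.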

Less relevant: the freebox_ probe, for the Freebox V5 (inspired by a similar probe dealing with the Freebox V6):

# munin node at
fetch freebox_atm
atm_down.value 6261
atm_up.value 1025
fetch freebox_attenuation
attenuation_down.value 44.00
attenuation_up.value 24.80
fetch freebox_snr
snr_down.value 6.40
snr_up.value 7.20
fetch freebox_status
status.value 1
fetch freebox_uptime
uptime.value 131.73

Multi Router Traffic Grapher

Also known as MRTG, this solution is perfect for generating graphs from SNMP checks.
MRTG is most likely to be used to graph bandwidth and throughput: running its configuration generation utility (cfgmaker) without options outputs everything related to your SNMP target's network interfaces. That said, MRTG can deal with virtually anything your SNMP target can send back (disks, load, memory, …).
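A minimal bootstrap, with community string and hostname as placeholders, goes like:

```shell
mrtg:~# cfgmaker --output /etc/mrtg/mrtg.cfg public@router.example.com
mrtg:~# indexmaker --output /var/www/mrtg/index.html /etc/mrtg/mrtg.cfg
# then run mrtg itself from cron, every 5 minutes:
# */5 * * * * env LANG=C mrtg /etc/mrtg/mrtg.cfg
```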

MRTG is very modular: metrics may be stored in several formats, and viewed through several interfaces.
Although it ships with its own HTML index generators, MRTG's main task is to collect SNMP data.

The default web user interface is more of a test page than an actual supervision front page.
From there, several solutions came up, and others grew, aiming to wrap MRTG's greatness in more-or-less powerful interfaces.
14all.cgi is one of the more popular front ends to MRTG, providing a very simple, yet powerful, interface to your metrics.
router2.cgi looks more exhaustive on paper – I haven't tested it yet.
Solutions such as Cacti may also use MRTG.

Another popular solution to use on top of MRTG is PHP-Weathermap (not to be confused with Weathermap4rrds; their configuration syntaxes are almost identical, though the latter lacks a few features compared to the original).
Derivatives may be found on OVH's own network weathermap (by the way, my congratulations to the intern who most likely spent several days configuring this map). You get the idea: place and connect your objects, then associate your rrdtool collections with specific links.

So far, I've mostly used 14all.cgi, which is awesome at dealing with graphs – generating graphs by clicking specific points on an image is not possible using the generic MRTG interface.
Still, I'm not convinced by the overall presentation of these graphs. While working for Smile, my manager kept nagging me about this. I ended up writing my own client, re-using munin's CSS and 14all.cgi's graphs, then adding weathermap support, and finally keeping a copy of generated maps to present a history to web clients.


Having recently finished re-installing my cloud environments, I am now focusing on setting my supervision and monitoring services back up.

Last week, a friend of mine (Pierre-Edouard Gosseaume) told me about his experience with ceph-dash, a dashboard for Ceph I hadn't heard of until then.
Like most Ceph users, I've heard of Calamari – an orgy of languages, frameworks and technologies that I ended up building myself, and deploying on a test cluster I used to operate at Smile.
Calamari is sort of a fiasco. The whole stack gets fucked up by its underlying component: Saltstack.
Saltstack is yet another configuration deployment solution, along the lines of Puppet, Ansible or Rundeck.
With Calamari, the calamari-server instance uses saltstack to communicate with its clients. As far as I could see, the saltstack service randomly stops running on clients, until no one is left answering the server's queries. A minute-based cron is required to keep your queries somewhat consistent. It's a mess: I've never installed Calamari on a production cluster, and would recommend waiting at least for some pre-packaged release.

So, back to ceph-dash.
My first impression was mixed, at best. It being distributed on GitHub by some “Crapworks“, I had my doubts.
On second thought, you can see they have a domain – Deutsche Qualität: maybe Germans grant some hidden meaning to the crap thing, all right.

Again, there's no package shipped, though as with Calamari, ceph-dash's makefile lets you build deb packages.
Unlike Calamari, ceph-dash is a very lightweight tool, mostly Python-based, inducing low overhead, and able to run fully apart from your Ceph cluster.
Even if the documentation tells you to install ceph-dash on one of your cluster's MON hosts, you may as well install it on some dedicated VM, as long as you have the right librados installed, a correct /etc/ceph/ceph.conf, and a valid keyring granting access to the cluster.
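For what it's worth, here's the kind of sanity check I'd run on such a dedicated VM – the client name “dashboard” being a hypothetical keyring you would have created beforehand with ceph auth get-or-create:

```shell
dashvm:~# apt-get install python-rados       # librados python bindings
dashvm:~# ls /etc/ceph/ceph.conf /etc/ceph/ceph.client.dashboard.keyring
dashvm:~# ceph --id dashboard -s             # if this prints cluster status, ceph-dash will work
```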

ceph-dash ships with a small script running the service for test purposes. It also ships with the necessary configuration for Apache integration, easily converted to Nginx.
Zero-to-dashboard takes about 5 minutes. Which, again, is vastly different from my experience with Calamari.
The major novelty being: it actually works.


Another day, another service to deploy, another chance to see Jessie in action.
I've been using Subsonic for a couple of years now, and am quite satisfied with the product.

The practical downside of Subsonic is that it relies on Java.
Meanwhile, there aren't many alternatives for dynamically serving media behind an HTTP server. The solution everyone talks about is Ampache – for those who haven't dealt with it yet, its web user interface looks disappointingly outdated. So much so that Subsonic is actually the only front end I would recommend for serving music libraries.

Dealing with ever-growing libraries, regular clients such as Clementine, Amarok, Banshee, … eat all my desktop RAM, when they aren't fucking with my IO, scanning library directory trees, …
At some point, Java's resource consumption is actually lower than any graphical client's. Plus, being available over HTTP, your database management can be delegated to some virtual environment, while playback is handled by any flash-capable web client, or by a wide range of applications implementing the Subsonic client API.

Another inconvenience with Subsonic is that a few features are locked until you pay for a license.
A few years ago, one could buy a permanent license: investing something like $10, a friend (Paul) activated his Subsonic and never had to pay again.
Last year, the license plans changed. For a few months, the lifetime license disappeared. It came back, as far as I can see today, and now costs $99.

Why pay for Subsonic premium services?
You most likely won't need most of the features involved. I still don't.
Although, another friend (Clement) explained to me that after one month trying Subsonic on his VPS, he was unable to cache new media on his Android phone using the official Subsonic application.

Why would you pay, then?
You don't, actually. I can't find the thread on the Subsonic forums anymore: once upon a time, there was a discussion relaying some info I had formerly read on some blog, about the possibility of using the developer's test account as yours to enable premium features on your Subsonic instance. Quickly following that post, the main developer answered, saying it wouldn't be possible anymore.
Today, the only reference to it on the Subsonic web site is actually an error message, inviting you to renew your subscription.
From my point of view, this is more of a communication operation than an actual fix. Indeed, you are still able to use the very same login and registration key to enable premium features.

Originally, you only needed to add a few lines to your

Today, you also need to add the following to your /etc/hosts:

Restart the subsonic service, enjoy premium.

Concluding with a comment regarding Jessie: you might have noticed the ffmpeg package did not make it into Jessie's official repositories. Which is no surprise for some aficionados – and definitely was one for me.
It seems the Debian Security team had no time to deal with ffmpeg. Yet they dealt with libav. Integrated systemd. And killed kfreebsd.


For the last few days, I've been re-creating my Ceph cluster from scratch.
After a first disaster in January, leading to the loss of a placement group (out of 640), and most recently the loss of my only monitor, I decided to start from a clean slate.
After two days importing a few terabytes, the main services are back up, and I took the afternoon to install, for the very first time, a streaming solution I'd been hearing good things about: Plex.

As far as I can tell, it works pretty well.
Two things I would complain about, starting with the Debian packaging, which assumes there is no systemd on Debian – ironically, the script handles systemd on Ubuntu after 14.04: a test is missing to be fully Jessie-compliant.
The second thing may be a PEBKAC, I still need to investigate: it appears that when I'm streaming more than two media, I get errors about the server not being able to open the source media.

Plex lets you browse your media libraries, scanning directory trees and matching files as either Movies, TV Shows, Music or Home Movies.
Beforehand, you need to ensure your libraries are properly named, according to Plex standards. If you prefer keeping your layout intact, you may write some script linking your media under some alternative library root, allowing Plex to deal with your media properly.
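As an illustration, a linking script of that kind may be sketched as follows – paths are hypothetical, and symlinks keep the original layout untouched:

```shell
#!/bin/sh
# Sketch: expose an untouched library to Plex through symlinks.
# SRC/DST are hypothetical; a real version would also rename links
# to Plex's "Title (Year).ext" convention.

SRC="${SRC:-/srv/media/movies}"
DST="${DST:-/srv/plex/Movies}"

if [ -d "$SRC" ]; then
    mkdir -p "$DST"
    find "$SRC" -type f \( -name '*.mkv' -o -name '*.avi' -o -name '*.mp4' \) |
    while read -r f; do
        ln -sf "$f" "$DST/$(basename "$f")"
    done
fi
```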

Starting with my movies and series on the first day, I quickly wrote another script linking my music files into a directory layout following Plex's music library naming directives.
Checking out all the Plex menus, I ended up configuring Channels, connecting my YouTube, Vimeo and SoundCloud accounts.

In the end, Plex is much more than I was expecting in the first place.
It has done quite well in everything I have tested so far. The video player easily lets you switch audio track, subtitle track and media quality. Transcoding may induce a relatively huge CPU load: when streaming starts stuttering, look for the “settings-like” icon, and either pick Original transcoding quality or just lower your bitrate.

Last comment: note that Plex's runtime user needs write access to your library directories – folders not allowing writes will not be scanned. Even though I've found no proof of anything actually being written there, ATM.

And let's finish with a script used with a SickBeard backend, rewriting your episodes' metadata (mtime) to match their actual release dates.
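The script itself isn't reproduced here, but its mechanics can be sketched as follows – the air-date map file is a stand-in for whatever source (SickBeard's database, thetvdb, …) you'd actually query:

```shell
#!/bin/sh
# Sketch: reset each episode file's mtime to its air date. The MAP file
# ("S01E02 2015-04-27" lines) is a stand-in for a real metadata source.

MAP="${MAP:-/srv/series/airdates.txt}"   # hypothetical map file
DIR="${DIR:-/srv/series}"                # library root

if [ -f "$MAP" ]; then
    while read -r episode airdate; do
        # GNU touch accepts a date string; match files by episode tag.
        find "$DIR" -type f -name "*${episode}*" -exec touch -d "$airdate" {} +
    done < "$MAP"
fi
```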


This is no WMD, but a German radio station transmitting the current date and time, for synchronization purposes.

I first stumbled upon DCF77 while working for CamTrace.
Some of our customers insisted on not connecting their video-surveillance network to the Internet.

The first solution coming to mind is to set up some NTP server. Which would work, most of the time.
You should know, though, that NTP uses a stratum, keeping track of how far you are from a reference clock. Having a lower value is critical, for other devices to accept synchronizing their clocks against yours.
Our customers may restart their server without fully understanding those concepts. And a client configuring his camera right after booting his server is not wrong. Thus, a good way of ensuring your stratum is low enough – so that devices always accept what your NTP server answers – is to get your time from a reliable source, such as some German-Qualität radio clock. Note this solution is only viable within about 2,000 km of Frankfurt, though.

I'm mostly familiar with Axis network cameras, but assume the problem presents itself on anything not embedding an RTC (like the Raspberry Pi in its default configuration, the module being sold separately): setting the date when installing the device is a mistake, knowing it will drift relatively quickly (in the worst cases, we're talking a few minutes per day, quickly adding up to hours, …).

Thus, I've combined a Raspberry Pi with a MouseCLOCK USB. NTP allows interfacing with such clocks. A minimalistic /etc/ntp.conf would look like this:

server mode 19
fudge stratum 0
fudge time1 0.042

Using as the server address tells ntp to look for a /dev/refclock-0 device. Assuming FreeBSD, your clock device would show up as /dev/cuaU$x; on Linux, /dev/ttyUSB$x. You may add the following lines to the ntp startup script:

if /usr/local/sbin/lsusb | grep mouseCLOCK >/dev/null; then
    if test -c /dev/cuaU0; then
        test -h /dev/refclock-0 || ln -s /dev/cuaU0 /dev/refclock-0
    fi
fi

A cleaner way to create this link is to use devfs. On FreeBSD, add to /etc/devfs.conf the line link cuaU0 refclock-0.
On Debian, use udev, creating a file in /etc/udev/rules.d matching your device, like KERNEL=="ttyUSB*", ATTRS{idProduct}=="e88a", SYMLINK+="refclock-%m".

After a while, querying ntp should confirm proper synchronization:

root@cronos:/usr/ports # ntpq -crv -pn
assID=0 status=0264 leap_none, sync_lf_clock, 6 events, event_peer/strat_chg,
version="ntpd 4.2.4p5-a (1)", processor="arm",
system="FreeBSD/10.1-STABLE", leap=00, stratum=1, precision=-19,
rootdelay=0.000, rootdispersion=1026.460, peer=25445, refid=DCFa,
reftime=d8e8e24f.1ec18939 Mon, Apr 27 2015 18:29:03.120, poll=6,
clock=d8e8e25d.bb054ad7 Mon, Apr 27 2015 18:29:17.730, state=4,
offset=-77.794, frequency=472.839, jitter=10.929, noise=27.504,
stability=0.197, tai=0
remote refid st t when poll reach delay offset jitter
* .DCFa. 0 l 14 64 1 0.000 -77.794 10.929
root@cronos:/usr/ports #

Jessie, out

Yesterday, Debian announced the release of Jessie as their new stable release, after 24 months of development and patches.

First and foremost, Jessie comes with systemd – sysvinit still being available.
Let’s review the fundamentals, before unleashing the troll:

  • Kernel 3.16
  • LibVirt 1.2.9
  • Xen 4.4
  • Vzctl 4.8
  • Asterisk 11.13
  • Puppet 3.7
  • Apache 2.4.10
  • Bind 9.9
  • GCC 4.9
  • PHP 5.6
  • PostgreSQL 9.4
  • MySQL 5.5.43
  • Samba 4.1

Now let's go back to systemd, and to Lennart Poettering's “gifts” to the community in general.

Systemd is controversial for many reasons. At first, it was about re-inventing the wheel.
Systemd is supposed to replace sysvinit scripts – which have been powering most Unix distributions for some 20 years.
However, systemd relies on Linux kernel specifics, and thus is not portable to BSD. On that subject, Poettering tells us that even if his solution were portable, BSD users wouldn't switch to it, so there's no problem in creating a sysvinit alternative targeting Linux systems only (which feels like it could be true, but also sounds rather dismissive), and that anyway, porting it would require time spent investigating alternatives (see #13 and #15 from Lennart's response to his detractors).
The thing is, systemd does not only replace sysvinit: it also manages file system mounting, power and device management, disk encryption, network configuration, session management, GPT partition discovery, … Some may argue that systemd goes against the Unix philosophy of “doing one little thing, and doing it well”.

From a practical point of view, systemd stores its logs using journald, in binary files. Logs are thus corruptible, and can't be manipulated as easily as traditional plain-text files.
Being dependent on the Linux kernel, running a particular version of systemd implies running a compatible version of the kernel.
Systemd differs from sysvinit in being non-deterministic, non-predictable. Its process is hidden within init, risking bringing down the whole system by crashing. Meanwhile, plenty of non-kernel system upgrades now require you to restart your system anyway – tastes like Windows, feels like Windows, …

So why try to fix something that is working? Why do the main distributions all seem to switch to systemd, when there are other replacements such as OpenRC, or even sysvinit?

By nature, integrating and re-designing Unix core components such as udev or dbus unilaterally involves both having all components systematically installed (when a server does not necessarily need them), and their interfaces being progressively and obscurely rewritten. At some point, GNOME started using systemd's logind to authenticate users. Setting up GNOME without systemd became acrobatic. More and more distributions are forced to integrate systemd.

Systemd could be described as a system daemon, sometimes referred to as a “basic userspace building block to make an OS from”, engulfing features formerly attributed to a syslog daemon, your network or device managers, … Systemd's spread is symbolic, showing a radical shift in thinking within the Linux community: desktop-oriented, choice-limiting, isolationist. The collateral victim being Debian/kFreeBSD, guilty of not using the holy Linux kernel.

Finishing on a parenthesis regarding Debian's kFreeBSD architecture: despite being encouraged to build it myself, I'm a bit puzzled.
On one hand, you have the installation-guide-kfreebsd-amd64 package, containing a few HTML files with the well-known documentation and broken links to non-existent kernels and initrds. On the other hand, the kfreebsd-source-10.1 package, whose Makefile doesn't seem to work at all. Hooray for Poettering.


BlueMind is a mail solution, based on a few popular pieces of software such as Cyrus, Postfix, ElasticSearch or nginx, glued together by several Java services.

It is still under active development: look closely after each update.
A good practice may be to list all ports in a LISTENING state, along with the processes they relate to. It has helped me several times to identify which process is down, without restarting the whole enchilada.
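On a Linux host, that check is a single command:

```shell
deepthroat:~# ss -lntp          # listening TCP sockets, with owning process
deepthroat:~# netstat -lntp     # equivalent on older systems
```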

BlueMind's SMTP access is handled by Postfix (*:25). Incoming mail is passed to a Java process (bm-lmtp,, which eventually delivers it to Cyrus (*:24).
BlueMind's IMAP is served by nginx (*:143), proxying requests to Cyrus (*:1143).
The BlueMind webmail is served by nginx as well – you may recognize a fork of Roundcube. PHP is served by php-fpm; former versions involved Apache's mod_php.

On the bright side, BlueMind provides a fully-functional (and Exchange-compliant) mail solution. While my previous tests (2.x branch) suffered from poor Cyrus configuration (processing 4 mails at a time – too slow to handle the tens of mailing-lists I'm subscribed to), on my current setup (3.14) I had no need to modify anything.
Although, BlueMind Community Edition does not support applicative updates: it is mandatory to install a new server, then migrate your mailboxes. Otherwise you may become a partner (reseller, integrator, …) or, most likely, a paying customer, to be provided with BlueMind updates – in which case you would most likely be assisted by a developer from BlueMind during the update process and your first hours running the new version.

There's not much to tell about the BlueMind administration console. The most interesting parts are system-management related, especially mailbox backups, and archiving mails on an age (and/or location) basis.
On that subject: BlueMind backups are stored in /var/backups/bluemind. Ideally, keep this folder on a different physical medium.

For those familiar with Cyrus: you may use the cyradm command.
Retrieve your admin password:

deepthroat:/etc# grep proxy /imapd.conf
proxy_authname: admin0
proxy_password: some-alpha-numeric-chain

You may then connect:

deepthroat:/etc# cyradm -u admin0 localhost
[enter the proxy_password value]

Sometimes shit happens, and you may see IOERROR messages, followed by header CRC mismatch, invalid cache record, mailbox format corruption detected, … In which case, you would need to run something like:

deepthroat:/etc# /usr/lib/cyrus/bin/reconstruct user/admin/

Assuming you are using several Postfix instances relaying mail to your BlueMind server, try to replicate the virtual_aliases, virtual_domains and virtual_mailbox maps everywhere – instead of letting everything through, and delegating all consistency checks to your BlueMind instance.

Finally, note there is one critical mail service that BlueMind does not provide: spam (and/or virus) checking.
Assuming you can deal with SpamAssassin and ClamAV, you may want to alter Postfix routing prior to handing your mail to bm-lmtp. And if you want to train your SpamAssassin with user input, keep it running close to your Cyrus server.
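I haven't scripted this part for BlueMind specifically, but the classic Postfix + SpamAssassin recipe – the spamfilter service name being arbitrary – looks like this sketch:

```shell
# Route incoming mail through a content_filter before delivery:
deepthroat:~# postconf -e "content_filter = spamfilter:dummy"

# master.cf then needs the matching pipe service, re-injecting
# filtered mail through sendmail:
#
# spamfilter unix -    n    n    -    -    pipe
#   user=spamd argv=/usr/bin/spamc -e /usr/sbin/sendmail -oi -f ${sender} ${recipient}

deepthroat:~# postfix reload
```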


A long story binds me to Ceph (partially documented on my wiki).
Mostly at Smile, but also at home, I have tested versions from 0.47 to what is now 0.87.
Around Christmas 2014, I bought myself 5 ProLiant MicroServers, trying to allocate dedicated physical resources to manage virtual file systems.
Each node contains a 60 GB SSD used for the root filesystem as well as the OSD journals, one 512 GB SSD for “fast” filesystems, and 3 disks from 1 to 4 TB filling up the remaining slots.

Almost a month later, while I was at work, one of my nodes stopped answering. Immediately, the cluster tried to recover the degraded placement groups, duplicating the remaining replicas to free space.
At some point, the recovery process filled a disk up to its limit. The cluster was now missing space.
When I realized the problem, I left work early, rushed home and rebooted the failing server.
Recovering the missing disks, the cluster remained in a degraded state, because of the disk filled up earlier. The daemon managing this disk was unable to start, the disk being too full. So I ended up dropping its content and reformatting the disk, hoping I still had another replica of whatever I was destroying.
On the bright side, the cluster started re-balancing itself, and I could finally restart my VMs, … Retrospectively, I don't see how else I could have gotten it back up.
Meanwhile, I actually did lose data in the process: one placement group remained degraded.

The cluster being unusable, yet storing somewhat relevant data for my personal use, I ended up creating a new pool (implying: new placement groups), re-creating my cloud storage space from scratch.

After two weeks on the Ceph IRC channel, I found one person with the same problem, and no one with an actual solution.
I vaguely heard of the possibility of `cat'-ing files from OSD partitions to reconstruct an image file matching the one presented by Ceph. Pending further investigations, …

And here we are, a few months later.
The situation is getting worse every day.
I’m finally investigating.

Basically, you have to look at the rbd info to retrieve a string contained in the names of all files holding your data:

moros:~# rbd -p disaster info one-70
rbd image 'one-70':
size 195 GB in 50001 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.66d52ae8944a
format: 2
features: layering

From there, you’ll use a script to list all files you actually need to reconstruct your disk.

In my case, the script creates a directory $disk_name, and stores a file per physical host, listing corresponding blocks.
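In spirit, the listing boils down to a find on each OSD host, keyed on the block_name_prefix reported above:

```shell
#!/bin/sh
# Sketch, to run on each OSD host: list on-disk objects belonging to one
# rbd image, matching the block_name_prefix suffix from `rbd info`
# (66d52ae8944a in my case).

PREFIX="${PREFIX:-66d52ae8944a}"

if [ -d /var/lib/ceph/osd ]; then
    find /var/lib/ceph/osd -type f -name "*${PREFIX}*"
fi
```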

Then, you'll use a script to store all these files into a single directory for further processing.
Note the file names all contain a backslash; therefore our script needs to connect to the OSD host and then run some scp command to send each file to its designated destination. Such a mess requires exchanging SSH keys, … be warned.
Another way to do this may be to share your OSD roots, most likely using NFS, and mount them on the host reconstructing the data.

Finally, we can follow Sebastien Han’s procedure, relying on rbd_restore.

You should end up with a disk image file.
If your disks all contain partitions, which is my case, then fdisk -l the obtained file.
Get the offset where the partition you would like to recover starts, and the block size. Run dd if=your_image bs=block_size skip=start_offset of=$disk_from_partition.
Run fsck on the image obtained.
If you see complaints about bad geometry: block count xx exceeds size of device (yy blocks), then fire up resize2fs.
Assuming XFS, if your dmesg complains about an attempt to access beyond end of device, look for a want=XXX, limit=YYY line to deduce the amount of space missing from your disk; then dd if=/dev/zero of=restored_image seek=${actual size + length to add, in MB} obs=1M count=0 will append zeroes to your image, allowing you to mount your disk.
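The fdisk/dd/fsck sequence can be sketched as follows – image name, offset and sector size are illustrative; read the real ones from fdisk's output:

```shell
#!/bin/sh
# Sketch: carve one partition out of a restored whole-disk image.
# IMG, START and BS are illustrative; read the real partition start
# (in sectors) and sector size from `fdisk -l` on the image.

IMG="${IMG:-one-70.img}"
START="${START:-2048}"    # partition start offset, in sectors
BS="${BS:-512}"           # sector size

if [ -f "$IMG" ]; then
    fdisk -l "$IMG"
    dd if="$IMG" of="${IMG%.img}-part1.img" bs="$BS" skip="$START"
    fsck -f "${IMG%.img}-part1.img"
fi
```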
An exhaustive log is available there.


You may be familiar with eCryptfs, a disk encryption solution, especially known for shipping with Ubuntu default installations as the underlying 'Encrypted Home' solution.

From what I gathered through empirical experience, I would say eCryptfs, AKA the Enterprise Cryptographic Filesystem, should be avoided in both enterprise and personal use cases.

My first surprise came while crawling a site for its records: creating a file per author, I ended up with more than 5000 files in a single directory. At that point, your dmesg starts showing several 'Valid eCryptfs header not found in file header region or xattr region, inode XXX', then a few 'Watchdog[..]: segfault at 0 ip $addr1 sp $addr2 error 6 in[$addr3]'.

The second surprise came a few days later, while recovering RBDs from my Ceph cluster. Storing all parts of a disk in the same directory, I again ended up with folders holding several thousands of files, and my dmesg going mad.
You will notice that moving the folder outside of your eCryptfs directory fixes the problem.

Of course, most users won't recover their Ceph cluster, nor crawl a site for its complete database. Although, configuring Thunderbird/Icedove, you may end up caching enough mail from your IMAP/POP server to reach the limits I did.

This is not my first experience with cryptographic solutions.
Once upon a time, TrueCrypt offered a very exhaustive toolkit, managing everything from devices to files – so much so that my end-of-studies project was about forking TrueCrypt, adding features the maintainer did not want to see in his product (BootTruster).
On today's Linux systems, a more classic way to do it would be to use LUKS (Linux Unified Key Setup), based on a standardized device-mapper target: dm-crypt.
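For reference, a minimal LUKS setup boils down to the following – /dev/sdX2 is a placeholder, and luksFormat destroys its content:

```shell
moros:~# cryptsetup luksFormat /dev/sdX2
moros:~# cryptsetup luksOpen /dev/sdX2 crypthome
moros:~# mkfs.ext4 /dev/mapper/crypthome
moros:~# mount /dev/mapper/crypthome /home
```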

Anyway: next time you're asked about encrypting your home file system, think twice about which solution you're going to use. Then choose LUKS.