
OwnCloud & Pydio

You may have heard of OwnCloud at least, if you’re not using it already.

Thanks to a very fine web frontend, and several standalone clients that let you use your shares as a network file system, OwnCloud is user friendly, and can be trusted to host hundreds of accounts, if not thousands.
The solution was installed at Smile by Thibaut (59pilgrim). I didn’t mind that much back then: I was using Pydio, and was pretty satisfied with it already. We had around 700 users, not all of them active, yet I could see the whole thing was pretty reliable.

Pydio is a good candidate to compare with OwnCloud. Both offer pretty much the same services: OwnCloud has lots of apps to do everything, Pydio has plugins. Both are PHP-based open-source projects, with fairly active communities.
Small advantage to OwnCloud though, with its native S3 connector. And arguably, a better Linux client and web experience.

Recently, disappointed by Pydio – something about \n characters in file names preventing files from being uploaded to my Pydio server – I gave OwnCloud a shot.
I haven’t lost hope in Pydio yet, but OwnCloud is definitely easier to deal with: I could even recommend it to a novice Linux enthusiast.

Asterisk

Asterisk is an open-source framework for building communication-based applications.
Historically, Asterisk is an alternative to most proprietary Private Branch eXchange (PBX) systems, dealing with voice communications, conference calling, IVR or voicemail.

Quite modular, Asterisk ships with several audio codecs (g711a, g711u, g722, gsm), handles standard protocols (SIP, IAX), and can be used virtually anywhere, from multi-tenant providers to end-user setups.
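To give an idea of what a configuration looks like, here is a minimal, hypothetical sketch of a SIP peer and its dialplan context using the classic chan_sip driver; the extension number, secret and context names are examples, not taken from my actual setup:

; sip.conf -- one SIP peer
[6001]
type=friend
host=dynamic
secret=ChangeMe
context=internal
disallow=all
allow=alaw
allow=ulaw
allow=g722

; extensions.conf -- the matching context: ring the phone, fall back to voicemail
[internal]
exten => 6001,1,Dial(SIP/6001,20)
 same => n,VoiceMail(6001@default,u)
 same => n,Hangup()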

There are a lot of Asterisk-based distributions, starting with FreePBX, and derivatives such as AsteriskNow, Elastix, or alternatives such as PBXinaflash.
The purpose of these systems is to provide end users with a clear web interface to manage their setup.
This is usually a good way to manage your setup. However, when dealing with several servers, each with its own local dialplan, configuring trunks, routes, user extensions, … and guaranteeing all your users are offered the very same service, you will spend quite a lot of time doing repetitive checks, and sporadically fixing typos and unexpected configurations.

Before leaving Smile, I worked on a Puppet class that could deploy Asterisk and configure everything from Hiera arrays. No frontend, except for some nginx distributing phone configurations. A minimalistic setup, based on Elastix/ASTDB-based generated contexts and embedded applications.
I didn’t have the time, nor the guts, to finish it. Today, I have a working PoC, involving my Freephonie SIP account, a couple of software and hardware phones, voicemail, DND, CFW, …
And last but not least: the hardware phones’ default configuration locks them to a private context. Users may dial their extension number and authenticate with a PIN to get their phone re-configured with their extension.

Most of the work is publicly available on my GitLab.

Ceph Disk Failure

Last week, one of my five Ceph hosts was unreachable.
Investigating, I noticed the OSD daemons were still running. Only the daemons using the root file system were affected: they had either crashed (the local Ceph MON daemon) or were unable to process requests (the SSH daemon was still answering, cleanly closing the connection).

After rebooting the system and looking at the logs, I could see a lot of I/O errors. I left the console logged in as root, waiting for the next occurrence.
Having no spare 60GB SSD, I ordered one.

Two days later, the same problem occurred. From the console, I was unable to run anything (mostly segfaults and ENOENT).
Again, I was able to reboot. This time, I dropped a couple of LVs, unmounted the swap partition, and resized my VG to make sure I had a fair amount of unallocated space on my faulty disk.
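For the record, and assuming an LVM layout similar to mine (volume group and logical volume names below are made up), this boils down to something like:

# disable swap and drop a couple of LVs, leaving free extents in the VG
swapoff /dev/vg0/swap
lvremove -y vg0/swap vg0/spare
vgs vg0    # check how much unallocated space is left on the faulty disk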

The problem persisted, while the average uptime kept getting shorter.
I progressively disabled the local OSSEC daemon, Puppet, a few crontabs, collectd, munin, … keeping only Ceph, Nagios and SSH running. The problem kept happening, every 12 to 48 hours.

This morning, the server wasn’t even able to boot.
Checking the BIOS, my root SSD wasn’t detected.
Attaching it to a USB dock, I had to wait a couple of minutes before the disk was actually detected by my laptop (Ubuntu 14.04.02) and my desktop (Debian 7.8).
I caught a break when receiving my new disk at 11 AM.
Running dd from the faulty disk to the new one took around 50 minutes (20MB/s, I can’t believe it!).
Re-syncing the OSDs (1x512GB SSD, 2x4TB & 1x3TB HDD) after 8 hours of downtime took around half an hour. Knowing I run a fairly busy mail server, some NNTP index, … this is a tangible new improvement brought by Hammer over Firefly.
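The clone itself was nothing fancier than a raw dd from the old disk to the new one; device names below are hypothetical, double-check yours before running anything similar:

# clone the faulty SSD (/dev/sdb) to its replacement (/dev/sdc),
# ignoring read errors and padding unreadable blocks with zeroes
dd if=/dev/sdb of=/dev/sdc bs=4M conv=noerror,sync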

I’m now preparing to send the faulty disk back to my reseller for replacement. At least I’ll have one handy for the next failure.

Moral of the story: cheap is unreasonable. Better be lucky.

OSSEC

Looking for jobs on Elance and UpWork, I stumbled upon this proposal, quoting a blog post about hardening security on a small network, according to PCI standards.
Having already heard of Snort, Auditd, mod_security and Splunk, I was quite curious about OSSEC.

OSSEC’s purpose is to keep an eye on your systems’ integrity, and raise alerts upon suspicious changes.
Arguably, it could be compared to Filetraq, though it’s intelligent enough to qualify as an IDS.

The installation process is pretty straightforward. The main difficulty is to properly create a certificate for ossec-authd, then register all your nodes, and not forget to shut ossec-authd down once you’re done deploying agents.
Using the Wazuh packages (Debian and Ubuntu only), almost everything is pre-configured. They’re not perfect though: if you happen to install both ossec-hids and ossec-hids-agent, a few files end up defined twice; upon installing the second package, the first one gets partially removed, and you lose files such as /etc/init.d/ossec, preventing the last package from installing properly. You’ll have to purge both packages, wipe /var/ossec from your filesystem, and reinstall either ossec-hids or ossec-hids-agent.
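For reference, and outside of any packaging quirks, the certificate and registration steps boil down to the following sketch (the manager address is an example):

# on the manager: create the key and certificate ossec-authd expects, then start it
openssl genrsa -out /var/ossec/etc/sslmanager.key 2048
openssl req -new -x509 -key /var/ossec/etc/sslmanager.key \
    -out /var/ossec/etc/sslmanager.cert -days 365
/var/ossec/bin/ossec-authd -p 1515 &

# on each agent: register against the manager, then restart the agent
/var/ossec/bin/agent-auth -m 10.42.0.1 -p 1515
/var/ossec/bin/ossec-control restart

# back on the manager, once all agents are deployed: shut ossec-authd down
pkill ossec-authd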

Setting it up on my Kibana server, I ended up writing a module dealing with both agent and master server setup, as well as installing ossec-webui from GitHub.
Note: the module does not deal with installing the initial key on your main OSSEC instance. As explained in the module README, you need to install it in /var/ossec/etc prior to starting the ossec service.
Past that step, Puppet deals with everything else, including agent registration to your main instance.
Also note that ossec-webui is not the only web frontend to OSSEC. There’s also Analogi, which I haven’t tried yet, mainly because I don’t want to install yet another MySQL.

During my first tests, I noticed a small bug cutting communications from an agent to its master.
More details about this over here. Checking /var/ossec/logs/ossec.log, you will find the ID corresponding to your faulty node. Stopping OSSEC, removing /var/ossec/queue/rids/$your_id and starting OSSEC back up should be enough.
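On the affected agent, the whole fix looks like this (007 standing for whatever ID you found in ossec.log):

/var/ossec/bin/ossec-control stop
rm /var/ossec/queue/rids/007
/var/ossec/bin/ossec-control start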

Another problem that can occur is nodes showing up as inactive in the web manager, while the agent seems to be running properly. Your manager logs would contain something like:

ossec-agentd(pid): ERROR: Duplicated counter for 'fqdn'.
ossec-agentd(pid): WARN: Problem receiving message from 10.42.X.X

Once you’ve identified something like this, run /var/ossec/bin/manage_agents on your manager, and drop the existing key for the incriminated agents. Then connect to your agents, drop the content of /var/ossec/queue/rids/, stop the ossec service, create a new key with /var/ossec/bin/agent-auth and restart ossec.
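In practice, the sequence looks like the following sketch (the manager address is an example, and manage_agents is interactive):

# on the manager: remove the existing key for the incriminated agent
/var/ossec/bin/manage_agents    # choose (R)emove, then the agent ID

# on the agent: drop the rids counters and re-register
/var/ossec/bin/ossec-control stop
rm -f /var/ossec/queue/rids/*
/var/ossec/bin/agent-auth -m 10.42.0.1
/var/ossec/bin/ossec-control start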

OSSEC (screenshot)

OSSEC logs view (screenshot)

Stay tuned for my next commits: I will be adding FreeBSD support as soon as I have built the corresponding package on my RPi.

OpenDKIM

OpenDKIM is the reference implementation dealing with DomainKeys Identified Mail (DKIM), maintained by the OpenDKIM Project and licensed under the new BSD License.

DKIM validation aims to prevent email spoofing by checking incoming mail from a remote domain for a DKIM header specifying a signature. The given signature is validated against a DNS record, held by the remote domain, publishing its public key.
Having your mail properly signed increases your chances of avoiding spam detection.
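Concretely, the published key is nothing more than a TXT record under a selector of the signing domain; for a hypothetical selector named mail on example.com (public key truncated), it would look like:

mail._domainkey.example.com. IN TXT "v=DKIM1; k=rsa; p=MIGfMA0GCSqGSIb3DQEB...AB"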

DKIM was first defined in RFC 4870, then superseded by RFC 4871, then updated by RFC 5672, and finally by RFC 6376.

Setting OpenDKIM up is a five-minute deal.
The whole setup process is best described by my latest Puppet module, which also deals with replicating private keys to all your signing SMTPs, writing your public keys to your DNS and refreshing your zone configuration.
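If you prefer doing it by hand, the manual steps are roughly the following, assuming Postfix as the MTA and OpenDKIM listening on inet:localhost:8891 (selector, domain and paths are examples):

# generate a key pair; mail.txt will contain the TXT record to publish in your zone
opendkim-genkey -D /etc/opendkim/keys/example.com -s mail -d example.com

# /etc/postfix/main.cf -- hand mail over to the milter for signing/verification
smtpd_milters = inet:localhost:8891
non_smtpd_milters = inet:localhost:8891
milter_default_action = accept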

Pakiti

Pakiti is a patching-status monitoring tool developed within CERN, more recently taken over by CESNET and another CERN (the Czech Educational and Research Network).

It is quite handy for dealing with RHEL and Debian-flavored operating systems.
Having tried to add FreeBSD/OpenBSD support, I can tell it is doable, although I haven’t succeeded yet.

Pakiti dashboard (screenshot)

Common pakiti setups involve a main server and several clients.
The main server usually has its own MySQL database. Periodically, it retrieves update digests from the configured repositories.
The client side consists of a small script whose main purpose is to list all locally installed packages and their corresponding versions. Once done, the client connects to its master, sends its report and finally prints a list of outdated packages.

Obviously, the main advantage of this over the direct alternatives (such as the apt and apt_all plugins from munin) is that you don’t have to periodically retrieve the latest package headers from your repositories on every host. The Pakiti master does it once, sparing you what could become I/O storms (assuming you have a physical host running around 20 VMs with the apt_all plugin activated on all of them: statistically, you are running apt-get update once per minute).

The dark side of Pakiti is the lack of official packages and updates, and the fact that you need to patch it to have it work with the latest PHP versions.
Don’t expect to run Pakiti right after unpacking their archive, and beware of their incomplete documentation. You might prefer to have a look at my Puppet module, which prepares almost everything.


(2015/06/08) edit: discussing with Lafouine41 a few weeks ago, I had misunderstood that pakiti3 was ready to be released. After further investigation, I noticed the client wasn’t even Debian-capable.
Late one night, I contributed the following patches:

Collectd

Collectd is yet another popular monitoring solution.

Collectd uses a single modular agent dealing with both metrics collection and re-transmission.

Like Munin, Collectd uses RRD.
Unlike Munin, Collectd collects data locally (without connecting to a remote host: the agent generates its own RRD files locally). Additionally, you may load a plugin (network) allowing you both to forward your metrics to some remote host, and to gather metrics from remote peers.
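A typical agents-to-collector setup takes a few lines of collectd.conf on each side (the address below is an example):

# on the agents: forward metrics to a central host
LoadPlugin network
<Plugin network>
  Server "10.42.0.10" "25826"
</Plugin>

# on the central host: accept metrics from remote peers
LoadPlugin network
<Plugin network>
  Listen "0.0.0.0" "25826"
</Plugin>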

Like Munin, Collectd comes with a list of common plugins.
Unlike Munin’s, Collectd plugins are libraries. Creating one is harder, and mostly pointless, as Collectd provides plugins allowing you to run a script.
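The usual way to hook a script in is the exec plugin; a hypothetical example (the script is expected to print PUTVAL lines on stdout):

# collectd.conf: run a custom script as an unprivileged user
LoadPlugin exec
<Plugin exec>
  Exec "nobody" "/usr/local/bin/my_probe.sh"
</Plugin>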

More recently, Collectd introduced a filtering mechanism (filter chains) modeled after iptables, allowing you to match metrics using regular expressions and handle them differently.
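As an illustration, here is a small chain dropping loopback interface metrics before they reach the cache (standard match_regex syntax; adapt the rule to your own needs):

# collectd.conf
LoadPlugin match_regex
<Chain "PreCache">
  <Rule "drop_loopback">
    <Match "regex">
      Plugin "^interface$"
      PluginInstance "^lo$"
    </Match>
    Target "stop"
  </Rule>
  Target "write"
</Chain>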

Collectd comes with a very basic web interface – collection3. You may try it, then you’ll quickly move to the first alternative you can get your hands on. An easy-to-set-up Collectd frontend is SickMuse, built on Bootstrap.

Munin

Munin is a popular supervision solution, mostly written in Perl.

As with most supervision and monitoring solutions, Munin relies on an agent you install on the hosts you want to track metrics from, and a server gathering these metrics and serving them via CGI, generating graphs from the collected data.
As with most supervision and monitoring solutions, Munin uses loadable probes (plugins) to extract metrics.

The server side consists of a crontab. Iterating over all your hosts, it lists and executes all available probes, and stores the results into RRD files.
Beyond a thousand hosts, you may need to move Munin’s working root to some tmpfs storage, adding a crontab to regularly rsync the content of this tmpfs to a persistent directory. In any case, hosting your filesystem on an SSD is relevant.
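On the server side, collecting everything a node serves takes a single host declaration in munin.conf; re-using the node from the session below as an example:

# /etc/munin/munin.conf
[cronos.adm.intra.unetresgrossebite.com]
    address 10.42.242.98
    use_node_name yes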

The main advantage of Munin over its alternatives, I think, is its simplicity.
The main competitor here is Collectd, whose default web frontend is particularly ugly and impractical. Alternate frontends such as SickMuse are nice-looking, but still lack simplicity, while some require excessively redundant configuration before being able to present something comparable to Munin in terms of relevance and efficiency.

It is also very easy to implement your own probes. From scripts to binaries, with a simple way to set contextual variables (evaluated when the plugin name matches the context expression), and a single declaration on the server being enough to collect all metrics served by a node: I don’t know of any supervision solution easier to deploy, or to incorporate into your orchestration infrastructure.
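As an illustration, a perfectly valid probe can be as small as the following shell script (hypothetical name and metric), honouring the config/fetch protocol shown in the telnet session below:

#!/bin/sh
# /etc/munin/plugins/logged_users -- hypothetical minimal Munin probe
case "$1" in
    config)
        echo "graph_title Logged in users"
        echo "graph_vlabel users"
        echo "users.label users"
        exit 0
        ;;
esac
echo "users.value $(who | wc -l)"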

Finally, debugging probes is conceptually easy: all you need is telnet.

muninone:~# telnet 10.42.242.98 4949
Trying 10.42.242.98...
Connected to 10.42.242.98.
Escape character is '^]'.
# munin node at cronos.adm.intra.unetresgrossebite.com
list
cpu df load memory ntp_kernel_err ntp_kernel_pll_freq ntp_kernel_pll_off open_files processes users vmstat
fetch cpu
user.value 3337308
nice.value 39168
system.value 13389003
interrupt.value 6679408
idle.value 701005749
.
quit
Connection closed by foreign host.
muninone:~#

Here is a bunch of plugins I use that you won’t find in default installations, most of them copied from public repositories such as GitHub.

Let’s introduce the few probes I wrote, starting with pool_. You may use it to wrap any Munin probe. Its main advantage is that, since it boils down to a single cat, it answers quickly and prevents you from generating discontinuous graphs.
I used it in conjunction with the esx_ probe (found somewhere / patched, mostly not mine), which may take several seconds to complete. When a probe takes too long to answer, Munin times out and discards the data. Using a crontab to poll and keep your metrics in temporary files, independently from Munin’s traditional metrics retrieval process, and using the pool_ probe to present these temporary files, is a way to deal with probes requiring unusual processing time.
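I won’t paste the actual probe here, but the idea boils down to something like the following (paths and names are made up): a crontab refreshes cache files outside of Munin’s collection cycle, and the wildcard probe merely cats them back.

# crontab: refresh both the config and the values every 5 minutes
*/5 * * * * munin /etc/munin/plugins/esx_myhost config > /var/lib/munin/pool/esx_myhost.config
*/5 * * * * munin /etc/munin/plugins/esx_myhost > /var/lib/munin/pool/esx_myhost.fetch

#!/bin/sh
# /etc/munin/plugins/pool_esx_myhost -- serve the cached output instantly
name="${0##*/pool_}"
if [ "$1" = "config" ]; then
    cat "/var/lib/munin/pool/${name}.config"
else
    cat "/var/lib/munin/pool/${name}.fetch"
fi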

Less relevant: the freebox_ probe, for the Freebox V5 (inspired by a similar probe dealing with the Freebox V6):

# munin node at cronos.adm.intra.unetresgrossebite.com
fetch freebox_atm
atm_down.value 6261
atm_up.value 1025
.
fetch freebox_attenuation
attenuation_down.value 44.00
attenuation_up.value 24.80
.
fetch freebox_snr
snr_down.value 6.40
snr_up.value 7.20
.
fetch freebox_status
status.value 1
.
fetch freebox_uptime
uptime.value 131.73
.

Multi Router Traffic Grapher

Also known as MRTG, this solution is perfect for generating graphs from SNMP checks.
MRTG is most likely to be used to graph bandwidth and throughput: running its configuration-generation utility (cfgmaker) without options outputs everything related to your SNMP target’s network interfaces. That said, MRTG can virtually deal with anything your SNMP target can send back (disks, load, memory, …).
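Bootstrapping a configuration for a hypothetical SNMP target looks like this (community string and host name are examples):

# generate a configuration covering every network interface of the target
cfgmaker --output=/etc/mrtg/mrtg.cfg public@router.example.com
# build a static HTML index out of it
indexmaker --output=/var/www/mrtg/index.html /etc/mrtg/mrtg.cfg
# mrtg itself is then run periodically (e.g. from cron) against that configuration
mrtg /etc/mrtg/mrtg.cfg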

MRTG is very modular. Metrics may be stored in several formats, and viewed through several interfaces.
Shipped with its own HTML index generator, MRTG’s main task is to collect SNMP data.

The default web user interface is more of a test page than an actual supervision front page.
From there, several solutions came up, and others grew, aiming to wrap MRTG’s greatness in more-or-less powerful interfaces.
14all.cgi is one of the more popular frontends to MRTG, providing a very simple, yet powerful, interface to your metrics.
router2.cgi looks more exhaustive on paper – I haven’t tested it yet.
Solutions such as Cacti may also use MRTG.

Another popular solution to use on top of MRTG is PHP-Weathermap (not to be confused with Weathermap4rrds: their configuration syntaxes are almost identical, but the latter lacks a few features compared to the original).
Derivatives may be seen on OVH’s own network weathermap (by the way, my congratulations to the intern who most likely spent several days configuring this map). You get the idea: placing and connecting your objects, and associating your rrdtool collections with specific links.

So far, I’ve mostly used 14all.cgi, which is awesome for dealing with graphs – generating graphs by clicking specific points on an image is not possible with the generic MRTG interface.
Still, I’m not convinced by the overall presentation of these graphs. While working for Smile, my manager kept nagging me about this. I ended up writing my own client, re-using Munin’s CSS and 14all.cgi’s graphs, then adding weathermap support, and finally keeping a copy of the generated maps to provide a history to web clients.

ceph-dash

Having recently finished re-installing my cloud environments, I am now focusing on setting my supervision and monitoring services back up.

Last week, a friend of mine (Pierre-Edouard Gosseaume) told me about his experience with ceph-dash, a dashboard for Ceph I hadn’t heard of back then.
Like most Ceph users, I’ve heard of Calamari: an orgy of languages, frameworks and technologies, which I ended up building by myself and deploying on a test cluster I used to operate at Smile.
Calamari is sort of a fiasco. The whole stack gets fucked up by the underlying component: Saltstack.
Saltstack is yet another configuration deployment solution, like Puppet, Ansible or Rundeck.
With Calamari, the calamari-server instance uses Saltstack to communicate with its clients. As far as I could see, the Saltstack service randomly stops running on clients, until none of them responds to the server’s queries. A per-minute cron is required to keep your queries somewhat consistent. It’s a mess: I’ve never installed Calamari on a production cluster, and would recommend waiting at least for some pre-packaged release.

So, back to ceph-dash.
My first impression was mixed, at best. Being distributed on GitHub by some “Crapworks“, I had my doubts.
On second thought, you can see they have a domain, crapworks.de. Deutsche Qualität; maybe Germans grant some hidden meaning to the “crap” thing, all right.

Again, there’s no package shipped. But as with Calamari, the ceph-dash makefile allows you to build deb packages.
Unlike Calamari, ceph-dash is a very lightweight tool, based mostly on Python, inducing low overhead, and able to run entirely outside of your Ceph cluster.
Even if the documentation tells you to install ceph-dash on a MON host of your cluster, you may as well install it on some dedicated VM, as long as you have the right librados installed, a correct /etc/ceph/ceph.conf, and a valid keyring to access the cluster.
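In other words, a dedicated VM only needs something along these lines; the FSID, monitor addresses and client name below are made up, and the key has to be created on the cluster first (e.g. with ceph auth get-or-create client.dashboard mon 'allow r'):

# /etc/ceph/ceph.conf
[global]
fsid = 00000000-0000-0000-0000-000000000000
mon_host = 10.42.0.11, 10.42.0.12, 10.42.0.13

# /etc/ceph/ceph.client.dashboard.keyring
[client.dashboard]
    key = <the key returned by ceph auth get-or-create>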

ceph-dash ships with a small script running the service for testing purposes. It also ships with the necessary configuration for Apache integration, easily convertible to Nginx.
Going from zero to dashboard takes about 5 minutes, which, again, is vastly different from my experience with Calamari.
The major novelty being: it actually works.