
Fail2ban & ELK

Following up on a previous post regarding Kibana and the recent ELK 5 release, today we'll configure a map visualizing hosts as Fail2ban blocks them.

Having installed Fail2ban and configured the few jails that are relevant for your system, look for Fail2ban's log file path (the logtarget variable, defined in /etc/fail2ban/fail2ban.conf, defaults to /var/log/fail2ban.log on Debian).
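
If in doubt, a quick grep tells you where Fail2ban is logging to; assuming the Debian defaults, it should print something like:

# grep -i '^logtarget' /etc/fail2ban/fail2ban.conf
logtarget = /var/log/fail2ban.log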

The first thing we need is to have Rsyslog process Fail2ban logs. Here, we use Rsyslog's file input module (imfile), and force the use of FQDNs instead of short hostnames.

$PreserveFQDN on
module(load="imfile" mode="polling" PollingInterval="10")
input(type="imfile"
  File="/var/log/fail2ban.log"
  statefile="/var/log/.fail2ban.log"
  Tag="fail2ban: "
  Severity="info"
  Facility="local5")

Next, we'll configure Rsyslog to forward messages to some remote syslog proxy (which will, in turn, forward its messages to Logstash). Here, I usually rely on Rsyslog's RELP protocol (may require installing the rsyslog-relp package), as it addresses some of UDP's flaws without the limitations of traditional TCP syslog.

Relaying syslog messages to our syslog proxy: load the RELP output module, then make sure your Fail2ban logs get relayed.

$ModLoad omrelp
local5.* :omrelp:rsyslog.example.com:6969

Before restarting Rsyslog, if it's not already the case, make sure the remote system will accept your logs. You will need to load Rsyslog's RELP input module: make sure rsyslog-relp is installed, then add the following to your rsyslog configuration:

$ModLoad imrelp
$InputRELPServerRun 6969

You should be able to restart Rsyslog on both your Fail2ban instance and syslog proxy.

Relaying messages to Logstash, we will use JSON messages instead of traditional syslog formatting. To configure Rsyslog to forward JSON messages to Logstash, we would use the following:

template(name="jsonfmt" type="list" option.json="on") {
    constant(value="{")
    constant(value="\"@timestamp\":\"") property(name="timereported" dateFormat="rfc3339")
    constant(value="\",\"@version\":\"1")
    constant(value="\",\"message\":\"") property(name="msg")
    constant(value="\",\"@fields.host\":\"") property(name="hostname")
    constant(value="\",\"@fields.severity\":\"") property(name="syslogseverity-text")
    constant(value="\",\"@fields.facility\":\"") property(name="syslogfacility-text")
    constant(value="\",\"@fields.programname\":\"") property(name="programname")
    constant(value="\",\"@fields.procid\":\"") property(name="procid")
    constant(value="\"}\n")
  }
local5.* @logstash.example.com:6968;jsonfmt

To configure Logstash to process Fail2ban logs, we first need to define a few patterns. Create some /etc/logstash/patterns directory. In there, create a file fail2ban.conf with the following content:

F2B_DATE %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[ ]%{HOUR}:%{MINUTE}:%{SECOND}
F2B_ACTION (\w+)\.(\w+)\[%{POSINT:pid}\]:
F2B_JAIL \[(?<jail>\w+\-?\w+?)\]
F2B_LEVEL (?<level>\w+)\s+
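
For reference, these patterns are meant to match Fail2ban log lines which, depending on your Fail2ban version, look more or less like this (IP and jail name being examples):

2017-02-21 14:06:42,452 fail2ban.actions[1937]: NOTICE [ssh] Ban 192.0.2.11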

Then, configuring Logstash to process these messages, we define an input dedicated to Fail2ban. Having tagged Fail2ban events, we apply a Grok filter identifying blocked IPs and adding geo-location data. We'll also include a sample output configuration, writing to ElasticSearch.

input {
  udp {
    codec => json
    port => 6968
    tags => [ "firewall" ]
  }
}

filter {
  if "firewall" in [tags] {
    grok {
      patterns_dir => "/etc/logstash/patterns"
      match => {
        "message" => [
          "%{F2B_DATE:date} %{F2B_ACTION} %{F2B_LEVEL:level} %{F2B_JAIL:jail} %{WORD:action} %{IP:ip} %{GREEDYDATA:msg}?",
          "%{F2B_DATE:date} %{F2B_ACTION} %{WORD:level} %{F2B_JAIL:jail} %{WORD:action} %{IP:ip}"
        ]
      }
    }
    geoip {
      source => "ip"
      target => "geoip"
      database => "/etc/logstash/GeoLite2-City.mmdb"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}" ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}

output {
  if "_check_logstash" not in [tags] and "_grokparsefailure" not in [tags] {
    elasticsearch {
      hosts => [ "elasticsearch1.example.com", "elasticsearch2.example.com", "elasticsearch3.example.com" ]
      index => "rsyslog-%{+YYYY.MM.dd}"
    }
  }
  if "_grokparsefailure" in [tags] {
    file { path => "/var/log/logstash/failed-%{+YYYY-MM-dd}.log" }
  }
}

Note that while I would have recommended using a RELP input last year, running Logstash 2.3, this plugin is no longer available as of Logstash 5. Hence I would recommend setting up an Rsyslog proxy on your Logstash instance, in charge of relaying RELP messages to Logstash as UDP, through your loopback.
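
Such a proxy configuration, on the Logstash instance, could boil down to something like this sketch (re-using the jsonfmt template shown above, ports being arbitrary):

module(load="imrelp")
input(type="imrelp" port="6969")
local5.* @127.0.0.1:6968;jsonfmt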

Moreover, assuming you need to forward messages over a public or untrusted network, Rsyslog's RELP modules could be used with Stunnel encapsulation. On Debian, RELP output with TLS certificates does not seem to work as of today.

That being said, before restarting Logstash, if you didn't already, make sure to define a template setting the geoip type to geo_point (otherwise it shows up as a string, and won't be usable for defining maps). Create some index.json file with the following:

{
  "mappings": {
    "_default_": {
      "_all": { "enabled": true, "norms": { "enabled": false } },
      "dynamic_templates": [
        { "template1": { "mapping": { "doc_values": true, "ignore_above": 1024, "index": "not_analyzed", "type": "{dynamic_type}" }, "match": "*" } }
      ],
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "string", "index": "analyzed" },
        "offset": { "type": "long", "doc_values": "true" },
        "geoip": { "type": "object", "dynamic": true, "properties": { "location": { "type": "geo_point" } } }
      }
    }
  },
  "settings": { "index.refresh_interval": "5s" },
  "template": "rsyslog-*"
}

kibana search

Post your template to ElasticSearch:

root@logstash# curl -XPUT http://elasticsearch.example.com:9200/_template/rsyslog?pretty -d@index.json
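
You may check the template got properly registered before going further:

root@logstash# curl -XGET http://elasticsearch.example.com:9200/_template/rsyslog?pretty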

You may now restart Logstash. Check logs for potential errors, …

kibana map

Now to configure Kibana: start by searching for Fail2ban logs. Save your search, so it can be re-used later on.

Then, in the Visualize panel, create a new Tile Map visualization and pick the search query you just saved. When asked to select a bucket type, click Geo Coordinates. Insert a label, then click the Play icon to refresh the sample map on your right. Save your visualization once you're satisfied: these may be re-used when defining dashboards.

BlueMind Best Practices

Following up on a previous post introducing BlueMind, today we'll focus on good practices (OpenDKIM, SPF, DMARC, ADSP, Spamassassin, firewalling) while setting up their latest available release (3.5.2) – and mail servers in general – and migrating my former setup (3.14).

A first requirement, to ease creating mailboxes and manipulating passwords during the migration process, would be for your user accounts and mailing lists to be imported from some LDAP backend.
Assuming you do have an LDAP server, you will need to create some service account in there, so BlueMind can bind. That account should have read access to your userPassword, pwdAccountLockedTime, entryUUID, entryDN, createTimestamp, modifyTimestamp and shadowLastChange attributes (assuming these do exist in your schema).
If you also want to configure distribution lists from LDAP groups, then you may want to load the dyngroup schema.
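
Assuming OpenLDAP with cn=config, a hypothetical ACL granting such a service account read access to passwords could look like the following (adjust the bind DN and attribute list to your schema):

# hypothetical bind DN: cn=bluemind,ou=services,dc=example,dc=com
olcAccess: to attrs=userPassword,pwdAccountLockedTime,shadowLastChange
  by dn.exact="cn=bluemind,ou=services,dc=example,dc=com" read
  by self write
  by anonymous auth
  by * none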

Another key to migrating your mail server would be to duplicate messages from one backend to another. Granted that your mailboxes already exist on both sides, that some IMAP service is running on both sides, and that you can temporarily force your users' passwords, then a tool you should consider is ImapSync. ImapSync can run for days, duplicating billions of mails in thousands of folders, … Its simplicity is its strength.
Ideally, we would set up our new mail server without moving our MXs yet. The first ImapSync run could take days to complete. From there, the next runs would only duplicate new messages and should go way faster: you may consider re-routing new messages to your new servers, and continue running ImapSync until your former server stops receiving messages.

To give you an idea, my current setup involves less than 20 users, a little under 1,500,000 messages in about 400 folders, using around 30G of disk space. The initial ImapSync job ran for about 48 hours, the next 3 passes ran for less than one hour each.
A few years back, I did something similar involving a lot more mailboxes: in such cases, having some frontal Postfix, routing messages to either your former or newer backend depending on the mailbox being reached, can be a big help in progressively migrating users from one to the other.

 

Now let’s dive into BlueMind configuration. We won’t cover the setup tasks as you’ll find these in BlueMind documentation.

Assuming you managed to install BlueMind, and that your user base is stored in some LDAP, make sure to install BlueMind's LDAP import plugin:

apt-get install bm-plugin-admin-console-ldap-import bm-plugin-core-ldap-import

Note that if you were used to previous BlueMind releases, you will now need to grant each user access to your BlueMind services, including the webmail. By default, a newly created user account may only access its own settings.
The management console allows you to grant such permissions. There's a lot of new stuff in this release: multiple calendars per user, external calendars support, large files handling detached from the webmail, …

 

The first thing we will configure is some firewall.
Note that BlueMind recommends setting up their service behind some kind of router, avoiding exposing your instance directly to the Internet. Which is good practice when hosting pretty much anything anyway. Even so, you may want to set up some firewall.
Note that having your firewall up and running may lead to the BlueMind installation script failing to complete: make sure to keep it down until you're done with the BlueMind installer.

On Debian, you may find the Firehol package provides an easy-to-configure firewall.
The services you would need to open for public access are smtp (TCP:25), http (TCP:80), imap (TCP:143), https (TCP:443), smtps (TCP:465), submission (TCP:587) and imaps (TCP:993).
Assuming Firehol, your configuration would look like this:

mgt_ips="1.2.3.4/32 2.3.4.5/32"
interface eth0 WAN
    protection strong
    server smtp accept
    server http accept
    server imap accept
    server https accept
    server smtps accept
    server submission accept
    server imaps accept
    server custom openssh "tcp/1234" default accept src "$mgt_ips"
    server custom munin "tcp/4949" default accept src "$mgt_ips"
    server custom nrpe "tcp/5666" default accept src "$mgt_ips"
    client all accept

You may then restart your firewall. To be safe, you could restart BlueMind as well.

Optionally, you may want to use something like Fail2ban, also available on Debian. You may not be able to track all abusive accesses, although you could lock out SMTP authentication brute-forces at the very least, which is still relevant.

Note that BlueMind also provides a plugin you could install from the packages cache extracted during BlueMind installation:

apt-get install bm-plugin-core-password-bruteforce

 

The second thing we would do is to install some valid certificate. These days, services like LetsEncrypt issue free x509 certificates.
Still assuming Debian, a LetsEncrypt client is available in jessie-backports: certbot. This client would either need access to some directory served by your webserver, or would need to bind to your TCP port 443, so that LetsEncrypt may validate that the common name you are requesting actually routes back to the server issuing the request. In the latter case, we would do something like this:

# certbot certonly --standalone --text --email me@example.com --agree-tos --domain mail.example.com --renew-by-default

Having figured out how you’ll generate your certificate, we will now want to configure BlueMind services loading it in place of the self-signed one generated during installation:

# cp -p /etc/ssl/certs/bm_cert.pem /root/bm_cert.pem.orig
# cat /etc/letsencrypt/live/$CN/privkey.pem /etc/letsencrypt/live/$CN/fullchain.pem >/etc/ssl/certs/bm_cert.pem

Additionally, you will want to edit postfix configuration using LetsEncrypt certificate chain. In /etc/postfix/main.cf, look for smtpd_tls_CAfile and set it to /etc/letsencrypt/live/$CN/chain.pem.
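
The relevant part of /etc/postfix/main.cf would then look something like the following (paths assuming a mail.example.com certificate, and bearing in mind BlueMind may manage some of these settings for you):

smtpd_tls_cert_file = /etc/ssl/certs/bm_cert.pem
smtpd_tls_key_file = /etc/ssl/certs/bm_cert.pem
smtpd_tls_CAfile = /etc/letsencrypt/live/mail.example.com/chain.pem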

You may now reload or restart postfix (smtps) and bm-nginx (both https & imaps).

Note that LetsEncrypt certificates are valid for 3 months. You’ll probably want to install some cron job renewing your certificate, updating /etc/ssl/certs/bm_cert.pem then reloading postfix and bm-nginx.
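
Such a job could look like the following sketch, to be adapted to your CN and to how you prefer reloading BlueMind services:

# /etc/cron.d/letsencrypt-bluemind (hypothetical)
0 4 * * 1 root certbot renew --quiet && cat /etc/letsencrypt/live/mail.example.com/privkey.pem /etc/letsencrypt/live/mail.example.com/fullchain.pem > /etc/ssl/certs/bm_cert.pem && service postfix reload && service bm-nginx reload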

 

The next thing we would configure is OpenDKIM signing our outbound messages and validating signatures of inbound messages.
Debian has some opendkim package embedding everything you'll need on a mail relay. To generate keys, you will also need to install opendkim-tools.

Create some directory layout and your keys:

# cd /etc
# mkdir dkim.d
# cd dkim.d
# mkdir keys keys/example1.com keys/example2.com keys/exampleN.com
# for d in example1 example2 exampleN; do \
    ( cd keys/$d.com; opendkim-genkey -r -d $d.com ); done
# chmod 0640 */default.private
# chown root:opendkim */default.private

In each /etc/dkim.d/keys subdirectory, you will find a default.txt file containing the DNS record you should add to the corresponding zone. Its content would look like the following:

default._domainkey IN TXT "v=DKIM1; k=rsa; p=PUBLICKEYDATA"

You should have these DNS records ready prior to configuring Postfix to sign your messages.
Having generated our keys, we still need to configure OpenDKIM to sign messages. On Debian, the main configuration is /etc/opendkim.conf, and should contain something like this:

Syslog yes
UMask 002
OversignHeaders From
KeyTable /etc/dkim.d/KeyTable
SigningTable /etc/dkim.d/SigningTable
ExternalIgnoreList /etc/dkim.d/TrustedHosts
InternalHosts /etc/dkim.d/TrustedHosts

You would need to create the 3 files we’re referring to in there, the first one being /etc/dkim.d/KeyTable:

default._domainkey.example1.com example1.com:default:/etc/dkim.d/keys/example1.com/default.private
default._domainkey.example2.com example2.com:default:/etc/dkim.d/keys/example2.com/default.private
default._domainkey.exampleN.com exampleN.com:default:/etc/dkim.d/keys/exampleN.com/default.private

The second one is /etc/dkim.d/SigningTable and would contain something like this:

example1.com default._domainkey.example1.com
example2.com default._domainkey.example2.com
exampleN.com default._domainkey.exampleN.com

And the last one, /etc/dkim.d/TrustedHosts, would contain a list of the subnets we should sign messages for, such as:

1.2.3.4/32
2.3.4.5/32
127.0.0.1
localhost

Now, let's ensure OpenDKIM starts on boot, editing /etc/default/opendkim with the following:

DAEMON_OPTS=
SOCKET="inet:12345@localhost"

Having started OpenDKIM and made sure the service is properly listening on TCP:12345, you may now configure Postfix relaying its messages to OpenDKIM. Edit your /etc/postfix/main.cf adding the following:

milter_default_action = accept
smtpd_milters = inet:localhost:12345
non_smtpd_milters = inet:localhost:12345

Restart Postfix, make sure mail delivery works. Assuming you can already send outbound messages, make sure your DKIM signature appears to be valid for other mail providers (an obvious one being gmail).
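
opendkim-tools also ships opendkim-testkey, which helps verifying that the record you published actually matches your private key:

# opendkim-testkey -d example1.com -s default -vvv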

 

Next, we’ll configure SPF validation of inbound messages.
On Debian, you would need to install postfix-policyd-spf-perl.

Let's edit /etc/postfix/master.cf, adding a service validating that inbound messages match the sender's domain SPF policy:

spfcheck unix - n n - 0 spawn
    user=policyd-spf argv=/usr/sbin/postfix-policyd-spf-perl

Next, edit /etc/postfix/main.cf, look for smtpd_recipient_restrictions. The last directive should be a reject_unauth_destination, and should precede the policy check we want to add:

smtpd_recipient_restrictions=permit_sasl_authenticated,
    permit_mynetworks,reject_unauth_destination,
    check_policy_service unix:private/policyd-spf

Restart Postfix, make sure you can still properly receive messages. Checked messages should now include some Received-SPF header.

 

Finally, we’ll configure SPAM checks and Spamassassin database training.
On Debian, you’ll need to install spamassassin.

Let’s edit /etc/spamassassin/local.cf defining a couple trusted IPs, and configuring Spamassassin to rewrite the subject for messages detected as SPAM:

rewrite_header Subject [ SPAM _SCORE_ ]
trusted_networks 1.2.3.4/32 2.3.4.5/32
score ALL_TRUSTED -5
required_score 2.0
use_bayes 1
bayes_auto_learn 1
bayes_path /root/.spamassassin/bayes
bayes_ignore_header X-Spam-Status
bayes_ignore_header X-Spam-Flag
ifplugin Mail::SpamAssassin::Plugin::Shortcircuit
shortcircuit ALL_TRUSTED on
shortcircuit BAYES_99 spam
shortcircuit BAYES_00 ham
endif

Configure Spamassassin service defaults in /etc/default/spamassassin:

CRON=1
ENABLED=1
NICE="--nicelevel 15"
OPTIONS="--create-prefs --max-children 5 -H /var/log/spamassassin -s /var/log/spamassassin/spamd.log"
PIDFILE=/var/run/spamd.pid

Make sure /root/.spamassassin and /var/log/spamassassin both exist.
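
If they don't, something like this would do:

# mkdir -p /root/.spamassassin /var/log/spamassassin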

Now let's configure Spamassassin to learn from BlueMind's SPAM folders content. Create or edit /etc/spamassassin/sa-learn-cyrus.conf with the following content:

[global]
tmp_dir = /tmp
lock_file = /var/lock/sa-learn-cyrus.lock
verbose = 1
simulate = no
log_with_tag = yes

[mailbox]
include_list = ''
include_regexp = '.*'
exclude_list = ''
exclude_regexp = ''
spam_folder = 'Junk'
ham_folder = 'Inbox'
remove_spam = yes
remove_ham = no

[sa]
debug = no
site_config_path = /etc/spamassassin
learn_cmd = /usr/bin/sa-learn
bayes_storage = berkeley
prefs_file = /etc/spamassassin/local.cf
fix_db_permissions = yes
user = mail
group = mail
sync_once = yes
virtual_config_dir = ''

[imap]
base_dir = /var/spool/cyrus/example_com/domain/e/example.com/
initial_letter = yes
domains = ''
unixhierarchysep = no
purge_cmd = /usr/lib/cyrus/bin/ipurge
user = cyrus

Look out for the sa-learn-cyrus script. Note that Debian provides a package with that name, which would pull cyrus in as a dependency – definitely not something you want on a BlueMind server.
Run this script to train Spamassassin from the messages in your Inboxes and Junk folders. Eventually, you could want to install some cron job.
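
Such a cron job could look like this sketch, assuming the script was installed under /usr/sbin and reads /etc/spamassassin/sa-learn-cyrus.conf by default:

# /etc/cron.d/sa-learn-cyrus (hypothetical)
30 3 * * * root /usr/sbin/sa-learn-cyrus >/dev/null 2>&1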

Start or restart Spamassassin service. Now, let’s configure Postfix piping its messages to Spamassassin. Edit /etc/postfix/master.cf, add the following service:

spamassassin unix - n n - - pipe
    user=debian-spamd argv=/usr/bin/spamc -f -e
    /usr/sbin/sendmail -oi -f ${sender} ${recipient}

Still in /etc/postfix/master.cf, locate the smtp service, and add a content_filter option pointing to our spamassassin service:

smtp      inet  n       -       n       -       -       smtpd
    -o content_filter=spamassassin

Restart Postfix. When sending or receiving messages, you should read about spamd in /var/log/mail.log. Moreover, an X-Spam-Checker-Version header should show up in your messages.

 

Prior to migrating your messages, make sure to mount /var/spool/cyrus on a separate device, and /var/backups/bluemind on some NFS share, LUN device, sshfs, … something remote, ideally.
Your /tmp may be mounted from tmpfs, and could use the nosuid and nodev options – although you can not set noexec.

Assuming you have some monitoring system running: make sure to keep an eye on mail queues, smtp service availability or disk space among others, …

 

Being done migrating your setup, the last touch would be to set proper SPF, ADSP and DMARC policies (DNS records).

Your SPF record defines which IPs may issue mail on behalf of your domain. Usually, you just want to allow your MXs (mx). Maybe you'll want to trust some additional record, … and deny everyone else (-all). The final record could look like this:

@ IN TXT "v=spf1 mx a:non-mx-sender.example.com -all"

Having defined your SPF policy, and assuming you properly configured DKIM signing, while the corresponding public key is published in your DNS, then you may consider defining some DMARC policy as well.

@ IN TXT "v=DMARC1; p=quarantine; pct=100; rua=mailto:postmaster@example.com"

And ADSP:

_adsp._domainkey IN TXT "dkim=all;"

… and that’s pretty much it. The rest’s up to you, and would probably be doable from BlueMind administration console.

UniFi

While upgrading my network, I stumbled upon some UniFi access point a friend left over, a few months ago. It’s been lying there long enough, today I’m powering it on.

UniFi are Ubiquiti devices. If you are not familiar with these, Ubiquiti aims to provide network devices (switches, routers, network cameras, access points) that are controlled by a centralized web service. Once a device is paired with your controller, the VLANs, SSIDs, … you've already defined are automatically deployed. Although I'm not at all familiar with their routers, switches or network cameras, I have already worked with their access points covering large offices with several devices, roaming, VLANs, freeradius, automated firmware upgrades, PoE, … These access points can do it all, have a pretty good range in comparison with the world-famous WRT54-G, and can serve hundreds of users simultaneously.

site settings

UniFi is also the name of the controller you would install on some virtual machine to manage your devices. Having installed their package on your system, you should be able to connect to your instance's port 8443 using https. Once you have created your administrative user, then a default SSID, you will be able to discover devices on your network and pair them to your controller.

The UniFi controller allows you to define a syslog host, SNMP community string, … It also shows a checkbox to "enable DPI", which requires a USG (UniFi Security Gateway). Pretty much everything is configured from this panel.

Map

The UniFi controller also allows you to import maps of the locations you're covering. I could recommend Sweet Home 3D to quickly draw such a map. On a small setup like mine, this is mostly bling-bling, still it's pretty nice to have.

events

The controller provides a centralized log, archiving events such as firmware updates, client connections, when a user experienced interference, … Pretty exhaustive.

client

The controller also offers detailed statistics per SSID, access point, client device, … and allows tagging or blocking hardware addresses.

stats

UniFi intends to provide enterprise-grade devices, and yet their cheapest access points' prices are comparable to the kind of SoHo solutions we all have running our home networks.

Even if you would never use most of the features of their controller, and even if you do not need to deploy several access points covering your area, UniFi access points can still be relevant for their ease of use, feature set and shiny manager. And especially if your setup involves several devices with similar configurations: you'll definitely want to give UniFi a shot.

Kibana

Kibana Map

Kibana is the visible part of a much larger stack, based on ElasticSearch, a distributed search engine based on Lucene, providing full-text search via a RESTful HTTP web service.

Kibana used to be a JavaScript application running from your browser, querying some ElasticSearch cluster and rendering graphs, maps, … listing, counting, … A couple of versions ago, Kibana was rewritten as a NodeJS service. Nowadays, Kibana can even be used behind a proxy, allowing you to configure some authentication, make your service public, …

ElasticSearch, on the other hand, is based on Java. The latest versions only work with Java 8. We won't make an exhaustive list of changes, as this is not our topic, although we could mention that ElasticSearch 5 no longer uses a multicast discovery protocol, and instead defaults to unicast. You may find plugins such as X-Pack, that would let you plug your cluster authentication into some LDAP or AD backend, configure Role-Based Access Control, … the kind of stuff that used to require a Gold plan with Elastic.co. And much more …

One of the most popular setups involving Kibana also includes Logstash, which arguably isn't really necessary. An alternative to "ELK" (ElasticSearch Logstash Kibana) could be to replace Logstash with Rsyslog, leveraging its ElasticSearch output module (omelasticsearch). Then again, Logstash can be relevant assuming you need to apply transformations to your logs (obfuscating passwords, generating geo-location data from an IP address, …). Among other changes, Logstash 5 requires GeoIP databases to comply with the new MaxMind DB binary format.
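
For reference, a minimal Rsyslog-to-ElasticSearch sketch, bypassing Logstash, could look like the following (server address and index naming being examples):

module(load="omelasticsearch")
template(name="es-index" type="string" string="rsyslog-%$YEAR%.%$MONTH%.%$DAY%")
local5.* action(type="omelasticsearch"
                server="elasticsearch1.example.com" serverport="9200"
                searchIndex="es-index" dynSearchIndex="on"
                bulkmode="on")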

Kibana Filters

All of these projects are evolving quickly – even rsyslog, whose 8th version doesn't look anything like version 5, which used to be on the previous Debian LTS.
Having reinstalled a brand new cluster to celebrate, I had to rewrite my Puppet modules deploying ElasticSearch and Kibana: pretty much everything changed, variables renamed, new repositories, introducing Kibana Debian packages, …

Kibana as well was subject to major changes: panels are no longer loaded from Kibana's yaml configuration file: something very similar to the previous default dashboard is loaded, and you may install additional plugins, as you may already do customizing Logstash or ElasticSearch. Meanwhile, Kibana is finally shipped as a Debian archive, corresponding init scripts are properly installed, … Way cleaner, rewriting this Puppet module was a pleasure.

Kibana Discovery

Note that the test setup I am currently running is hosted on a tiny Kimsufi box, 4G RAM, Atom 2800 CPU, 40G disk. This server runs ElasticSearch, Kibana, Logstash – and Rsyslog. Such RAM or CPU is kinda small for ELK. Still, I must say I'm pretty satisfied so far. The Kibana dashboard loads in a couple of seconds, which is way faster than my recollections of running single-node ELKs with previous versions, and definitely faster than running a 3-node ELK 4 cluster for a customer of mine.

Bacula

Bacula is a pretty complete backup solution, written in C++, based on three main daemons (file, storage, director), allowing you to schedule jobs and restore backups from a given point in time.

Bacula agents

The director service is in charge of orchestrating all operations (start a backup, start a restore, …) and stores its indexes in some RDBMS (MySQL, PostgreSQL or SQLite). This is where we define our retention policy and schedule our jobs. Bacula also provides a Console client – as well as various graphical alternatives – that connects to your director, allowing you to restore files, check job and process statuses, logs, …

The storage service is in charge of storing and serving backups. It is nowadays mostly used with regular drives, NFS or LUN devices, yet can still be configured to work with tapes.

slack-notify

The file service is to be installed on all the servers you intend to backup. Upon director request, it would be in charge of collecting and compressing (if configured to do so) the data you’re backing up, before sending it to your storage service.

Backup jobs may be of type Full (backup everything), Differential (capturing changes since last Full backup) or Incremental (capturing changes since last Full, Differential or Incremental backup), which should allow you to minimize disk space usage.
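
As an illustration, a classic bacula-dir.conf schedule mixing the three levels looks something like this (a sketch, close to the sample configuration shipped with Bacula):

Schedule {
  Name = "WeeklyCycle"
  Run = Full 1st sun at 23:05
  Run = Differential 2nd-5th sun at 23:05
  Run = Incremental mon-sat at 23:05
}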

Assuming you can externalize your older backup volumes to off-line disks from time to time, there is no limit to your retention.

bacula-dashboard

Job definitions may include running commands prior to or after your backup; you may also define commands to be run upon failed backups, … The director can be configured to limit how many simultaneous jobs should be running, and the latest versions of Bacula also include bandwidth limitations. All in all, this makes Bacula a very complete and scalable solution.

The Bacula wiki lists several graphical clients for supervising your backups. As of right now, I preferred sticking to a read-only dashboard, installed on my director instance, with bacula-web (PHP/gd).

CircleCI & Ansible

About a year ago, I started writing some Ansible roles in order to bootstrap the various services involved in hosting Peerio backend.

Running on EC2, using Ansible was an easy way to quickly get my new instances up and running. Moreover, Ansible comes with a few plugins that allow you to define Route53 records and health checks, or that could even be used to create EC2 instances or Auto Scaling Groups.

Using AWS's contextualization (USER_DATA), I'm able to pass variables to my booting instances, so that each new machine pulls its copy of our Ansible repository and generates a few yamls including "localization" variables (which Redis, Riak, … HTTP cache should we be using). Then, the boot process applies our main Ansible playbook, setting up whatever service was requested in our USER_DATA.

One could argue that, running on EC2, I could have prepared distinct templates for each class of service, which would imply instances that boot faster. The counterpart is that I would need to manage more templates, may forget to update some file somewhere, … Whereas right now, I only need to maintain a couple of AMIs (a Jessie image, for pretty much all our services, and a Trusty one, for Riak & RiakCS: despite paying for enterprise support, we're still unable to get our hands on some package for Debian Jessie or Ubuntu Xenial).

Another advantage of being able to bootstrap our backend using Ansible is that whenever we want to assist customers in setting up their own cluster, we already have reproducible recipes – with some relatively exhaustive documentation.

As with Puppet, Chef, … or any orchestration solution, Ansible is a very powerful tool for coding your way through automation. But whenever you write some module, or end up forking and deploying some external project to your setup, you are subject to potential failures, most of the time syntax-related, maybe some poorly-set permission, …
Deploying instances, you're either root or able to assume such privileges, restart a webserver, install packages, replace critical configurations on your systems, … Anything could happen, everything could break.

Another source of risk, in addition to self-authored pebkacs, is depending on external resources. Which could be apt packages, ruby gems, python packages, node modules, … Even though we can usually trust our system's official repositories to serve us a consistent set of packages, whenever we assume we will be able to install some external dependency at boot time, when we need the service to be available already, we are placing a bet, and time will eventually prove us wrong.

A good example of something that is likely to break in my setup these days is Netdata. This project changes a lot on GitHub.
At some point, their installation script started to source some profile file that was, on some unconfigured instances of mine, running some 'exec bash'. Moving the netdata module inclusion after my shell role got applied was enough to fix it.
More recently, their systemd unit tells me Netdata can not start due to some read-only file system. I finally had to replace their init, somehow.
Eventually, I may just patch my Ansible roles to pull Netdata from my own fork on GitHub.

Another example of something that's likely to break is NPMJS's registry. Either some package disappears due to internal quarrels, or their cache serves 502s depending on your AWS region, or some nodejs module gets updated and code that used to depend on it can not load the newer version, …

Preventing such errors – or spotting them as soon as possible – is crucial when your setup relies on disposable instances that should auto-configure themselves on boot.
Which brings us to our main topic: CircleCI. I've discussed this service in a former post; CircleCI is an easy way to keep an eye on your GitHub repositories.

CircleCI configuration running Ansible

CircleCI's default image is some Ubuntu Precise on steroids: pretty much everything that exists in official or popular repositories is installed, and you'd be able to pick your NodeJS, Python, Ruby, … versions. And lately, I've noticed that we're even able to use Docker.

As usual, a very short circle.yml is enough to get us started.
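
For the record, a stripped-down, hypothetical circle.yml could look something like this (the Makefile targets and playbook path are made up):

machine:
  services:
    - docker
dependencies:
  override:
    - pip install ansible
test:
  override:
    - make start-containers
    - ansible-playbook -i inventory/circleci site.yml
    - make run-checks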

Makefile rules spawning docker containers

Note that our docker containers are started from a Makefile, which makes it easier to use colons when mapping internal to external ports (colons being a Yaml delimiter).
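
Each of those rules essentially wraps a docker run call, along the lines of (container name, port mapping and image being made up):

docker run -d --name riak1 -p 2201:22 local/ansible-target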

We also add our own ~/.ssh/config, so our Ansible inventory can refer to hostnames instead of several occurrences of 127.0.0.1 with different client ports. Somehow, the latter seems to disturb Ansible.
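
The generated ~/.ssh/config then contains entries along these lines (hostnames and ports being examples):

Host riak1
    HostName 127.0.0.1
    Port 2201
    User root
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null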

CircleCI applying our Ansible cuisine

Our current configuration would start 9 instances, allowing us to apply our ansible roles against different configurations (riak node, redis node, loadbalancer, nodejs worker, …).

The whole process, from CircleCI instance start to our nine docker containers being configured, then running a few checks ensuring our services behave as expected, takes about 25 minutes. Bearing in mind CircleCI free plans usually won’t let you run your tests for more than 30 minutes.

Note that from time to time, installing packages from httpredir.debian.org ends up with connections getting closed, and some of our images may fail to build. I'm not sure yet how to address this.

I would also like to expose nagios & munin ports, having CircleCI running checks as our internal monitoring would have done. Still, we're already running a few wget-, redis- and s3cmd-based verifications, making sure our main services are behaving as expected.

 

Now I should say I’m usually skeptical regarding Docker relevance, as I’m mostly dealing with production setups. I’ve been testing Docker a few times on laptops, without being much impressed – sure, the filesystem layers thing is kind of nice. Then again, I’m more likely to trust Xen’s para-virtualization these days.

It's not stupid at all! Pinkerton, you explain the logic and I'll provide the background

Eventually, I ended up installing Docker on some KVM guest I use for building Debian & Ubuntu source packages.
Running Docker on CircleCI felt like an obvious thing to do – and you probably already have heard of someone using Docker in conjunction with Jenkins or whatever CI chain.
I wouldn't recommend nesting virtualization technologies, although this may be one of the rare cases where such practice won't potentially hurt your production, while helping you ensure your deployment cuisine actually does what you expect of it, whatever service you would be deploying.

Reweighting Ceph

Those who ran earlier versions of Ceph may be familiar with this process, as objects were not necessarily evenly distributed across the storage nodes of your cluster. Nowadays, and since somewhere around Firefly and Hammer, the default placement algorithm is way more effective on that matter.

Still, after a year running what started as a five-host/three-MON/18-OSD cluster, and grew up to eight hosts and 29 OSDs, two of them out – pending replacement – it finally happened: one of my disks' usage went up to over 90%:

Ceph disks usage, before reweight

/dev/sdd1 442G 202G 241G 46% /var/lib/ceph/osd/ceph-24
/dev/sdb1 3.7T 2.7T 982G 74% /var/lib/ceph/osd/ceph-22
/dev/sdc1 1.9T 1.7T 174G 91% /var/lib/ceph/osd/ceph-23
/dev/sde1 3.7T 2.7T 1002G 74% /var/lib/ceph/osd/ceph-7
/dev/sdc1 3.7T 2.5T 1.2T 69% /var/lib/ceph/osd/ceph-5
/dev/sdb1 472G 67M 472G 1% /var/lib/ceph/osd/ceph-4
/dev/sdd1 3.7T 73M 3.7T 1% /var/lib/ceph/osd/ceph-6
/dev/sdc1 1.9T 1.2T 718G 62% /var/lib/ceph/osd/ceph-20
/dev/sdb1 2.8T 2.0T 778G 73% /var/lib/ceph/osd/ceph-19
/dev/sda1 442G 183G 260G 42% /var/lib/ceph/osd/ceph-18
/dev/sdd1 2.8T 2.0T 749G 74% /var/lib/ceph/osd/ceph-21
/dev/sdc1 927G 493G 434G 54% /var/lib/ceph/osd/ceph-17
/dev/sda1 1.9T 1.2T 717G 62% /var/lib/ceph/osd/ceph-15
/dev/sdb1 927G 519G 408G 56% /var/lib/ceph/osd/ceph-16
/dev/sda1 461G 324G 137G 71% /var/lib/ceph/osd/ceph-8
/dev/sdb1 3.7T 2.8T 953G 75% /var/lib/ceph/osd/ceph-9
/dev/sdc1 3.7T 2.2T 1.5T 61% /var/lib/ceph/osd/ceph-10
/dev/sdd1 2.8T 1.7T 1.1T 62% /var/lib/ceph/osd/ceph-11
/dev/sdd1 3.7T 2.1T 1.6T 57% /var/lib/ceph/osd/ceph-3
/dev/sdb1 3.7T 1.9T 1.9T 51% /var/lib/ceph/osd/ceph-1
/dev/sda1 472G 306G 166G 65% /var/lib/ceph/osd/ceph-0
/dev/sdc1 3.7T 2.5T 1.2T 68% /var/lib/ceph/osd/ceph-2
/dev/sda1 461G 219G 242G 48% /var/lib/ceph/osd/ceph-25
/dev/sdb1 2.8T 1.7T 1.1T 61% /var/lib/ceph/osd/ceph-26
/dev/sdc1 3.7T 2.5T 1.2T 68% /var/lib/ceph/osd/ceph-27
/dev/sdd1 2.8T 1.5T 1.3T 55% /var/lib/ceph/osd/ceph-28
/dev/sdc1 927G 696G 231G 76% /var/lib/ceph/osd/ceph-14
/dev/sda1 1.9T 1.1T 798G 58% /var/lib/ceph/osd/ceph-12
/dev/sdb1 2.8T 2.0T 804G 72% /var/lib/ceph/osd/ceph-13

It was time to do something:

ceph:~# ceph osd reweight-by-utilization
SUCCESSFUL reweight-by-utilization: average 0.653007, overload 0.783608. reweighted: osd.23 [1.000000 -> 0.720123]
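
From there, you may follow the resulting data movement, until the cluster gets back to HEALTH_OK:

ceph:~# ceph -s
ceph:~# ceph health detail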

About two hours later, all fixed:

Ceph disks usage, after reweight

/dev/sdd1 442G 202G 241G 46% /var/lib/ceph/osd/ceph-24
/dev/sdb1 3.7T 2.8T 904G 76% /var/lib/ceph/osd/ceph-22
/dev/sdc1 1.9T 1.2T 638G 66% /var/lib/ceph/osd/ceph-23
/dev/sde1 3.7T 2.7T 976G 74% /var/lib/ceph/osd/ceph-7
/dev/sdc1 3.7T 2.5T 1.2T 69% /var/lib/ceph/osd/ceph-5
/dev/sdb1 472G 69M 472G 1% /var/lib/ceph/osd/ceph-4
/dev/sdd1 3.7T 75M 3.7T 1% /var/lib/ceph/osd/ceph-6
/dev/sdc1 1.9T 1.2T 666G 65% /var/lib/ceph/osd/ceph-20
/dev/sdb1 2.8T 2.0T 830G 71% /var/lib/ceph/osd/ceph-19
/dev/sda1 442G 183G 260G 42% /var/lib/ceph/osd/ceph-18
/dev/sdd1 2.8T 2.1T 696G 76% /var/lib/ceph/osd/ceph-21
/dev/sdc1 927G 518G 409G 56% /var/lib/ceph/osd/ceph-17
/dev/sda1 1.9T 1.2T 717G 62% /var/lib/ceph/osd/ceph-15
/dev/sdb1 927G 519G 408G 56% /var/lib/ceph/osd/ceph-16
/dev/sda1 461G 324G 137G 71% /var/lib/ceph/osd/ceph-8
/dev/sdb1 3.7T 2.8T 928G 76% /var/lib/ceph/osd/ceph-9
/dev/sdc1 3.7T 2.3T 1.4T 62% /var/lib/ceph/osd/ceph-10
/dev/sdd1 2.8T 1.7T 1.1T 62% /var/lib/ceph/osd/ceph-11
/dev/sdd1 3.7T 2.2T 1.5T 60% /var/lib/ceph/osd/ceph-3
/dev/sdb1 3.7T 1.9T 1.9T 51% /var/lib/ceph/osd/ceph-1
/dev/sda1 472G 306G 166G 65% /var/lib/ceph/osd/ceph-0
/dev/sdc1 3.7T 2.5T 1.2T 69% /var/lib/ceph/osd/ceph-2
/dev/sda1 461G 219G 242G 48% /var/lib/ceph/osd/ceph-25
/dev/sdb1 2.8T 1.7T 1.1T 62% /var/lib/ceph/osd/ceph-26
/dev/sdc1 3.7T 2.5T 1.2T 69% /var/lib/ceph/osd/ceph-27
/dev/sdd1 2.8T 1.6T 1.3T 56% /var/lib/ceph/osd/ceph-28
/dev/sdc1 927G 696G 231G 76% /var/lib/ceph/osd/ceph-14
/dev/sda1 1.9T 1.1T 798G 58% /var/lib/ceph/osd/ceph-12
/dev/sdb1 2.8T 2.0T 804G 72% /var/lib/ceph/osd/ceph-13

one year of ceph

I can't speak for IO-intensive cases, although as far as I've seen, the process of reweighting an OSD or repairing damaged placement groups blends pretty well with my usual workload.
Then again, Ceph provides ways to prioritize operations (such as backfill or recovery), allowing you to fine-tune your cluster, using commands such as:

# ceph tell osd.* injectargs '--osd-max-backfills 1'
# ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
# ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
# ceph tell osd.* injectargs '--osd-client-op-priority 63'
# ceph tell osd.* injectargs '--osd-recovery-max-active 1'

While on the subject, last screenshot to celebrate one year running Ceph and OpenNebula, illustrating how much crap I can hoard.

Woozweb, Uptime Robot, StatusCake

Today, a post on a service that closed today, and an investigation of potential replacements.

woozweb

In the last few years, I worked for Smile, a French open-source integrator. Among other things, Smile hosted Woozweb, a free service allowing you to define HTTP checks firing mail notifications.

Since leaving Smile, I had opened an account on Woozweb, and used it to look after public services I manage, checking them from outside my facilities.
Two days ago, I received a mail from some Smile manager, notifying me that Woozweb would be shut down on May 13th. As of writing these lines (around 4 am), the site is indeed closed.

Such sites may seem stupid, or incomplete. And sure, the service they provide is really limited.
Yet when your monitoring setup is located in the same VLAN, or some network connected to the service you are monitoring, you should keep in mind that your view of this service is not necessarily what your remote users actually experience. Hence, third-party services could stumble upon failures your own setup won't even suspect.

Now Woozweb wasn't perfect. Outdated web interface, outdated nagios probes (that failed to establish an SSL handshake against my TLSv1.1/TLSv1.2-only services), a 10 checks limitation, never got a response from their support, … But it did the job, allowed string matches, graphed response times, used to warn me when those reached a threshold, …

Uptime Robot dashboard

In the last couple days, I’ve been trying out alternatives to their service. There’s quite a lot of them, such as Pingdom. We’ll focus on free services, allowing https checks and triggering mail notifications.

The first I did test and could recommend is Uptime Robot.

Their interface is pretty nice and original. Service is free as long as you can stick to 50 checks with a 5 minutes interval, don’t need SMS notifications and can bear with 2 months of logs retention.

Uptime Robot check view

Defining checks is relatively easy, first results show up pretty quickly, and there was no trouble checking TLSv1.1/TLSv1.2-only services. I already received an alert for a 1-minute outage, which my Icinga setup also warned me about.

Compared to Woozweb, the features are slightly better, and the web interface is definitely more pleasant. Yet there is no data regarding where those queries were issued from, and their paid plan page doesn't mention geo-based checks – which is usually the kind of criterion we would look for, relying on such services.

StatusCake dashboard

Not being completely satisfied, I looked for another candidate and ended up trying out StatusCake.

Again, their site is pretty agreeable. Those used to CircleCI will recognize the navigation bar and support button. The free plan includes an unlimited amount of checks, as long as 5 minutes' granularity is enough, and involves running checks from random locations – whereas paid plans would allow you to pick from "60+" locations (according to their pricing page, while their site also tells about servers in over 30 countries and 100 locations around the world).

StatusCake check view

Defining checks is pretty easy. I liked the idea of being forced to define a contact group – which allows you to change the list of recipients alerts should be sent to, for several checks at once. Yet the feature that definitely convinced me was Slack integration.
So even if you do not want to pay for a plan including SMS notifications, you can receive notifications on your phone using Slack.
Everything's not perfect though: string matches are only allowed on paid plans, and this kind of feature is pretty basic, … On the bright side, status-code based filtering is nicely done.

The check view confirms your service is monitored from various locations. It is maybe a little less appealing than Uptime Robot, but the Slack integration beats everything.

Another big advantage StatusCake has is their “Public Reporting” capabilities. I’m not sure I would use it right now, as I already wrote a small shell-script based website, serving as public reporting dashboard, that I host outside of our production setup.

Bearing in mind these services won't exempt you from setting up some in-depth and exhaustive monitoring of your resources, they still are a nice addition. Sexy dashboards definitely help – I wouldn't have shown Woozweb screenshots, as their UI was amazingly despicable.
I'll probably keep using both Uptime Robot and StatusCake.

DMARC

Today we will discuss DMARC, a relatively new standard, considering the aging protocol it applies to.

DMARC stands for Domain-based Message Authentication, Reporting and Conformance. It relies on a couple of older standards: DKIM (discussed here) and SPF. Having properly configured both, setting up DMARC is just a formality.

DMARC can be used to audit your mail system, getting reports on who sends messages and where they are sent from. Although DMARC's main purpose is more likely to address phishing. Then again, as for DKIM or SPF, DMARC's effectiveness is strongly bound to its adoption.

DMARC relies on a TXT record, defining a policy for your domain (not to be confused with SPF), which ultimately instructs remote SMTP servers on how to treat messages not matching "sender alignment" (the From field of your envelope, the From field of your headers and your DKIM signature must match). Additionally, you can also request reports to be sent back to some third-party mailbox.

Our TXT record would be a semicolon-separated concatenation of "tags". The first two tags are mandatory, the others optional:

  • v is the (required) protocol version, usually v=DMARC1
  • p is the (required) policy for your domain, can be p=reject, p=none or p=quarantine
  • sp is the policy that should be applied for messages sent by sub-domains of the zone you are configuring. Defaults to your global (p) policy, although you may not want it so (p=quarantine;sp=reject)
  • rua is the address where aggregate DMARC reports should be sent to, and would look like rua=mailto:admins@example.com; note that if the report receiver's mailbox is not served within the domain you are defining this DMARC tag in, there is some additional DNS record to be defined on the receiver's end
  • ruf is the (optional) address where forensic DMARC reports should be sent to, works pretty much as rua does
  • rf is the format for failure reports. Defaults to AFRF (rf=afrf) which is defined by RFC5965, can be set to IODEF (rf=iodef) which is defined by RFC5070.
  • ri is the amount of seconds to wait between sending aggregate reports. Defaults to 86400 (ri=86400), sending a report per day.
  • fo instructs the receiving MTA on what kind of reports are expected from the sender's side. Defaults to 0 (fo=0), which triggers a report if both DKIM and SPF checks fail. Can be set to 1 (fo=1), sending reports if either DKIM or SPF checks fail. Can be set to d (fo=d) to send reports if the DKIM check failed, or s (fo=s) if the SPF check failed. May be set to a colon-separated concatenation of values (fo=d:s).
  • pct is the percentage of messages that should be processed according to your DMARC policy, can be used to gradually adopt DMARC. Defaults to 100 (pct=100)
  • adkim defines how to check for sender alignment. Defaults to relaxed (adkim=r), meaning that as long as your sender address’s domain matches the DKIM domain, or any of its sub-domain, your message will match. Can be set to strict (adkim=s), to ensure your sender’s domain is an exact match for your DKIM signature.
  • aspf defaults to relaxed (aspf=r) which allows you to use distinct sub-domain, setting the From field of your envelope and your headers. Can be set to strict (aspf=s) ensuring these match.

In most cases, the defaults would suit you. Better define a reporting address though, and make sure you'll receive alerts for both SPF and DKIM errors. A minimalist record would look like:

$ORIGIN example.com.
_dmarc TXT "v=DMARC1;p=quarantine;fo=1;rua=mailto:monitoring@example.com;ruf=mailto:admins@example.com"

Ultimately, you may want to reject unexpected sub-domain communications as well:

$ORIGIN example.com.
_dmarc TXT "v=DMARC1;p=quarantine;sp=reject;adkim=s;aspf=s;fo=1;rua=mailto:monitoring@example.com;ruf=mailto:admins@example.com"
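
Once published, you may check the record resolves as expected:

$ dig +short TXT _dmarc.example.com
"v=DMARC1;p=quarantine;sp=reject;adkim=s;aspf=s;fo=1;rua=mailto:monitoring@example.com;ruf=mailto:admins@example.com"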

Note that in their support, Google recommends slow adoption.

Slack

slack global view

As promised in a previous post, here is a long due introduction of Slack and its wonders.

On paper, Slack is a messaging solution built to ease up communication between team members. In addition to their main dashboard, as shown on the right, you will find Android and iOS applications, that would allow you to always stay in touch with your team.

As far as I've seen, you may create as many channels as you want. Users may pick the channels they want to subscribe to, and would be invited to join if their name is mentioned in channels they're not listening to already.

Notifications are based on your presence. If you allow desktop notifications in your browser, you may get notified when someone talks about you. Or you may define filters, if you want to track specific words, or a whole channel. You may configure different notification schemes on your browser and phone, … The whole thing is pretty well thought out, and well executed.

Slack Snippet

Slack Code Block

One can easily insert short code blocks into conversations. Or larger blocks may be uploaded as snippets, which can be expanded, reviewed separately or commented upon.

Another key feature of Slack is the considerable number of integrations. Among them: Google Hangouts, Skype, GitHub, CircleCI, or the "WebHook" (allowing you to post notifications to an URL).

slack codedeploy

There’s no definite use case for Slack. I’ve been using it with coworkers I never met with, and with a bunch of friends I went to school with. Whereas one of these friends also uses it with coworkers he daily sees.

slack notifications

Even if code is not your concern, or if you’re alone and want to reduce the amount of SPAM your scheduled jobs may generate: Slack can be interesting. Running backups, monitoring deployments, or even receiving notifications from some monitoring system, Slack is a powerful tool that may just change your life.
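
For instance, a backup script could push its status to a channel through an incoming WebHook, with something as simple as the following (the hook URL being a placeholder):

curl -X POST -H 'Content-type: application/json' \
    --data '{"text":"backup completed on host1"}' \
    https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX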

The only negative observation I could make is that, from time to time, I get disconnected from all my groups. I get reconnected automatically, usually in less than a minute. Although sometimes I'm really locked out, and can't connect to any of my groups via my Orange fiber. In the meantime, I can see that other services in AWS's us-east-1 are unreachable (such as a pair of HAproxy instances I have – which I may reach through CloudFlare, acting as a frontend for my proxies). No problem from my phone, switching the WiFi off, using Free. I guess there's some peering issue, which, sadly, occurs almost daily.