
2016 upgrade

Quick post sharing a few pictures I took this February, as I finally replaced my plastic shelf with a steel rack.

I took that opportunity to add a UPS, a third PDU, and my 7th & 8th Ceph hosts.
Thanks to Ceph & OpenNebula, I haven’t had to shut down any of my services.

Bonjour Serveurs (suite)

PatchDashboard

Having used pakiti in the past, and intending to set up some version tracking system to keep an eye on a setup I started managing recently, I was looking for an alternative that would work out of the box – which is my main grudge against pakiti2: I have a whole set of patches, scripts and database init dumps to go with it, which makes it pretty impractical to set up from scratch.

Investigating the subject, I stumbled upon several rpm-based solutions, such as PulpProject, SpaceWalk or its successor, Katello.

PatchDashboard frontpage

The setup I’m working with today, though, is built around Ubuntu and a couple of Debian hosts, which is how I ended up trying out PatchDashboard.

There are a couple of things I noticed during the install on Debian jessie: a detail regarding HTTP proxies, if you use one, and ultimately a certificate error on their website.
I also sent a few pull requests dealing with PHP errors, mostly cosmetic [1, 2, 3, 4, 5, 6], some more practical [7, 8, 9, 10], and adding Debian jessie to their database init script [11, 12].

Having dealt with that, everything works pretty much as expected.

PatchDashboard list vulnerable packages

The next thing I added to my setup is a patch listing installed nodejs packages and their known vulnerabilities (according to snyk).

I’m not sure these would be relevant enough to send as a pull request: it’s a whole new feature, and it only makes sense for those hosting nodejs-based apps.
Note that I did publish the corresponding changes, though only to my fork on github.

Datadog

About a couple of months ago, I started working for Peerio. I’ll probably write another post introducing them properly, as the product is pretty exciting: the client has been open sourced, mobile clients will be released soon, and the code is audited by Cure53.
Point being, since then I’ve been dealing with our hosting provider, pretty much auditing their work.

One problem we had was the monitoring setup. After a couple of weeks of asking, I finally got an explanation of how they were monitoring services, and why they systematically missed service outages.
Discovering that the “monitoring” provided was based on a PING check every 5 minutes and an SNMP disk check, I reminded our provider that our contract specifically calls for an HTTP check matching a string, and that so far we had to run that check ourselves.
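
That kind of check is nothing exotic; with the standard nagios plugins it is a one-liner – a sketch only, where the host, URI and expected string are made up:

# /usr/lib/nagios/plugins/check_http -H www.example.com -S -u /healthcheck -s "OK"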

After a few days of reflection, our provider contacted us back, proposing to register our instances with Datadog and set up monitoring from there.
My first reaction, discovering Datadog, was that it looks a little bit like graphite or collectd – to some extent, even munin. Although Datadog’s meta tags mention traps, monitoring, … why not. Still, note that the free version only allows registering 5 hosts.

I’ll skip the part where our provider failed to configure our HTTP check, and ended up graphing the response time of our haproxy regardless of the response code (200, 301, 502, never mind, haproxy answers), while the HTTP check on nginx failed (getting https://localhost/ with the certificate check option set to true). When replacing our production servers, I shut down the haproxy service on our former production instances, only to realize Datadog complained about failing to parse the plugin output, before disabling the plugin on the corresponding hosts altogether. Epic.
We are just about to get rid of them, for breach of contract. I already have my nagios (icinga2) based monitoring. But I’m still intrigued: a monitoring solution that can effectively distinguish failing instances from nodes being replaced in an AutoScale Group could be pretty interesting, hosting our service on AWS.

Datadog Infrastructure View

The first thing we could say about Datadog is that it’s “easy” – and based on python.
Easy to install, easy to use, easy to configure checks, easy to make sense out of the data. Pretty nice dashboards, compared to collectd or munin. I haven’t looked at their whole list of “integrations” yet (integration: a module you enable from your dashboard, provided that your agents forward the metrics this module uses to generate graphs or send alerts), though the ones activated so far are pretty much what we could hope for.
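
To give an idea of what an integration involves on the agent side, enabling the haproxy one mostly comes down to dropping a YAML file in the agent’s conf.d directory and restarting it – a hypothetical sketch, the stats URL and credentials being made up:

# cat /etc/dd-agent/conf.d/haproxy.yaml
init_config:
instances:
  - url: https://localhost/haproxy?stats
    username: stats_user
    password: stats_password
# service datadog-agent restart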

Let’s review a few of them, starting with the haproxy dashboard:

Datadog HAproxy dashboard

The only thing I’m not sure I understand yet is how to look up the number of current sessions, which is something relatively easy to set up using munin:

Munin HAproxy sessions rate

Still, Datadog metrics are relevant: the ‘2xx’ div is arguably more interesting than some ‘error rate’ graph. And above all, these dashboards aggregate data for your whole setup – unless configured otherwise. On that matter they are more comparable to collectd/graphite than to munin, where I have to use separate tabs.

Datadog Nginx dashboard

Datadog’s Nginx integration comes with two dashboards. We’ll only show one; the other is less verbose, with pretty much the same data.

Again, counting dropped connections instead of showing some “all-zeroes” graph is arguably more interesting, and definitely easier to read.

Datadog ElastiCache dashboard

We won’t show them all. In our case, the Riak integration is pretty interesting as well. A few weeks ago, we were still using the Redis integration – since then, we moved to ElastiCache to avoid having to manage our cluster ourselves.

Datadog EC2 dashboard

One of Datadog’s selling points is that they are meant to deal with AWS. There’s a bunch of other plugins looking for ELB metrics, DynamoDB, ECS, RDS, … We won’t use most of them, though their EC2 plugin is interesting for tracking overall resource usage.

In the end, I still haven’t answered my initial question: can Datadog be used to alert us upon service failures? I now have two nagios servers, both sending SMS alerts, so I’m not concerned about monitoring any more. Still, it might be interesting to give it another look later.

A downside I haven’t mentioned yet is that your instances running the Datadog agent need to connect to the internet. Which may lead you to set up HTTP proxies, possibly with some ELB or route53 internal records, to avoid adding a SPOF.
Or, as our provider did, attach public IPs to all their instances.

Although this problem might not affect you, as you may not need to install their agent at all. When using AWS, Datadog lists all your resources – depending on the ACLs you set in your IAM role. If you install their agent on your instances, the corresponding AWS metrics are hidden from the Datadog dashboard, superseded by those sent through your agent. Otherwise, you still get some insight into CPU consumption, disk and network usage, …
The main reason you could have to install an agent is to start collecting metrics for haproxy, riak or nginx – something AWS won’t know about.

Datadog is definitely modular. Pretty straightforward integration with AWS. Steep learning curve.
But paying for this and still having no alerts, while our hosting provider spent weeks setting it up – and no alerts on plugin output failures either: I’m not completely convinced yet.
Still, I can’t deny their product is interesting, and especially relevant to monitoring or supervision neophytes.

Edit: a few days later, leaving our provider, we decided to abandon Datadog as well.

Given that our former provider was contractually obliged to provide us with a monitoring solution, I was hoping for them to pay the bill – if any. They were the ones supposed to monitor our setup: whatever the solution, as long as they could make it work, I would have been fine with it.

Datadog’s cheapest paying plan starts at $15/instance/month. Considering we have about 40 instances, throwing $600/month at about a hundred graphs made no sense at all. Hosting our own nagios/munin on small instances is indubitably cheaper. Which makes me think the only thing you pay for here is ease of use.
And at some point our former provider probably realized this, as they contacted my manager when I started registering our DR setup with Datadog, inquiring about the necessity of looking after these instances.

As of today, we’re still missing everything that could justify the “monitoring” branding of Datadog, despite our provider attending a training on the matter, then spending a couple of weeks with nothing much to show for it.
The only metrics munin wasn’t already graphing have since been added to our riak memory probe.

Icinga2

Before talking about Icinga, we might need to introduce Nagios: a popular monitoring solution, released in 1999, which became a standard over the years. The core concept is pretty simple: alerting upon service failures, according to user-defined check commands, schedules, and notification and escalation policies.

You’ll find Nagios clients on pretty much all systems – even Windows. You may use SNMP to check devices or receive SNMP traps. You may use passive or active checks. You’ll probably stumble upon NRPE, the Nagios Remote Plugin Executor, and/or NSCA, the Nagios Service Check Acceptor.
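
As an illustration of how NRPE ties things together, a check is declared on the monitored host and referenced from the server – a minimal sketch, where “web01” and the thresholds are made up. On the client, /etc/nagios/nrpe.cfg would carry:

command[check_load]=/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6

while the nagios server would define:

define command {
    command_name check_nrpe
    command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
define service {
    use generic-service
    host_name web01
    service_description Load
    check_command check_nrpe!check_load
}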

You get the idea: Nagios is pretty modular. Writing plugins is relatively easy, monitoring your infrastructure is made relatively easy, and you may automate pretty much everything using puppet classes or ansible roles.
Still, around 2009, two forks appeared: Icinga and Shinken, released in March and December respectively.
It could be a coincidence, though it most likely was the result of several disputes opposing Nagios Core contributors and developers to the project maintainer, Ethan Galstad.
We won’t feed that troll; you may find more data about that over here.

Anyway, in early 2009, Icinga came out. Shinken followed quickly after.
Both products advertised that you could migrate your whole monitoring infrastructure from Nagios just by stopping the nagios daemon and starting the Shinken or Icinga one, using pretty much the same configuration.

Icinga1.11 monitoring intra.unetresgrossebite.com

Icinga1.11 monitoring unetresgrossebite.com public services, as well as some of my customers’ services

So far, so good.
I’ve used Shinken once, for a little less than a year. I wasn’t very satisfied – too many workers, inducing some overhead, not really relevant for a small infrastructure like mine. But that’s not really our topic. Meanwhile, I’ve been using Icinga for a couple of years and have installed tens of them, from version 1.7 to 1.9, working for Smile. All in all, I’m pretty satisfied. I’m still running a couple of Icinga instances monitoring my own services; it’s nice, and they have a web interface that does not require you to install a DBMS (for some reason, it’s included in the icinga-doc package on Debian).

Icinga2 & Icingaweb2 monitoring aws.peerio.com

A while ago, Icinga released Icinga2. It sounded promising, until I realized they had completely changed the way to configure the server, making my current templates and puppet classes useless.
Arguably, Icinga2 is not a nagios server anymore. Which is not necessarily a criticism.

This week, working on AWS for Peerio, I installed my first Icinga2 setup, writing ansible roles to automate NRPE server configuration, Icinga2 configuration and probe registration on my nagios servers, SMS alerts using Twilio, mail alerts using SendGrid, with HTTP proxies everywhere – no direct internet access on AWS private instances.
For once there’s no public link to my gitlab, though I expect the repository to be opened to the public pretty soon on github.

Icinga2 Alert Summary

Point being, I’ve finally seen what Icinga2 is worth, and it’s pretty satisfying.
Upgrading from your former nagios server will be relatively more complicated than migrating from Nagios to Icinga, or from Nagios to Shinken: you’ll probably want to start from scratch, on a fresh VM.
Icinga2’s configuration may look strange, especially after using the nagios syntax for years, but it makes sense, and can drastically reduce the amount of configuration you’ll need.
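
To illustrate, where nagios wanted one service definition per host, Icinga2 lets a single apply rule cover every matching host – a minimal sketch using the default templates, the host name and address being made up:

object Host "web01" {
  import "generic-host"
  address = "10.0.0.10"
  vars.os = "Linux"
}

apply Service "ssh" {
  import "generic-service"
  check_command = "ssh"
  assign where host.vars.os == "Linux"
}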

Icinga2 Timeline

I’m a little less convinced by their new web interface, Icingaweb2. In their defense, I had to download the code from github, the RC1 having been released a few months ago. No bugs so far, maybe some CSS stuff to fix (currently, the “Add New Pane or Dashboard” button is too small, I can only read half of the text), still pretty impressive for something pulled from github with no official packaged release.
I’m missing the status maps. The report interface looks sexier than in the previous version, though I can’t find how to generate detailed reports.

Looking at their “Timeline” view, I don’t remember seeing anything like that in their previous version. Why not.

The project is still young; I guess it could still be on some todo list. At least, the features present work perfectly, and their interface is pretty nice and dynamic, without being some over-bloated list of tables like I was used to with Shinken, Icinga or Thruk.

Munin I/O latencies

Dealing with large numbers of hosts and probes to graph, or sometimes just with disks offering poor performance, you may see munin lagging behind and your system struggling with I/O waits.

It’s not the first time I’ve had to deal with such problems. I’ve used tmpfs with hourly crontabs saving the rrd data, SSDs, RAID 0+1, … There are lots of ways to mitigate such troubles.
Today though, I decided to have another look on google, and found out about rrdcached.

Switching to rrdcached on an existing munin setup is pretty straightforward: install the package and feed the service the proper options to use with munin. Then you’ll need to update your munin.conf, setting rrdcached_socket to the socket created by rrdcached. That’s it.
From there, you could consider updating the munin jobs so that munin-html and munin-graph are not run every 5 minutes, which would drastically lower your I/O. Alternatively, you may mount your munin www directory on tmpfs: your rrd data remains either cached or written to disk, and on the next munin-html/munin-graph run your munin DocumentRoot is completely rewritten anyway.
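
On Debian, that pretty much boils down to the following – a sketch only, the socket path, journal directory and even the OPTS variable name may differ with your rrdcached package:

/etc/default/rrdcached:
OPTS="-s munin -m 660 -l unix:/var/run/rrdcached.sock -b /var/lib/munin -B -j /var/lib/munin/rrdcached-journal -F"

/etc/munin/munin.conf:
rrdcached_socket /var/run/rrdcached.sock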

Munin RRDCache – vmstat

Munin RRDCache – CPU usage

Munin RRDCache – load average

Munin RRDCache – processes priority

Munin RRDCache – diskstats

Munin RRDCache – disk latency

Scaling out with Ceph

A few months ago, I installed a Ceph cluster hosting disk images for my OpenNebula cloud.
This cluster is based on 5 ProLiant N54Ls, each with a 60G SSD for the main filesystems, some with one 512G SSD OSD, and all with 3 disk drives from 1 to 4T. SSDs are grouped in one pool, HDDs in another.

OpenNebula Datastores View, having 5 Ceph OSD hosts

Now that most of my services are in this cluster, I’m left with very little free space.
The good news is there is no significant impact on performance, as I was experiencing with ZFS.
The bad news is that I urgently need to add some storage space.

Last Sunday, I ordered my sixth N54L on eBay (from my “official” refurbisher, BargainHardware) and a few disks.
After receiving everything, I installed the latest Ubuntu LTS (Trusty) from my PXE, installed puppet, prepared everything, … In about an hour, I was ready to add my disks.

I use a custom crush map, with “osd crush update on start” set to false in my ceph.conf.
This was the first time I tested this, and I was pleased to see I can run ceph-deploy to prepare my OSDs without them being automatically added to the default CRUSH root – especially convenient when running two pools.
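
The relevant bit of ceph.conf being:

[osd]
osd crush update on start = false
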
From my ceph-deploy host (some Xen PV I use to host ceph-dash and the munin and nagios probes related to ceph, with no OSD nor MON actually running), I ran the following:

# ceph-deploy install erebe
# ceph-deploy disk list erebe
# ceph-deploy disk zap erebe:sda
# ceph-deploy disk zap erebe:sdb
# ceph-deploy disk zap erebe:sdc
# ceph-deploy disk zap erebe:sdd
# ceph-deploy osd prepare erebe:sda
# ceph-deploy osd prepare erebe:sdb
# ceph-deploy osd prepare erebe:sdc
# ceph-deploy osd prepare erebe:sdd

At that point, the 4 new OSDs were up and running according to ceph status, though no data was assigned to them yet.
The next step was to update my crush map, including these new OSDs in the proper root.

# ceph osd getcrushmap -o compiled_crush
# crushtool -d compiled_crush -o plain_crush
# vi plain_crush
# crushtool -c plain_crush -o new-crush
# ceph osd setcrushmap -i new-crush

For the record, the content of my current crush map is the following:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host nyx-hdd {
id -2 # do not change unnecessarily
# weight 10.890
alg straw
hash 0 # rjenkins1
item osd.1 weight 3.630
item osd.2 weight 3.630
item osd.3 weight 3.630
}
host eos-hdd {
id -3 # do not change unnecessarily
# weight 11.430
alg straw
hash 0 # rjenkins1
item osd.5 weight 3.630
item osd.6 weight 3.900
item osd.7 weight 3.900
}
host hemara-hdd {
id -4 # do not change unnecessarily
# weight 9.980
alg straw
hash 0 # rjenkins1
item osd.9 weight 3.630
item osd.10 weight 3.630
item osd.11 weight 2.720
}
host selene-hdd {
id -5 # do not change unnecessarily
# weight 5.430
alg straw
hash 0 # rjenkins1
item osd.12 weight 1.810
item osd.13 weight 2.720
item osd.14 weight 0.900
}
host helios-hdd {
id -6 # do not change unnecessarily
# weight 3.050
alg straw
hash 0 # rjenkins1
item osd.15 weight 1.600
item osd.16 weight 0.700
item osd.17 weight 0.750
}
host erebe-hdd {
id -7 # do not change unnecessarily
# weight 7.250
alg straw
hash 0 # rjenkins1
item osd.19 weight 2.720
item osd.20 weight 1.810
item osd.21 weight 2.720
}
root hdd {
id -1 # do not change unnecessarily
# weight 40.780
alg straw
hash 0 # rjenkins1
item nyx-hdd weight 10.890
item eos-hdd weight 11.430
item hemara-hdd weight 9.980
item selene-hdd weight 5.430
item helios-hdd weight 3.050
item erebe-hdd weight 7.250
}
host nyx-ssd {
id -42 # do not change unnecessarily
# weight 0.460
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.460
}
host eos-ssd {
id -43 # do not change unnecessarily
# weight 0.460
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.460
}
host hemara-ssd {
id -44 # do not change unnecessarily
# weight 0.450
alg straw
hash 0 # rjenkins1
item osd.8 weight 0.450
}
host erebe-ssd {
id -45 # do not change unnecessarily
# weight 0.450
alg straw
hash 0 # rjenkins1
item osd.18 weight 0.450
}
root ssd {
id -41 # do not change unnecessarily
# weight 3.000
alg straw
hash 0 # rjenkins1
item nyx-ssd weight 1.000
item eos-ssd weight 1.000
item hemara-ssd weight 1.000
item erebe-ssd weight 1.000
}
# rules
rule hdd {
ruleset 0
type replicated
min_size 1
max_size 10
step take hdd
step chooseleaf firstn 0 type host
step emit
}
rule ssd {
ruleset 1
type replicated
min_size 1
max_size 10
step take ssd
step chooseleaf firstn 0 type host
step emit
}
# end crush map

Applying the new crush map kicked off a 20-hour process, moving placement groups around.

OpenNebula Datastores View, having 6 Ceph OSD hosts

# ceph-diskspace
/dev/sdc1 3.7T 2.0T 1.7T 55% /var/lib/ceph/osd/ceph-6
/dev/sda1 472G 330G 143G 70% /var/lib/ceph/osd/ceph-4
/dev/sdb1 3.7T 2.4T 1.4T 64% /var/lib/ceph/osd/ceph-5
/dev/sdd1 3.7T 2.4T 1.4T 65% /var/lib/ceph/osd/ceph-7
/dev/sda1 442G 329G 114G 75% /var/lib/ceph/osd/ceph-18
/dev/sdb1 2.8T 2.1T 668G 77% /var/lib/ceph/osd/ceph-19
/dev/sdc1 1.9T 1.3T 593G 69% /var/lib/ceph/osd/ceph-20
/dev/sdd1 2.8T 2.0T 808G 72% /var/lib/ceph/osd/ceph-21
/dev/sdc1 927G 562G 365G 61% /var/lib/ceph/osd/ceph-17
/dev/sdb1 927G 564G 363G 61% /var/lib/ceph/osd/ceph-16
/dev/sda1 1.9T 1.2T 630G 67% /var/lib/ceph/osd/ceph-15
/dev/sdb1 3.7T 2.8T 935G 75% /var/lib/ceph/osd/ceph-9
/dev/sdd1 2.8T 1.4T 1.4T 50% /var/lib/ceph/osd/ceph-11
/dev/sda1 461G 274G 187G 60% /var/lib/ceph/osd/ceph-8
/dev/sdc1 3.7T 2.2T 1.5T 60% /var/lib/ceph/osd/ceph-10
/dev/sdc1 3.7T 1.9T 1.8T 52% /var/lib/ceph/osd/ceph-1
/dev/sde1 3.7T 2.0T 1.7T 54% /var/lib/ceph/osd/ceph-3
/dev/sdd1 3.7T 2.3T 1.5T 62% /var/lib/ceph/osd/ceph-2
/dev/sdb1 472G 308G 165G 66% /var/lib/ceph/osd/ceph-0
/dev/sdb1 1.9T 1.2T 673G 64% /var/lib/ceph/osd/ceph-12
/dev/sdd1 927G 580G 347G 63% /var/lib/ceph/osd/ceph-14
/dev/sdc1 2.8T 2.0T 813G 71% /var/lib/ceph/osd/ceph-13

I’m really satisfied with the way Ceph is constantly improving their product.
Having discussed with several interviewers over the last few weeks, I still have to explain why Ceph RBD is not to be confused with CephFS, and that while the latter may not be production ready, RADOS storage is just the thing you could be looking for to distribute your storage.

Why UneTresGrosseBite.com?

Lately, I’ve been asked a lot about my domain name – during job interviews, mostly.
And I can understand why it might seem shocking at first sight.

About 5 years ago, I was living with a roommate, who registered the domain unegrossebite.com.
It was kind of funny to have a custom PTR record.
A colleague of mine registered whatabigdick.com.
When my roommate left, I had to subscribe to my own ADSL, and ended up registering my own domain as well. And went further, with unetresgrossebite.com.

Over time, there’s one observation I could make: these kinds of domains are more likely to be targeted by botnets, by people scanning your sites with no respect for your robots.txt, …

A perfect example illustrating this would be my DNS services.
It all started with a single dedicated server, hosted by Leaseweb, where I hosted several services. One of these being bind.
It was my first DNS server, and I made a lot of mistakes, such as allowing recursion or leaving permissive ACLs. It went very bad, very quickly. I was receiving lots of ANY requests, generating from 10 to 50Mb/s of traffic targeted at a few IPs.
Fixing the bind configuration and adding hexstring-based rules to my firewall helped, though the attacks kept going for months.
Over time, I subscribed to another dedicated server with Illiad, and noticed that both Illiad and Leaseweb provide free zone-caching services: having a server, you may define several of your domains in their manager and ask for their replication.
Basically, using split-horizon (sketched below), I am able to serve internal clients with my own DNS servers, and to push a public view of my zones to the Illiad and Leaseweb DNS servers. The public view is set up so that Illiad and Leaseweb are both authoritative name servers, serving my zones to unidentified clients. I configured a firewall on my public DNS servers to prevent unknown clients from using them. Now Illiad and Leaseweb are both dealing with my attacks, and I don’t have to bother identifying legitimate queries any more.
And it makes perfect sense. Even if one could want to host their own domains, protecting yourself from DNS amplification attacks requires at least reverse-path checking, Arbor, Tilera, … Hosting providers, with their own physical network, hardware and peering, are the most likely to block these attacks.
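
A minimal sketch of such a split-horizon configuration in bind, with made-up internal networks and documentation addresses standing in for the Illiad and Leaseweb secondaries:

acl "internal" { 127.0.0.0/8; 10.42.0.0/16; };

view "internal" {
    match-clients { internal; };
    recursion yes;
    zone "unetresgrossebite.com" {
        type master;
        file "/etc/bind/internal/db.unetresgrossebite.com";
    };
};

view "public" {
    match-clients { any; };
    recursion no;
    zone "unetresgrossebite.com" {
        type master;
        file "/etc/bind/public/db.unetresgrossebite.com";
        allow-transfer { 203.0.113.10; 198.51.100.20; };
    };
};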

In general, fail2ban is a good candidate for mitigating attacks from the server side.
As long as your application generates logs, you may parse them to identify and lock out suspicious clients.
Hosting SSH servers, asterisk, unbound/bind or even wordpress, you have a lot to gain from fail2ban filters.
Lately, I’ve even used fail2ban to feed csf/lfd, instead of having it set iptables rules itself.
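
A minimal jail.local sketch, with made-up thresholds – the sshd filter ships with fail2ban, and the same pattern applies to the asterisk, named or wordpress filters:

[sshd]
enabled  = true
port     = ssh
filter   = sshd
logpath  = /var/log/auth.log
maxretry = 5
findtime = 600
bantime  = 3600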

Back to our topic: why unetresgrossebite.com?
Setting aside the obvious compensation jokes, dealing with this kind of domain is pretty informative.
I could sell out and register some respectable domain name. Though sticking with this one keeps me busy and forces me to implement best practices.

OpenNebula

OpenNebula 4.10 dashboard, running on a 4-compute, 5-store cluster

This could have been the first article of this blog. OpenNebula is a modular cloud-oriented solution that can be compared to OpenStack, driving heterogeneous infrastructures and orchestrating storage, network and hypervisor configuration.

In the last 7 months, I’ve been using OpenNebula with Ceph to virtualize my main services, such as my mail server (200GB storage), my nntp index (200GB mysql DB, 300GB data), my wiki, plex, sabnzbd, … pretty much everything, except my DHCP, DNS, web cache and LDAP services.
A while before leaving Smile, I also used OpenNebula and Ceph to store our system logs, involving Elasticsearch, Kibana and rsyslog-om-elasticsearch (that’s right: no logstash).

This week, a customer of mine was asking for a solution that would allow him to host several cPanel VPS, knowing he already had a site dealing with customer accounts and billing. After he turned down my scripts deploying Xen or KVM virtual machines, as well as some Proxmox-based setup, we ended up talking about OpenNebula.

OpenNebula 4.12 dashboard, running on a single SoYouStart host

The service is based on a single SoYouStart dedicated host: 32GB RAM, 2x2T disks and a few public IPs.
Sadly, OpenNebula is still not available for Debian Jessie. Trying to install the Wheezy packages, I ran into some dependency issues regarding libxmlrpc. In the end, I reinstalled the server with the latest Wheezy.

From there, installing Sunstone and the OpenNebula host utilities, registering localhost among my compute nodes and my LVM among my datastores took a couple of hours.
Then I started installing CentOS 7 using virt-install and VNC (sketched below), building cpanel, installing csf, adding my scripts configuring the network according to the nebula context media, … the cloud was operational five hours after Wheezy was installed.
I finished by writing some training material (15 pages, mostly screenshots) explaining the few actions required to create a VM for a new customer, suspend his account, back up his disks, and eventually purge his resources.
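
For the record, the kind of commands involved – a rough sketch, where names, sizes and paths are made up, and the onehost driver flags are the ones I remember for a 4.x KVM setup (double-check with onehost create --help):

# onehost create localhost -i kvm -v kvm -n dummy
# virt-install --name cpanel-template --ram 4096 --vcpus 2 \
    --disk path=/dev/vg0/cpanel-template \
    --cdrom /srv/iso/CentOS-7-minimal.iso \
    --network bridge=br0 --graphics vnc,listen=0.0.0.0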

OpenNebula VNC view

At first glance, using OpenNebula to drive virtualization services on a single host could seem overkill, to say the least.
Though with a customer who doesn’t want to know what a shell looks like, and for whom even Proxmox is not an acceptable answer, I feel confident OpenNebula can be way more useful than we give it credit for.

Crawlers

Hosting public sites, you’ve dealt with them already.
Using your favorite search engine, you’re indirectly subject to their work as well.
Crawlers are bots querying, and sometimes mapping, your site to feed some search engine’s database.

When I started looking at the subject, a few years back, you only had to know about /robots.txt, to potentially prevent your site from being indexed, or at least restrict such accesses to the relevant contents of your site.
More recently, we’ve seen the introduction of XML files such as sitemaps, allowing you to efficiently serve search engines a map of your site.
This year, Google reviewed its “rules” to also prioritize responsive and platform-optimized sites. As such, they are now recommending allowing crawling of JavaScript and CSS files, warning – threatening – that preventing these accesses could result in your ranking being lowered.
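
In practice, that translates to a robots.txt along these lines, keeping the private bits out while explicitly allowing assets – paths are made up, and the Allow and wildcard syntax is a Google extension rather than part of the original standard:

User-agent: *
Disallow: /admin/
Allow: /*.css$
Allow: /*.js$
Sitemap: https://www.example.com/sitemap.xml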

At this point, indexing scripts and style sheets, you might say – and I’m surprised not to find the remark in the comments – that Google actually indexes your site’s vulnerabilities, creating not only a directory of the known internet, but a complete map with everyone’s hidden pipes, that could some day be used to breach your site – if it isn’t already.

Even if Google is the major actor on that matter, you have probably dealt with Yahoo, Yandex.ru, MJ12, Voltron, … whose practices are similar. Over the last years, checking your web server logs, you might have noticed a significant increase in the proportion of bot queries over human visits – partly due to the multiplication of search engines, though I suspect mostly thanks to botnets.
Identifying these crawlers can be done by checking the User-Agent sent with every HTTP request. On small-traffic sites, crawlers may very well be your only clients.

Assuming your sites are subject to DDoS attacks and scans for software vulnerabilities (the top targeted solutions being wordpress, phpbb and phpmyadmin), you should know attackers will eventually masquerade their user-agent – most likely branding themselves as Googlebot.

To guarantee that a “googlebot”-branded query actually comes from some Google server, you just need to check the pointer record associated with the client’s IP, then confirm it with a forward lookup (see the sketch below). A way to do so in Apache (2.4) could be to use something like this (PoC to complete/experiment).
Still, maybe it is wiser to just drop all Google requests as well. Without encouraging Tor usage, it’s probably time to switch to DuckDuckGo?
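
Outside of Apache, the same verification is easy to script: resolve the PTR record, check it belongs to googlebot.com or google.com, then confirm the forward lookup matches the original IP – a hypothetical helper relying only on dig:

#!/bin/sh
ip="$1"
ptr=$(dig +short -x "$ip" | sed 's/\.$//')
case "$ptr" in
  *.googlebot.com|*.google.com)
    dig +short "$ptr" | grep -qx "$ip" && echo "genuine Googlebot" && exit 0
    ;;
esac
echo "spoofed"
exit 1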

Otherwise, another good way to deny these connections is described here; I may try to add something like it to my puppet classes.

Don’t trust the Tahr

Beware that since the latest Ubuntu kernel upgrades (14.04.02), you may lose network connectivity when rebooting your servers!

I had the problem four days ago, rebooting one of my OpenNebula hosts. Still unreachable after 5 minutes, I logged in physically, only to see all my “p1pX” and “p4pX” interfaces had disappeared.
Checking the udev rules, there is now a file fixing the interface mapping. On a server I have not rebooted yet, this file doesn’t exist.

The story could have ended here. But with Ubuntu, updates are a daily struggle: today, one of my ceph OSD hosts (hosting 4 disks) spontaneously stopped working.
Meaning: the host was still there, and I was able to open a shell using SSH. Checking processes, all the ceph osd daemons were stopped. Starting them showed no error, while the processes remained absent. Checking dmesg, I had several lines of SSL-related segfaults.
As expected, rebooting fixed everything, from ceph to my network interface names.
It’s on days like these that I most enjoy freelancing: I can address my system and network outages in time, way before it’s too late.

While I was starting to accept Ubuntu as safe enough to run production services, renaming interfaces on a production system is unacceptable. I’m curious to know how Canonical dealt with that while providing BootStack and OpenStack-based services.

Note there is still a way to prevent your interfaces from being renamed:

# ln -s /dev/null /etc/udev/rules.d/75-persistent-net-generator.rules