April | 2015 | BOFH meditations

DCF77

April 27, 2015


This is no WMD, but a German longwave radio station broadcasting the current date and time, for time synchronization purposes.

I first stumbled upon DCF77 while working for CamTrace.
Some of our customers insisted on not connecting their video-surveillance network to the Internet.

The first solution coming to mind is to set up some NTP server, which would work, most of the time.
However, you should know NTP uses a stratum, keeping track of how high you are in the food chain. Having a low value is critical to allow other devices to synchronize their clocks against yours.
Our customers may restart their server without fully understanding those concepts. But a client configuring his camera right after booting his server is not wrong. Thus, a good way to ensure your stratum is low enough for devices to always accept what your NTP server answers is to get your time from a reliable source, such as some German-Qualität radio clock. Note this solution is only viable within roughly 2,000 km of Frankfurt, though.

I’m mostly familiar with Axis network cameras, but I assume the problem presents itself on anything not embedding some RTC (like a Raspberry Pi in its default configuration, the module being sold separately): setting the date once while installing the device is a mistake, knowing its clock will drift relatively quickly (in the worst cases, we’re talking about a few minutes per day, quickly adding up to hours, …).

Thus, I’ve combined a Raspberry Pi with a MouseCLOCK USB. NTP allows interfacing with such a clock. A minimalistic /etc/ntp.conf would look like this:

server 127.127.8.0 mode 19
fudge 127.127.8.0 stratum 0
fudge 127.127.8.0 time1 0.042

Using 127.127.8.0 as a source tells ntpd to look for a /dev/refclock-0 device. Assuming FreeBSD, your clock device would show up as /dev/cuaU$x; on Linux, as /dev/ttyUSB$x. You may add the following lines to your ntp startup script:

if /usr/local/sbin/lsusb | grep mouseCLOCK >/dev/null; then
    if test -c /dev/cuaU0; then
        test -h /dev/refclock-0 || ln -s /dev/cuaU0 /dev/refclock-0
    fi
fi

A cleaner way to create this link could be using devfs. In FreeBSD, add the line link cuaU0 refclock-0 to /etc/devfs.conf.
On Debian, use udev, creating a file in /etc/udev/rules.d matching your device, like KERNEL=="ttyUSB*", ATTRS{idProduct}=="e88a", SYMLINK+="refclock-%m".

After a while, querying ntp should confirm proper synchronization:

root@cronos:/usr/ports # ntpq -crv -pn
assID=0 status=0264 leap_none, sync_lf_clock, 6 events, event_peer/strat_chg,
version="ntpd 4.2.4p5-a (1)", processor="arm",
system="FreeBSD/10.1-STABLE", leap=00, stratum=1, precision=-19,
rootdelay=0.000, rootdispersion=1026.460, peer=25445, refid=DCFa,
reftime=d8e8e24f.1ec18939  Mon, Apr 27 2015 18:29:03.120, poll=6,
clock=d8e8e25d.bb054ad7  Mon, Apr 27 2015 18:29:17.730, state=4,
offset=-77.794, frequency=472.839, jitter=10.929, noise=27.504,
stability=0.197, tai=0
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*127.127.8.0     .DCFa.           0 l   14   64    1    0.000  -77.794  10.929
root@cronos:/usr/ports #
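If you would rather check this from a monitoring script than by eye, here is a minimal Python sketch parsing one peer row of ntpq -pn. The function names are mine, not part of ntp, and it assumes the usual ten columns with an optional leading tally character, the asterisk marking the peer currently selected for synchronization.

```python
# Tally characters ntpq may print in front of a peer row.
TALLY_CODES = " x.-+#*o"
FIELDS = ["remote", "refid", "st", "t", "when", "poll",
          "reach", "delay", "offset", "jitter"]


def parse_peer_line(line):
    """Parse one peer row of `ntpq -pn` output into a dict."""
    if line and line[0] in TALLY_CODES:
        tally, rest = line[0], line[1:]
    else:
        tally, rest = " ", line
    values = rest.split()
    if len(values) != len(FIELDS):
        raise ValueError("unexpected peer line: %r" % line)
    peer = dict(zip(FIELDS, values))
    peer["tally"] = tally
    peer["st"] = int(peer["st"])
    peer["offset"] = float(peer["offset"])  # milliseconds
    return peer


def is_system_peer(peer):
    """True when ntpd selected this peer as its synchronization source."""
    return peer["tally"] == "*"
```

Feeding it the last row of the output above should report the DCF77 refclock as the system peer, with an offset of about -78 ms.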

Jessie, out

April 26, 2015


Samuel MARTIN MORO

Yesterday, Debian announced the release of Jessie as its new stable release, after 24 months of development and patches.

First and foremost, Jessie comes with systemd – sysvinit being still available.
Let’s review the fundamentals, before unleashing the troll:

Kernel 3.16
LibVirt 1.2.9
Xen 4.4
Vzctl 4.8
Asterisk 11.13
Puppet 3.7
Apache 2.4.10
Bind 9.9
GCC 4.9
PHP 5.6
PostgreSQL 9.4
MySQL 5.5.43
Samba 4.1

Now let’s get back to systemd, and to Lennart Poettering’s “gifts” to the community in general.

Systemd is controversial for many reasons. At first, it was about re-inventing the wheel.
Systemd is supposed to replace sysvinit scripts, which have been powering most Unix distributions for 20 years or so.
However, systemd relies on Linux kernel specifics, and thus is not portable to BSD. On that subject, Poettering tells us that even if his solution were portable, BSD users wouldn’t switch to it, so there’s no problem in creating a sysvinit alternative targeted at Linux systems (which feels like it could be true, but also sounds like racism), and anyway, porting it would require spending time investigating alternatives (see #13 and #15 from Lennart’s response to his detractors).
The thing is, systemd does not only replace sysvinit, but also manages file system mounting, power and device management, disk encryption, network configuration, session management, GPT partition discovery, … Some may argue that systemd goes against the Unix philosophy of “doing one little thing, and doing it well”.

From a practical point of view, systemd stores its logs using journald, in binary files. Logs are thus corruptible, and can’t be manipulated as easily as traditional plain-text files.
Being dependent on the Linux kernel, using a particular version of systemd implies running a compatible version of the Linux kernel.
Systemd differs from sysvinit by being non-deterministic and unpredictable. Its process is hidden within init, risking bringing down the whole system by crashing. Meanwhile, plenty of non-kernel system upgrades would now require you to restart your system anyway: tastes like Windows, feels like Windows, …

So why try to fix something that is working? Why do the main distributions all seem to switch to systemd, when there are other options such as OpenRC, or even sysvinit?

By nature, integrating and re-designing Unix core components such as udev or dbus unilaterally involves both having all components systematically installed (when a server does not necessarily need them), and their interfaces being progressively and obscurely rewritten. At some point, GNOME started using logind, from systemd, to authenticate users. Setting up GNOME without systemd became acrobatic. More and more distributions are forced to integrate systemd.

Systemd could be described as a system daemon, sometimes referred to as a “basic userspace building block to make an OS from”, engulfing features formerly attributed to a syslog daemon, your network or device managers, … Systemd’s spread is symbolic, showing a radical shift in thinking within the Linux community: desktop-oriented, choice-limiting, isolationist. A collateral victim being Debian/kFreeBSD, guilty of not using the holy Linux kernel.

Finishing on a parenthesis regarding the kFreeBSD architecture of Debian: despite being encouraged to build it myself, I’m a bit puzzled.
On one hand, you have the installation-guide-kfreebsd-amd64 package, containing a few HTML files with the well-known documentation, and broken links to non-existing kernels and initrds. On the other hand, the kfreebsd-source-10.1 package, whose Makefile doesn’t seem to work at all. Hooray for Poettering.

BlueMind

April 24, 2015


Samuel MARTIN MORO

BlueMind is a mail solution, based on a few popular pieces of software such as Cyrus, Postfix, ElasticSearch or nginx, glued together using several Java services.

It is still under active development: look closely after each update.
A good practice may be to list all ports in a LISTENING state, and the process each one is related to. It has helped me several times, identifying which process is down without restarting the whole enchilada.

BlueMind SMTP access is handled by Postfix (*:25). Incoming mails are passed to a Java process (bm-lmtp, 127.0.0.1:2400), which eventually delivers them to Cyrus (*:24).
BlueMind IMAP is served by nginx (*:143), proxying requests to Cyrus (*:1143).
BlueMind Webmail is served by nginx as well. You may recognize a fork of Roundcube. PHP is served by php-fpm; former versions involved Apache’s mod_php.
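To spot which link of that chain is down, a small probe can be enough. The following Python sketch is only an illustration: the port map is copied from the description above, the service labels and function names are mine, and it assumes everything is reachable on 127.0.0.1.

```python
import socket

# Port map taken from the description above; adjust to your own setup.
BLUEMIND_PORTS = {
    "postfix (SMTP)": ("127.0.0.1", 25),
    "bm-lmtp": ("127.0.0.1", 2400),
    "cyrus (LMTP)": ("127.0.0.1", 24),
    "nginx (IMAP)": ("127.0.0.1", 143),
    "cyrus (IMAP)": ("127.0.0.1", 1143),
}


def is_listening(host, port, timeout=1.0):
    """Return True when a TCP connection to host:port is accepted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def down_services(ports=BLUEMIND_PORTS):
    """List the names of the services refusing connections."""
    return [name for name, (host, port) in ports.items()
            if not is_listening(host, port)]
```

Calling down_services() on the BlueMind host returns the labels of whatever stopped listening, which tells you which single service to restart.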

On the bright side, BlueMind provides a fully functional (and Exchange-compliant) mail solution. My previous tests (2.x branch) suffered from a poor Cyrus configuration (it could process 4 mails at a time, too slow to handle the tens of mailing lists I’m subscribed to). On my current setup (3.14), I had no need to modify anything.
However, BlueMind Community Edition does not support application updates: it is mandatory to install a new server, and then migrate your mailboxes. You may otherwise become a partner (reseller, integrator, …) or, most likely, a paying customer, to be provided with BlueMind updates. In which case, you would most likely be assisted by a developer from BlueMind during the update process and your first hours running the new version.

There’s not much to tell about the BlueMind administration console. The most interesting parts are system-management related, especially mailbox backups, and archiving mails on an age (and/or location) basis.
On that subject: BlueMind backups are stored in /var/backups/bluemind. Ideally, try to keep this folder on a different physical medium.

For those familiar with Cyrus, you may use the cyradm command.
Retrieve your admin password:

deepthroat:/etc# grep proxy /imapd.conf
proxy_authname: admin0
proxy_password: some-alpha-numeric-chain

You may then connect:

deepthroat:/etc# cyradm -u admin0 localhost
[enter the proxy_password value]

Sometimes, shit happens, and you may see IOERROR messages, followed by header CRC mismatch, invalid cache record, mailbox format corruption detected, … In which case, you would need to run something like:

deepthroat:/etc# /usr/lib/cyrus/bin/reconstruct user/admin/nzbindex@unetresgrossebite.com

Assuming you are using several Postfix instances relaying mails to your BlueMind server, try to replicate your virtual_aliases, virtual_domains and virtual_mailbox maps everywhere, instead of letting everything pass and delegating actual consistency checks to your BlueMind instance.

Finally, note there is one critical mail service that BlueMind does not provide: spam (and/or virus) checking.
Assuming you can deal with SpamAssassin and ClamAV, you may want to alter Postfix routing prior to sending your mail to bm-lmtp. Assuming you want to train your SpamAssassin with user input, keep it running close to your Cyrus server.

Ceph

April 19, 2015


Samuel MARTIN MORO

A long story binds me to Ceph (partially documented on my wiki).
Mostly at Smile, but also at home, I tested versions from 0.47 to what is now 0.87.
Around Christmas 2014, I bought myself 5 ProLiant MicroServers, trying to allocate dedicated physical resources to manage virtual file systems.
Each node contains a 60GB SSD, used for the root filesystem as well as OSD journals, one 512GB SSD for “fast” filesystems, and 3 disks from 1 to 4TB, filling up the remaining slots.

Almost a month later, while I was at work, one of my nodes stopped answering. Immediately, the cluster tried to recover degraded placement groups, by duplicating the remaining replicas to some free space.
At some point, the recovery process filled up a disk until reaching its limit. The cluster was now missing space.
When I realized the problem, I left work early, rushed home and rebooted the failing server.
Having recovered the missing disks, the cluster remained in a degraded state, because of the disk filled earlier. The daemon managing this disk was unable to start, the disk being too full. So I ended up dropping its content and reformatting the disk, hoping I would still have another replica of whatever I was destroying.
On the bright side, the cluster started re-balancing itself, I could finally restart my VMs, … In retrospect, I don’t see how else I could have gotten it back up.
Meanwhile, I actually did lose data in the process: one placement group remained degraded.

The cluster being unusable, and yet storing somewhat relevant data for my personal use, I ended up creating a new pool (implying: new placement groups), re-creating my cloud storage space from scratch.

After two weeks on the Ceph IRC channel, I found one person with the same problem, and no one with an actual solution.
I vaguely heard of the possibility to ‘cat’ files from OSD partitions, to reconstruct an image file matching the one presented by Ceph. Pending further investigations, …

And here we are, a few months later.
The situation is getting worse every day.
I’m finally investigating.

Basically, you have to look at the rbd info output to retrieve a string contained in the names of all files holding your data:

moros:~# rbd -p disaster info one-70
rbd image 'one-70':
    size 195 GB in 50001 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.66d52ae8944a
    format: 2
    features: layering

From there, you’ll use a script to list all the files you actually need to reconstruct your disk.

In my case, the script creates a directory $disk_name, and stores one file per physical host, listing the corresponding blocks.

Then, you’ll use a script to store all these files in a single directory, for further processing.
Note the file names all contain a backslash, therefore our script needs to connect to the OSD host and then run some scp command to send the file to the designated destination. Such a mess requires exchanging SSH keys, … be warned.
Another way to do this may be to share your OSD roots, most likely using NFS, and mount them on the host reconstructing data.

Finally, we can follow Sebastien Han’s procedure, relying on rbd_restore.
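To give an idea of the principle (this is my own rough Python sketch, not Sebastien Han’s script): with order 22, each object holds 4 MiB of the image, and object number N belongs at offset N * 2^22. The sketch assumes each collected file name contains the image id from block_name_prefix followed by a dot and a 16-digit hexadecimal object number. Function names are hypothetical.

```python
import re


def object_index(file_name, image_id):
    """Extract the object number from a collected file name, or None.

    Assumes names embed "<image_id>.<16 hex digits>", e.g.
    66d52ae8944a.000000000000000a for object number 10.
    """
    match = re.search(re.escape(image_id) + r"\.([0-9a-f]{16})", file_name)
    return int(match.group(1), 16) if match else None


def assemble(image_path, object_paths, order=22):
    """Write each object at index * 2**order into a sparse image file.

    object_paths maps object numbers to the files holding their data.
    Objects missing from the map are left as holes, reading back as zeroes.
    """
    object_size = 1 << order
    with open(image_path, "wb") as image:
        for index, path in sorted(object_paths.items()):
            with open(path, "rb") as obj:
                image.seek(index * object_size)
                image.write(obj.read())
```

Note the result may be shorter than the size reported by rbd info if the last objects were never allocated.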

You should end up with a disk image file.
If your disks all contain partitions, which is my case, … then run fdisk -l on the obtained file.
Get the offset where the partition you would like to recover starts, and the block size. Run dd if=your_image bs=block_size skip=start_offset of=$disk_from_partition.
Run fsck on the image obtained.
If you see complaints about bad geometry: block count xx exceeds size of device (yy blocks), then fire up resize2fs.
Assuming XFS, if your dmesg tells about an attempt to access beyond end of device, look for a want=XXX, limit=YYY line to deduce the amount of space missing from your disk. Then, using dd if=/dev/zero of=restored_image seek=${actual size + length to add in MB} obs=1M count=0 should extend your image with zeroes, allowing you to mount your disk.
An exhaustive log is available there.
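For those who don’t trust their dd arithmetic, the partition extraction step can be sketched in Python. This assumes fdisk reports the partition start and length in 512-byte sectors; the function name and parameters are mine.

```python
SECTOR_SIZE = 512  # assumption: fdisk -l reports units of 512-byte sectors


def extract_partition(image_path, output_path, start_sector, sector_count,
                      chunk_sectors=2048):
    """Copy sector_count sectors, starting at start_sector, to output_path.

    Returns the number of bytes actually copied, which may be lower than
    requested when the image is truncated.
    """
    copied = 0
    remaining = sector_count * SECTOR_SIZE
    with open(image_path, "rb") as src, open(output_path, "wb") as dst:
        src.seek(start_sector * SECTOR_SIZE)
        while remaining > 0:
            chunk = src.read(min(remaining, chunk_sectors * SECTOR_SIZE))
            if not chunk:
                break  # reached the end of the image early
            dst.write(chunk)
            copied += len(chunk)
            remaining -= len(chunk)
    return copied
```

A return value lower than sector_count * 512 is a hint your image is shorter than its partition table claims, which is the situation resize2fs or zero padding is meant to address.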

eCryptfs

April 19, 2015


Samuel MARTIN MORO

You may be familiar with eCryptfs, a disk encryption software especially known for being shipped with Ubuntu default installations, as the underlying solution for ‘Encrypted Home’.

From what I gathered through experience, I would say eCryptfs, AKA Enterprise Cryptographic Filesystem, should be avoided in both enterprise and personal use cases.

My first surprise came while crawling a site for its records: creating a file per author, I ended up with more than 5000 files in a single directory. At which point, your dmesg should show something like several ‘Valid eCryptfs header not found in file header region or xattr region, inode XXX’, then a few ‘Watchdog[..]: segfault at 0 ip $addr1 sp $addr2 error 6 in libcontent.so[$addr3]’

The second surprise came a few days later, while recovering rbd images from my Ceph cluster. Storing all parts of a disk in the same directory, again, I ended up with folders holding several thousand files, and my dmesg going mad.
You will notice that moving your folder outside of your eCryptfs directory fixes the problem.

Of course, most users won’t recover their Ceph cluster nor crawl a site for its complete database. However, configuring Thunderbird/Icedove, you may end up caching enough mails from your IMAP/POP server to reach the limits I did.

This is not my first experience with cryptographic solutions.
Once upon a time, TrueCrypt offered a very exhaustive toolkit, managing anything from devices to files. So much so that my end-of-studies project was about forking TrueCrypt, adding features the maintainer did not want to see in his product (BootTruster).
On today’s Linux systems, a more classic way to do it would be to use LUKS (Linux Unified Key Setup), based on a standard device-mapper target: dm-crypt.

Anyway: next time you’re asked about encrypting your home file system, think twice about which solution you’re going to use. Then, choose LUKS.

PXE

April 16, 2015


Samuel MARTIN MORO

Continuing on my project to re-do my internal services, today we’ll talk about the Preboot eXecution Environment, AKA PXE.
The principle is quite simple, and widespread in everyday infrastructures.

Most PXE setups I’ve heard about involve serving installation images for a single system.
Most tutorials I found while struggling to configure my service would stick to serving a specific image, most likely taken from the host system, …
Yet PXE is able to provide a complete set of systems, as well as live environments, from memtests and BIOS update utilities to liveCDs.

Being reluctant to set up an NFS server on a PXE server, I’m avoiding this latter usage.
The main reason being my former work involved OpenVZ virtualization: using NFS is usually a pain in the ass, assuming your physical host/virtual host combination actually supports NFS.
While setting up an NFS client is most likely doable, setting up an NFS server remains a bad idea, … and well, I ended up avoiding all kinds of NFS services running on my core services.

Anyway, back to PXE.
Relying on industry standards such as DHCP (RFC1531, RFC4578), TFTP (RFC783, RFC906) and BOOTP (RFC951), PXE uses a set of low-level protocol clients, allowing for a lightweight implementation, with a footprint low enough to fit on regular network controllers.
Most DHCP servers can be configured to provide their clients with a PXE server address (‘next-server’) and a file (‘filename’), supposed to be distributed by your PXE server.
Once the client has retrieved this information from its DHCP server, it is able to download and execute the given file. Once this first image (usually named pxelinux.0, pxeboot.0, or whatever you want…, there are a lot of variants, mostly doing the same thing) is executed, the client queries the TFTP server again for files contained within the pxelinux.cfg directory.
Looking for additional configuration, the PXE client would usually look for a file matching its complete hardware address, prefixed with its ARP hardware type (eg pxelinux.cfg/01-aa-bb-cc-dd-ee-ff, for an Ethernet NIC whose hardware address is AA:BB:CC:DD:EE:FF). The client would then query for a file matching its complete IP address, in uppercase hexadecimal (eg pxelinux.cfg/C000025A, the client IP address being 192.0.2.90). If not found, the last hex digit of the previous query is removed, and then again, … until the string is empty, at which point the client queries for a file named ‘default’.
The ‘default’ file, like any file from pxelinux.cfg, can either define a menu layout and several boot options, or simply boot a given image without further notice.
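The lookup order is easier to grasp written down. Here is a small Python sketch listing the file names a PXELINUX client would try, in order, under pxelinux.cfg/. It assumes an Ethernet NIC (PXELINUX prefixes the MAC address with the ARP hardware type, 01 for Ethernet) and skips the client UUID lookup some firmwares perform first. The function name is mine.

```python
import ipaddress


def pxelinux_config_candidates(mac, ip):
    """Return the config file names tried, in order, under pxelinux.cfg/."""
    # Hardware address, lowercase, dash-separated, prefixed with ARP type 01.
    names = ["01-" + mac.lower().replace(":", "-")]
    # IPv4 address as 8 uppercase hex digits, then shortened one digit at a time.
    hex_ip = "%08X" % int(ipaddress.IPv4Address(ip))
    names += [hex_ip[:length] for length in range(8, 0, -1)]
    # Last resort.
    names.append("default")
    return names
```

For AA:BB:CC:DD:EE:FF and 192.0.2.90, the list starts with 01-aa-bb-cc-dd-ee-ff, goes on with C000025A, C000025, C00002 and so on down to C, and ends with default. Dropping a file named C00002 would thus match the whole 192.0.2.0/24 network.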

PXE uses a very small set of commands, kernel, initrd and append being the most common ones when booting a system.
Setting up a system requires you to download the proper kernel and initramfs, then declare a menu item loading them. Automating the process is thus pretty straightforward: two files to download, a template to include from your main menu, … See puppet defines, downloading CentOS, CoreOS, Debian, Fedora, FreeBSD, mfsBSD, OpenBSD, OpenSUSE and Ubuntu.
Most initramfs can be started with specific arguments such as which device to use, which keyboard layout to load, which repository to use, or even a URL the system installation process would use to retrieve either a preseed, a kickstart, or anything that may help automate the installation process.
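As an illustration of how little templating is involved, here is a Python sketch rendering one menu item. This is not my puppet define, just the same idea boiled down; the paths and the preseed URL used in the usage note are made up.

```python
def menu_entry(label, kernel, initrd=None, append=""):
    """Render one PXELINUX menu item as configuration text.

    kernel and initrd are paths relative to the TFTP root; append holds
    the arguments passed to the kernel and its installer.
    """
    lines = ["LABEL %s" % label, "    KERNEL %s" % kernel]
    if initrd:
        lines.append("    INITRD %s" % initrd)
    if append:
        lines.append("    APPEND %s" % append)
    return "\n".join(lines) + "\n"
```

Something like menu_entry("jessie", "debian/jessie/linux", "debian/jessie/initrd.gz", "auto=true url=http://preseed.example.com/jessie.cfg") would describe an unattended Debian installation, assuming such files and such a preseed server exist on your side.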

If unattended installations have limited benefits within smaller networks (not using CDs is still a big one), they are tremendous for huge networks.
Targeting such infrastructures, products emerged bringing PXE to the next level.
Cobbler, for instance, is a system-agnostic Python-based command-line tool managing PXE configuration. Its primary functionality is to automate repetitive actions and encourage the reuse of existing data through templating.
MAAS, from Canonical, in combination with Juju and based on Ubuntu, is able to bootstrap complete infrastructures, which is what they do, I guess, with BootStack.

DNS

April 15, 2015


Samuel MARTIN MORO

Wandering on Wikipedia, we can learn that in the early ages of the Internet, some guy at the Stanford Research Institute maintained a file mapping alphanumeric hostnames to their numeric addresses on the ARPANET.
Later on, the first concepts were defined (RFC882 and RFC883, later superseded by RFC1034 and RFC1035), leading to the first implementation (BIND, 1984).

I’m not especially familiar with the history of a technology older than me, and yet DNS servers are one of the few cornerstones of networks, from your local network to the Internet as we know it.
The idea hasn’t changed since ARPANET: we prefer to use human-readable juxtaposed words, over a 4-number identifier, to access a service. Thus, contributions multiplied to scale, replicate, decentralize or secure a unified and standardized directory.

While hosting your own DNS server could be pretty straightforward, popular services are usually popular targets, with their flaws. Carefully defining the role your server is to fulfil is crucial, though usually overlooked.

The most common attack targeting DNS servers is well known, and has been exploiting poorly configured servers for over 10 years: amplification attacks.
An attacker would spoof his IP with his victim’s, querying for ? ANY isc.org (64 bytes in), resulting in a ~3200-byte answer. DNS relies on UDP: there is no connection state. Spoofing only requires you to write your headers properly (without needing to answer some ACK), which makes any UDP protocol especially vulnerable.
Assuming your DNS server answers such queries, our attacker would have amplified his traffic by 50, as well as hidden his address.
Note that while you may configure ACLs at the software level, even a denied client would be answered. To avoid sending anything that may end up in a DDoS attempt, there’s nothing like a good ol’ firewall.
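The arithmetic behind that factor of 50 fits in a few lines of Python. The packet sizes are the ones quoted above; the function names and the uplink figure in the usage note are only there for illustration.

```python
def amplification_factor(query_bytes, answer_bytes):
    """Ratio between what the server sends and what the attacker sent."""
    return answer_bytes / query_bytes


def reflected_bandwidth(attacker_bps, query_bytes, answer_bytes):
    """Traffic reaching the victim, for a given attacker uplink."""
    return attacker_bps * amplification_factor(query_bytes, answer_bytes)
```

With a 64-byte query and a 3200-byte answer, amplification_factor gives 50: an attacker with a 10 Mbit/s uplink would have open resolvers throw 500 Mbit/s at his victim.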

While carrier-grade solutions may involve reverse-path checking of inbound traffic, the only thing to do for us regular folks is to restrict access to our DNS servers to the very clients we know.
Renting servers in Dedibox and Leaseweb facilities, I’ve found out both management interfaces allow me to replicate my zones to their DNS servers. Thus, my split-horizon is configured to publicly announce Leaseweb and Dedibox’s NS as my masters, while these are the only public clients allowed to reach my masters.

Bind / named

The historic implementation, and the most widespread. Its only competitor in terms of features being PowerDNS.
BIND implements what they call DLZ (Dynamically Loadable Zones), allowing the storage of records in a backend such as PostgreSQL or an ODBC source.
Having tested the OpenLDAP connector for a few years, I’d mostly complain about being forced to patch and build my own package (RFC3986, http://repository.unetresgrossebite.com/sources/dlz-ldap+r1-9.9.3-P2.patch), and being unable to resolve addresses containing characters such as ‘(‘. In the end, keeping my databases as plain files is easier to maintain.

Its usages include authoritative zone serving, zone replication and caching, split-horizon, TSIG, IPv6 and wildcard records, records caching, DNS & DNSSEC resolution and recursion, lying DNS using RPZ, …


Unbound

A relatively new solution (2004), targeting cache and DNSSEC related features.
Although you may declare records in its configuration, Unbound won’t answer transfer queries, and thus does not qualify as an authoritative name server.
Unbound is perfect either for home usage, or as a local cache for your servers.

Being also vulnerable to amplification attacks, keep in mind to deny access from unexpected clients.


NSD

NSD is a drop-in replacement for BIND’s zone serving features, though it won’t provide split-horizon, caching or recursive DNS resolution.
Its features being restricted to serving and replicating zones, NSD only applies to authoritative usage and should be used in conjunction with some cache solution such as Unbound.

Dnsmasq

One of the most popular implementations, embedded in devices such as cable box routers, Linksys WRT54G mods, or even Ubuntu desktop installations. The key feature being Dnsmasq is a DNS server embedding a DHCP server. Or vice-versa.
Like Unbound, serving configured records is possible, though Dnsmasq is not authoritative.

Vulnerable to amplification attacks, but most likely not to be exposed.

see:

Former ansible repository, dnsmasq module

UTGB Refactoring

Back on the subject, refactoring my services. Today’s topic, obviously, DNS.

Being particularly fond of my split-horizon, NSD does not apply in my case.
Continuing to validate and deploy my puppet modules, reinstalling my self-hosted services, I’ve ended up setting up a new BIND server, rewriting my zones from LDAP to plain-text files.
After several spontaneous rewrites from scratch over the last few years, one thing I often miss when managing my DNS zones is the ability to synchronize a set of values (let’s say, NS, MX and TXT records) across all my zones, without having to edit 30 files. Thus, you’ll note the latest named module in my puppet repository stores temporary zones in /usr/share/dnsgen, and allows you to use a single SOA record template as well as a couple of zone headers and trailers to generate an exhaustive split-horizon configuration.

Next step on the subject would be to set up some key infrastructure, then publish my key to Gandi and serve my own DNSSEC records, …

Meanwhile, I’ve started replacing my DNS caches as well. From Unbound to Unbound.
The specificity of my caches is that their configuration declares about 100,000 names, redirecting them to some locally-hosted pixel server.
The big news in this new version of the unbound module is that the names source isn’t static any more, and is regularly downloaded and updated. The current list now includes around 6,000 entries, and might be completed later on.
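The generated configuration itself is nothing fancy. As a rough idea, and not the actual puppet module, here is a Python sketch turning a list of names into Unbound local-zone and local-data lines, to be included within the server: section of unbound.conf. The function name and the pixel server address in the usage note are hypothetical.

```python
def unbound_redirects(names, pixel_ip):
    """Render redirect zones pointing every given name to pixel_ip."""
    # Normalize names: trim, lowercase, drop trailing dot, skip blanks,
    # and remove duplicates so Unbound does not complain about them.
    cleaned = sorted({name.strip().lower().rstrip(".")
                      for name in names if name.strip()})
    lines = []
    for name in cleaned:
        lines.append('local-zone: "%s." redirect' % name)
        lines.append('local-data: "%s. A %s"' % (name, pixel_ip))
    return "\n".join(lines) + "\n"
```

With a redirect zone, the name and everything below it get the same answer, so unbound_redirects(["ads.example.com"], "192.0.2.10") also covers any subdomain of ads.example.com.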

UTGB Refactoring

April 15, 2015


Samuel MARTIN MORO

Since December, and until further notice, I’ve been experimenting on my services, replacing my old VMs one by one.

Corresponding puppet modules are available at https://gitlab.unetresgrossebite.com/DevOps/puppet/

Having experienced some Ceph disaster (lost PG), the next big step is to drop two hosts from my current crushmap, using them to start a new Ceph cluster, and migrate my disks progressively.

EDIT from May 17th:

All my hosts now depend on this new puppetmaster.
