
Crawlers

If you host a public site, you've already dealt with them.
If you use your favorite search engine, you're indirectly subject to their work as well.
Crawlers are bots that query and sometimes map your site, feeding some search engine's database.

When I started looking into the subject a few years back, all you had to know about was /robots.txt, to potentially prevent your site from being indexed, or at least restrict such accesses to the relevant parts of your site.
More recently, we've seen the introduction of XML files such as sitemaps, allowing you to efficiently serve search engines a map of your site.
This year, Google revised its “rules” to also prioritize responsive and platform-optimized sites. As such, they now recommend allowing crawling of JavaScript and CSS files, warning (threatening, really) that blocking these accesses could result in your ranking being lowered.
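
For reference, a minimal /robots.txt could look like the following sketch (the disallowed paths and the sitemap URL are placeholders to adapt to your own site); the Sitemap directive is one way to point search engines to such an XML map:

    # applies to all crawlers
    User-agent: *
    # hypothetical paths we do not want indexed
    Disallow: /admin/
    Disallow: /tmp/

    # location of the XML site map
    Sitemap: https://www.example.com/sitemap.xml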

At this point, with scripts and style sheets being indexed, you might say (I'm surprised not to find the remark in the comments) that Google actually indexes your site's vulnerabilities, building not only a directory of the known internet, but a complete map of everyone's hidden pipes, which could some day be used to breach your site, if it hasn't been already.

Even if Google is the major actor on that matter, you've probably also dealt with Yahoo, Yandex.ru, MJ12, Voltron, … whose practices are similar. Over the last few years, checking your web server logs, you might have noticed a significant increase in the proportion of bot queries compared to human visits. This is partly due to the proliferation of search engines, though I suspect it is mostly thanks to botnets.
These crawlers can be identified by checking the User-Agent header sent with every HTTP request. On low-traffic sites, crawlers may very well be your only clients.
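
To get an idea of that proportion, a small Apache sketch (hypothetical log paths, and a deliberately short list of User-Agent patterns) could split crawler and human traffic into separate logs:

    # tag requests whose User-Agent matches a few well-known crawlers
    SetEnvIfNoCase User-Agent "(googlebot|bingbot|yandex|mj12bot)" is_crawler
    # log them apart from the rest of the traffic
    CustomLog /var/log/apache2/crawlers.log combined env=is_crawler
    CustomLog /var/log/apache2/humans.log combined env=!is_crawler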

Assuming your sites are subject to DDoS attacks and scans for software vulnerabilities (the most targeted solutions being WordPress, phpBB and phpMyAdmin), you should know that attackers will eventually masquerade their User-Agent, most likely branding themselves as Googlebot.

To guarantee that a “googlebot”-branded query actually comes from some Google server, you just need to check the pointer (PTR) record associated with the client's IP. A way to do so in Apache (2.4) could be to use something like this (a PoC to complete and experiment with).
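
One possible sketch, relying on the double reverse-DNS lookup that Require host performs (the directory path and the environment variable name are arbitrary):

    # Requests claiming to be Googlebot are only accepted when Apache's
    # reverse-then-forward DNS check confirms the client resolves under
    # googlebot.com or google.com; everyone else is served as usual.
    SetEnvIfNoCase User-Agent "Googlebot" claims_googlebot

    <Directory "/var/www/html">
        <RequireAny>
            # ordinary clients, not claiming to be Googlebot
            <RequireAll>
                Require all granted
                Require not env claims_googlebot
            </RequireAll>
            # self-proclaimed Googlebot: reverse and forward DNS must match
            Require host googlebot.com google.com
        </RequireAny>
    </Directory>

Note that Require host triggers DNS lookups on every matching request, which has a cost, so you may want some caching in front of it in production.
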
Still, maybe it's wiser to just drop all Google requests as well. Without encouraging Tor usage, it's probably time to switch to DuckDuckGo?

Otherwise, another good way to deny these connections is described here; I may try to add something like this to my Puppet classes.
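
Whatever the linked approach is exactly, a crude Apache-level variant (the User-Agent pattern list below is only an illustration) could be to refuse a blacklist of crawlers outright:

    # flag requests from a short blacklist of crawler User-Agents
    BrowserMatchNoCase "(mj12bot|ahrefsbot|semrushbot)" bad_bot

    <Location "/">
        <RequireAll>
            Require all granted
            # reject anything flagged above
            Require not env bad_bot
        </RequireAll>
    </Location>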
