{"id":157,"date":"2015-07-30T17:06:38","date_gmt":"2015-07-30T15:06:38","guid":{"rendered":"https:\/\/blog.unetresgrossebite.com\/?p=157"},"modified":"2015-08-01T03:26:58","modified_gmt":"2015-08-01T01:26:58","slug":"crawlers","status":"publish","type":"post","link":"https:\/\/blog.unetresgrossebite.com\/?p=157","title":{"rendered":"Crawlers"},"content":{"rendered":"<p>If you host a public site, you&#8217;ve dealt with them already.<br \/>\nIf you use a search engine, you indirectly rely on their work as well.<br \/>\nCrawlers are bots that query and sometimes map your site, feeding some search engine&#8217;s database.<\/p>\n<p>When I started looking at the subject, a few years back, you only had to know about \/robots.txt, to potentially prevent your site from being indexed, or at least restrict crawling to the relevant contents of your site.<br \/>\nMore recently, we&#8217;ve seen the introduction of XML files such as sitemaps, which efficiently serve search engines a map of your site.<br \/>\nThis year, <a href=\"http:\/\/googlewebmastercentral.blogspot.fr\/2015\/04\/rolling-out-mobile-friendly-update.html\">Google revised its &#8220;rules&#8221;<\/a> to prioritize responsive and platform-optimized sites as well. 
As such, they now recommend allowing crawlers to fetch JavaScript and CSS files, warning (<a href=\"http:\/\/googlewebmastercentral.blogspot.fr\/2014\/10\/updating-our-technical-webmaster.html\">threatening<\/a>) that blocking them could lower your ranking.<\/p>\n<p>At this point, with scripts and style sheets being indexed, you might say (I&#8217;m surprised not to find the remark in the comments) that Google actually indexes your site&#8217;s vulnerabilities, creating not only a directory of the known internet, but a complete map of everyone&#8217;s hidden pipes, one that could some day be used to breach your site, if it isn&#8217;t already.<\/p>\n<p>Even if Google is the major player here, you have probably dealt with Yahoo, Yandex.ru, MJ12, Voltron, &#8230; whose practices are similar. Over the last few years, checking your web server logs, you might have noticed a significant increase in the proportion of bot queries relative to human visits. This is in part due to the proliferation of search engines, though I suspect it is mostly thanks to botnets.<br \/>\nThese crawlers can be identified by checking the User-Agent header sent with HTTP requests. On low-traffic sites, crawlers may very well be your only clients.<\/p>\n<p>If your sites are subject to DDoS attacks or scans for software vulnerabilities (the top targets being WordPress, phpBB and phpMyAdmin), you should know attackers will eventually spoof their User-Agent, most likely branding themselves as Googlebot.<\/p>\n<p>To check that a &#8220;Googlebot&#8221;-branded query actually comes from a Google server, look up the pointer (PTR) record associated with the client&#8217;s IP, then confirm that the returned host name resolves back to that same IP. 
One way to do so in Apache (2.4) could be to <a href=\"https:\/\/gitlab.unetresgrossebite.com\/DevOps\/puppet\/tree\/master\/modules\/apache\/templates\/seo.erb\">use something like this<\/a> (a PoC, to complete and experiment with).<br \/>\nStill, maybe it is wiser to just drop all Google requests as well. Without encouraging Tor usage, it&#8217;s probably time to switch to DuckDuckGo?<\/p>\n<p>Otherwise, another good way to deny these connections is <a href=\"https:\/\/perishablepress.com\/eight-ways-to-blacklist-with-apaches-mod_rewrite\/\">described here<\/a>. I may try to add something like this to my Puppet classes.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you host a public site, you&#8217;ve dealt with them already. If you use a search engine, you indirectly rely on their work as well. Crawlers are bots that query and sometimes map your site, feeding some search engine&#8217;s database. When I started looking at the subject, a few years back, you only had to know about \/robots.txt, to potentially prevent 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[8,10,2],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/posts\/157"}],"collection":[{"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=157"}],"version-history":[{"count":8,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/posts\/157\/revisions"}],"predecessor-version":[{"id":176,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=\/wp\/v2\/posts\/157\/revisions\/176"}],"wp:attachment":[{"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=157"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=157"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.unetresgrossebite.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=157"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}