Bots crawling our website every minute are becoming a problem. They are using server resources without giving anything back.
With the exception of search engine bots like Google or DuckDuckGo, they are of no use to our website.
In this article we will show the two most common ways to protect your website from unwanted bots, crawlers and spiders: limiting their crawl rate, or blocking them from the website completely.
Limiting access to unwanted visitors may also help you improve your website’s SEO.
The visitor statistics pulled from our server log files revealed heavy bot activity eating away at our bandwidth.
If you are not yet using a tool to parse your access log files, now is the time to start. We recommend AWStats. It’s easy to use and open source.
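If you just want a quick look before setting up a full log analyzer, the idea can be sketched in a few lines of Python. This is a minimal sketch, assuming your server writes the common "combined" log format (request, status, bytes, referer, user agent); the sample lines and the function name `bandwidth_by_agent` are made up for illustration and are not taken from our real logs.

```python
import re
from collections import Counter

# Assumed "combined" log format tail:
#   "request" status bytes "referer" "user-agent"
LOG_TAIL = re.compile(
    r'"[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"$'
)


def bandwidth_by_agent(lines):
    """Return a Counter mapping each user-agent string to total bytes sent."""
    totals = Counter()
    for line in lines:
        match = LOG_TAIL.search(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        sent = match.group("bytes")
        # A "-" in the bytes column means no body was sent (e.g. a 304).
        totals[match.group("agent")] += 0 if sent == "-" else int(sent)
    return totals


# Hypothetical sample lines; replace with open("access.log") for real data.
sample_lines = [
    '203.0.113.5 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; AhrefsBot)"',
    '203.0.113.5 - - [01/Jan/2024:00:00:02 +0000] "GET /blog HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; AhrefsBot)"',
    '203.0.113.5 - - [01/Jan/2024:00:00:03 +0000] "GET /blog HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; AhrefsBot)"',
    '198.51.100.7 - - [01/Jan/2024:00:00:04 +0000] "GET / HTTP/1.1" 200 1000 "-" "Mozilla/5.0 (compatible; bingbot)"',
]

totals = bandwidth_by_agent(sample_lines)
for agent, sent in totals.most_common(5):
    print(f"{sent:>10}  {agent}")
```

Sorting by bytes rather than by hit count puts the bots that cost the most bandwidth at the top of the list.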
Crawlers from marketing and ratings agencies such as Ahrefs and Semrush are considered bad because they consume server resources and provide statistics about your website to your competitors.
Ahrefs turned out to be particularly bad. We looked into it a couple of months ago and blocked its crawler through the website’s robots.txt file:
User-agent: AhrefsBot
Disallow: /
It shouldn’t be in our top 10 visitor bots statistics table! According to their website, AhrefsBot respects robots.txt.
In reality, Ahrefs bot doesn’t respect robots.txt at all! They crawl our website as shown by our server access statistics. And they do it a lot!
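Before blaming the crawler, it is worth ruling out a typo in the rule itself. The following is a minimal sketch that feeds the two-line rule to Python's standard `urllib.robotparser` and asks whether AhrefsBot may fetch a page; the URL `https://example.com/any-page` and the name `SomeOtherBot` are placeholders, and the check only tells you how a compliant parser reads the rule, not how any particular bot behaves.

```python
from urllib.robotparser import RobotFileParser

# The rule from our robots.txt, one directive per line.
ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL; any path on the site is covered by "Disallow: /".
page = "https://example.com/any-page"
ahrefs_allowed = parser.can_fetch("AhrefsBot", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)

print("AhrefsBot allowed:", ahrefs_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

If the parser reports that AhrefsBot is not allowed while the access logs still show its requests, the rule is fine and the bot is simply not obeying it.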
Time to bring out the big guns!
We will now pick the top 5 bad bots that we don’t want visiting our website and lock them out on the server side through the .htaccess file. When they visit our website, they will get a 403 Forbidden error.
Here are our top 5 bad bots:
There is a huge list of other bots that you can block at tab-studio. We won’t bother with so many, but will block only the most active spiders.
Here is the code to insert into your .htaccess file to block the bots:
#bad bots start
# Return 403 Forbidden when the user agent contains any of these names (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} \
ahrefs|\
semrushbot|\
mj12bot|\
dotbot|\
ccbot \
[NC]
RewriteRule .* - [F]
#bad bots end
This code didn’t work on some of our websites that already had other blocking rules in place.
Here is another version of how to block multiple bots in one statement in .htaccess:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(ahrefs|semrushbot|mj12bot|dotbot|ccbot).*$ [NC]
RewriteRule .* - [F,L]
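A pattern like this can be tried out offline before it goes live. Below is a minimal sketch that applies the same expression with Python's `re` module; Apache uses a different regex engine (PCRE), but this simple alternation is written the same way in both, so the result should carry over. The user-agent strings and the helper name `is_blocked` are illustrative assumptions.

```python
import re

# Same pattern as in the RewriteCond; re.IGNORECASE plays the role of [NC].
BAD_BOTS = re.compile(
    r"^.*(ahrefs|semrushbot|mj12bot|dotbot|ccbot).*$", re.IGNORECASE
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the rule would answer this user agent with a 403."""
    return BAD_BOTS.search(user_agent) is not None


# Hypothetical user-agent strings for a quick sanity check.
samples = [
    "Mozilla/5.0 (compatible; AhrefsBot)",
    "Mozilla/5.0 (compatible; SemrushBot)",
    "Mozilla/5.0 (compatible; bingbot)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox",
]

for agent in samples:
    print("BLOCK" if is_blocked(agent) else "allow", agent)
```

Because the match is a case-insensitive substring test, any user agent that merely contains one of the names is blocked, so it pays to check a few regular browser strings as well.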
Crawlers that visit our website in order to index the content for the search engine users are good bots because they send their visitors to us.
But some of these good bots do a little too much. They crawl a lot, but don’t give back on the same level as Google.
Good bots that are too active will be slowed down so that they crawl less.
Our benchmark will be Googlebot, since most of our visitors come from Google. Bots that do not perform at Google’s level but eat the same or more of our resources will get their crawl rate cut.
Here are the bots that we would like to slow down:
The list comes from our server access logs and shows the bots that used the most bandwidth.
In order to limit the crawl rate of good bots we will use the robots.txt file. These are good bots and they will probably respect our robots.txt file.
Add the crawl-delay directive to your robots.txt with the number of seconds to wait between page crawls, for example, a 10-second delay:
crawl-delay: 10
We would like to give a 10-second delay to Bingbot and a 10-minute delay to the other 3 bots:
User-agent: bingbot
Crawl-delay: 10
User-agent: oBot
Crawl-delay: 600
User-agent: GrapeshotCrawler
Crawl-delay: 600
User-agent: proximic
Crawl-delay: 600
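To double-check that each bot ends up with the intended delay, the rules can be read back with Python's standard `urllib.robotparser`. This is only a sketch of how a compliant parser interprets the file; whether a given crawler honors crawl-delay is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# The crawl-delay rules from our robots.txt, one directive per line.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10
User-agent: oBot
Crawl-delay: 600
User-agent: GrapeshotCrawler
Crawl-delay: 600
User-agent: proximic
Crawl-delay: 600
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the delay in seconds, or None if no rule applies.
delays = {
    agent: parser.crawl_delay(agent)
    for agent in ("bingbot", "oBot", "GrapeshotCrawler", "proximic")
}

for agent, delay in delays.items():
    print(f"{agent}: {delay} seconds")
print("Googlebot:", parser.crawl_delay("Googlebot"))
```

Bots without their own entry, such as Googlebot here, get no delay from these rules.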
Let’s see how it will perform!