How to stop bots from crawling your website

Last updated November 15, 2018

Bots crawling our website every minute are becoming a problem. They are using server resources without giving anything back.

With the exception of search engine bots like Google or DuckDuckGo, they are of no use to our website.

In this article we will cover the two most common ways to protect your website from unwanted bots, crawlers and spiders: limiting their crawl rate and blocking them from the website completely.

Limiting access for unwanted visitors may also help you improve your website’s SEO.

How to find unwanted crawlers

A look at the visitor statistics pulled from our server log files revealed heavy bot activity eating away at our bandwidth.

If you are not yet using a tool to parse your access log files, you should start now!
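If you have no log analysis tool at hand, a few lines of Python can give a rough picture. The sketch below is not the tool we used for our statistics; it assumes the server writes the common "combined" log format (request, status, bytes, referer, user agent), and the function name `bandwidth_by_agent` is made up for this illustration.

```python
import re
from collections import Counter

# Assumed "combined" log format; the tail of each line looks like:
#   "GET /page HTTP/1.1" 200 5120 "referer" "user agent"
LOG_TAIL = re.compile(
    r'"[^"]*" (?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"\s*$'
)


def bandwidth_by_agent(lines):
    """Return a Counter mapping each user-agent string to total bytes sent."""
    totals = Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        size = match.group("size")
        # A "-" in the size field means no response body was sent.
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return totals
```

To use it, open your access log (its path depends on your server setup), pass the file object to `bandwidth_by_agent`, and print `.most_common(10)` on the result to see the ten user agents that consumed the most bandwidth.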

Bad bots

Crawlers from marketing and ratings agencies like Ahrefs, Semrush and the like are considered bad because they eat up server resources and provide statistics about your website to your competitors.

Ahrefs

Ahrefs turns out to be particularly bad. We looked into it a couple of months ago and blocked its crawler through the website’s robots.txt file:

User-agent: AhrefsBot 
Disallow: /
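To double-check that a rule like this says what you intend, you can run it through the robots.txt parser in Python’s standard library. This is only a sketch: it shows how a compliant parser reads the rule, not how any particular crawler behaves, and the helper name `is_allowed` is made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /
"""


def is_allowed(robots_txt, bot_name, path="/"):
    """Return True if a compliant parser would let bot_name fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pass the bot's name token (e.g. "AhrefsBot"), not the full
    # "Mozilla/5.0 (compatible; ...)" user-agent string.
    return parser.can_fetch(bot_name, path)
```

With the rule above, `is_allowed(ROBOTS_TXT, "AhrefsBot")` is `False`, while a bot that is not named in the file, such as `"Googlebot"`, is still allowed.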

It shouldn’t be in our top 10 visitor bots statistics table! Ahrefs states on its website that its bot respects robots.txt.

In reality, the Ahrefs bot doesn’t respect robots.txt at all! It still crawls our website, as shown by our server access statistics. And it does so a lot!

Time to bring out the big guns!

Top 5 bad bots

We will pick the top 5 bad bots that we don’t want visiting our website and lock them out on the server side through the .htaccess file. When they visit our website, they will get a 403 Forbidden error.

Here are our top 5 bad bots:

  1. Ahrefs – SEO tool bot
  2. Semrush – SEO tool bot
  3. MJ12bot or Majestic bot – SEO tool
  4. DotBot – we are not an ecommerce site
  5. CCBot – marketing

There is a huge list of other bots that you can block at tab-studio. We won’t bother with that many and will block only the most active spiders.

Block crawlers with .htaccess

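Here is the code to insert into the .htaccess file to block bad bots:

#bad bots start
RewriteEngine On
# The last bot name must not end with "|": an empty alternative matches every visitor
RewriteCond %{HTTP_USER_AGENT} \
ahrefs|\
semrushbot|\
mj12bot|\
dotbot|\
ccbot\
 [NC]
RewriteRule .* - [F]
#bad bots end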

This code didn’t work for some of our websites that already had other blocking rules in place.

Here is another version that blocks multiple bots with a single statement in .htaccess:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(ahrefs|semrushbot|mj12bot|dotbot|ccbot).*$ [NC]
RewriteRule .* - [F,L]
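Before deploying a rule like this, it helps to check which user agents the pattern would catch. The sketch below mirrors the same alternation in Python, with `re.IGNORECASE` standing in for the [NC] flag. It is an approximation, since Apache uses its own regex engine, but for a plain list of names like this one the two should agree. The function name `would_be_blocked` is made up for this illustration.

```python
import re

# Same list of names as the RewriteCond pattern. An unanchored search is
# equivalent to wrapping the group in ^.* and .*$ as the rule above does.
BAD_BOTS = re.compile(r"ahrefs|semrushbot|mj12bot|dotbot|ccbot", re.IGNORECASE)


def would_be_blocked(user_agent):
    """Return True if the pattern matches anywhere in the user-agent string."""
    return BAD_BOTS.search(user_agent) is not None
```

Feed it user-agent strings copied from your own access logs: anything containing one of the five names, in any letter case, should come back `True`, and regular browsers and search engine bots should come back `False`.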


Good bots

Crawlers that visit our website in order to index the content for the search engine users are good bots because they send their visitors to us.

But some of these good bots do a little too much. They crawl a lot, but don’t give back on the same level as Google.

Bots that are good but too active will be slowed down so that they crawl less.

Our benchmark will be the Google bots, since most of our visitors come from Google. Bots that do not perform on Google’s level, but eat the same or more of our resources, will get their crawl rate cut.

Here are the bots that we would like to slow down:

  1. Bingbot – not many people come through Bing, so there is no point in it taking so many resources from us.
  2. oBot – some research center filtering content. We had never heard of it. It may be useful, so we won’t block it, but we will slow it down a bit.
  3. GrapeshotCrawler – ads
  4. proximic – marketing

These are the bots that used the most bandwidth according to our server access logs.

Limit crawl rate

In order to limit the crawl rate of good bots we will use the robots.txt file. These are good bots and they will probably respect our robots.txt file.

Add the crawl-delay directive to your robots.txt with the number of seconds to wait between page crawls, for example a 10-second delay:

crawl-delay: 10

We would like to give Bingbot a 10-second delay and the other 3 bots a 10-minute delay:

User-agent: bingbot
Crawl-delay: 10

User-agent: oBot 
Crawl-delay: 600

User-agent: GrapeshotCrawler 
Crawl-delay: 600

User-agent: proximic 
Crawl-delay: 600
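As a sanity check, the same standard-library parser can report which delay each bot is assigned. This is a sketch of how a compliant parser reads the file; whether a given crawler honors Crawl-delay is up to that crawler. The helper name `delay_for` is made up for this illustration, and `crawl_delay` requires Python 3.6 or newer.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: oBot
Crawl-delay: 600

User-agent: GrapeshotCrawler
Crawl-delay: 600

User-agent: proximic
Crawl-delay: 600
"""


def delay_for(robots_txt, bot_name):
    """Return the Crawl-delay in seconds for bot_name, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(bot_name)
```

Here `delay_for(ROBOTS_TXT, "bingbot")` gives 10 and the other three bots get 600, while a bot that is not listed gets `None`, meaning no delay is requested.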

Let’s see how it will perform!
