The rapid rise in malicious bots crawling the web has caused hosting companies, content delivery networks (CDNs) and server admins to block bots they do not recognise in log file data. Unfortunately, this means that DeepCrawl can be accidentally blocked by a client’s website.
In this guide, we will help you identify indicators that a crawl is being blocked and provide solutions to unblock DeepCrawl and allow you to crawl your site.
How to identify that a crawl is being blocked
The most common indicators of a crawl being blocked are listed below.
Unauthorised and Forbidden crawl errors
If a server is blocking DeepCrawl from accessing a website, reports will display many URLs with HTTP 401 Unauthorised or 403 Forbidden responses.
To find URLs with these errors, navigate to the Unauthorised Pages report.
Too Many Requests
If DeepCrawl is exceeding the number of requests your site is able to receive, reports will display many URLs with HTTP 429 Too Many Requests responses.
To find URLs with these errors, navigate to the Uncategorised HTTP Response Codes report.
Slow crawl and connection timeout errors
When a block takes effect partway through a crawl, the crawl starts successfully but slows down progressively and eventually appears to stop running completely.
Any URLs that were crawled and are displayed in DeepCrawl will show connection timeout errors in reports.
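As a rough illustration of the three indicators above, the sketch below tallies a list of crawl results by indicator. The input format (a list of HTTP status codes, with `None` standing for a connection timeout) and the name `summarise_block_indicators` are assumptions made for this example. They are not part of any DeepCrawl export or API.

```python
from collections import Counter
from typing import Iterable, Optional

# Hypothetical mapping from HTTP status code to the indicators described above.
INDICATORS = {
    401: "unauthorised_or_forbidden",
    403: "unauthorised_or_forbidden",
    429: "too_many_requests",
}


def summarise_block_indicators(statuses: Iterable[Optional[int]]) -> Counter:
    """Count crawl results per block indicator.

    A status of None is treated as a connection timeout
    (an assumption made for this sketch).
    """
    counts: Counter = Counter()
    for status in statuses:
        if status is None:
            counts["connection_timeout"] += 1
        else:
            # Any status not listed in INDICATORS is counted as "other".
            counts[INDICATORS.get(status, "other")] += 1
    return counts


if __name__ == "__main__":
    sample = [200, 403, 403, 429, None, 401, 200]
    for indicator, count in summarise_block_indicators(sample).items():
        print(f"{indicator}: {count}")
```

A high share of results in any of the first three buckets, relative to "other", would suggest that the crawl is being blocked rather than that individual pages are failing.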
Why does DeepCrawl get blocked?
The reason for your crawl being blocked is most likely due to one of the following:
- DeepCrawl uses Amazon Web Services (AWS), which may be blocked by default because it is also often used by scrapers.
- An automated system is in place on the server or CDN which detects and blocks suspicious activity. For example, if the number of requests exceeds a set limit, Cloudflare starts blocking requests.
- A manual block has been implemented by a server administrator, based on manual inspection of server activity, possibly triggered by a high load caused by the crawl, or a large number of crawl errors.
- The crawl uses a Googlebot user agent but fails a reverse DNS lookup, so it appears to be a scraper which is spoofing Googlebot.
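The reverse DNS check mentioned in the last point can be sketched as follows. This is a minimal illustration of how a server might verify a Googlebot claim, not a description of any particular CDN's implementation. The function names and the list of accepted hostname suffixes are assumptions for this example, and the forward lookup used here only handles IPv4.

```python
import socket
from typing import Callable, List

# Hostname suffixes assumed here to indicate a genuine Googlebot host.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_lookup(ip: str) -> str:
    # socket.gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname: str) -> List[str]:
    # socket.gethostbyname_ex returns (hostname, aliaslist, ipaddrlist); IPv4 only.
    return socket.gethostbyname_ex(hostname)[2]


def is_genuine_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse_lookup,
    forward: Callable[[str], List[str]] = _forward_lookup,
) -> bool:
    """Return True only if the IP passes a reverse and a forward DNS check."""
    try:
        hostname = reverse(ip)
    except OSError:
        # No PTR record or lookup failure: treat the request as unverified.
        return False
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # The hostname must resolve back to the same IP address.
        return ip in forward(hostname)
    except OSError:
        return False
```

A request from an AWS address that presents a Googlebot user agent fails the hostname suffix test in this sketch, which is why such a request can look like a spoofing scraper.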
How can I remove the block and crawl my website?
Our team recommends the following solutions if you suspect that DeepCrawl is being blocked.
Whitelist our IP
Provided you know the site, you can ask the server administrators to whitelist the default IP that DeepCrawl uses to crawl: 188.8.131.52.
Change user-agent in project settings
Some websites will block requests which come from a Googlebot user-agent (DeepCrawl’s default user-agent) but do not originate from a Google IP address. In this scenario, selecting a different user agent in a crawl’s advanced project settings often makes the crawl succeed.
Use Stealth Mode
DeepCrawl’s ‘Stealth Mode’ feature can be found in a crawl’s advanced project settings. Stealth Mode crawls a website slowly using a large pool of IP addresses and user agents, which typically avoids many types of bot detection.
Frequently asked questions
Do I need to implement more than one solution?
Although one solution can be enough to unblock DeepCrawl, it is sometimes necessary to combine methods. For example, as well as whitelisting DeepCrawl’s IP, you might also need to change the project’s user agent.
Why can I still not crawl my website?
If you still can’t crawl your website after trying multiple solutions, we recommend reading the guide on how to fix failed website crawls for more tips on debugging failed crawls.
How do I find out if my website is using a CDN?
If you are unsure whether a website is using a CDN, read the guide on how to identify which CDN (if any) a website is using.
How can I identify DeepCrawl in my log files?
DeepCrawl will always identify itself by including ‘DeepCrawl’ within the user agent string.
e.g. Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) https://deepcrawl.com/bot
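One way to pick these requests out of an access log is sketched below. It assumes the log uses the common "combined" format, where the user agent is the last double-quoted field on each line. The function name `deepcrawl_requests` is hypothetical. The match is case-insensitive because the example string above contains the name in lowercase, as part of the bot URL.

```python
import re
from typing import Iterable, Iterator

# In the combined log format the user agent is the last double-quoted field.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def deepcrawl_requests(lines: Iterable[str]) -> Iterator[str]:
    """Yield log lines whose user agent field mentions DeepCrawl."""
    for line in lines:
        match = _LAST_QUOTED_FIELD.search(line)
        # Compare in lowercase so 'DeepCrawl' and 'deepcrawl.com/bot' both match.
        if match and "deepcrawl" in match.group(1).lower():
            yield line


if __name__ == "__main__":
    import sys

    # Usage (hypothetical log path): python find_deepcrawl.py < access.log
    for hit in deepcrawl_requests(sys.stdin):
        print(hit, end="")
```

Only the user agent field is checked, so a request for a URL that merely contains "deepcrawl" in its path is not reported.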
Any further questions?
If your crawls are still getting blocked, even after implementing the solutions suggested above, then please don’t hesitate to get in touch.