How to Debug Blocked Crawls in DeepCrawl

Adam Gent

On 17th September 2019 • 5 min read

The rapid rise in malicious bots crawling the web has caused hosting companies, content delivery networks (CDNs) and server admins to block bots they do not recognise in log file data. Unfortunately, this means that DeepCrawl can be accidentally blocked by a client’s website.

In this guide, we will help you identify indicators that a crawl is being blocked and provide solutions to unblock DeepCrawl and allow you to crawl your site.


How to identify that a crawl is being blocked

The most common indicators of a crawl being blocked are listed below.

Unauthorised and Forbidden crawl errors

If a server is blocking DeepCrawl from accessing a website, then reports will display a large number of URLs returning HTTP 401 (Unauthorised) and 403 (Forbidden) responses.

[Screenshot: unauthorised and forbidden crawl errors]

To find URLs with these errors, navigate to the Unauthorised Pages report.
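You can also verify this kind of block outside of DeepCrawl by requesting a page with different user-agents. Below is a minimal sketch using Python’s requests library; the URL is a placeholder, and the Googlebot string mirrors DeepCrawl’s default user-agent.

```python
import requests

# Placeholder URL -- replace with a page on the site you are trying to crawl.
URL = "https://www.example.com/"

USER_AGENTS = {
    "browser": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    # DeepCrawl's default user-agent imitates Googlebot:
    "Googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

for label, ua in USER_AGENTS.items():
    response = requests.get(URL, headers={"User-Agent": ua}, timeout=10)
    print(f"{label}: HTTP {response.status_code}")

# HTTP 200 for the browser user-agent but 401/403 for the Googlebot
# user-agent suggests the server is filtering requests by user-agent.
```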

Too Many Requests

If DeepCrawl is exceeding the number of requests your server can handle, then reports will display a large number of URLs returning HTTP 429 (Too Many Requests) responses.

[Screenshot: too many requests error]

To find URLs with these errors, navigate to the Uncategorised HTTP Response Codes report.
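To confirm a rate limit independently, you can check for the 429 status and the standard Retry-After header directly. A minimal sketch, again using Python’s requests library with a placeholder URL:

```python
import requests

URL = "https://www.example.com/"  # placeholder

response = requests.get(URL, timeout=10)
if response.status_code == 429:
    # Rate-limiting servers often send a standard Retry-After header
    # indicating how many seconds to wait before the next request.
    retry_after = response.headers.get("Retry-After")
    print(f"Rate limited; Retry-After: {retry_after or 'not provided'}")
else:
    print(f"HTTP {response.status_code}")
```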

Slow crawl and connection timeout errors

A blocked crawl can be successful initially, but then slow down progressively and eventually appear to stop running completely.

Any URLs that were crawled will appear in DeepCrawl reports with connection timeout errors.
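Connection timeouts can be reproduced outside of DeepCrawl too. The sketch below, using Python’s requests library with placeholder values, separates connect timeouts (the server silently drops the connection) from read timeouts (the server accepts but never responds):

```python
import requests

URL = "https://www.example.com/"  # placeholder

try:
    # (connect timeout, read timeout) in seconds
    response = requests.get(URL, timeout=(5, 15))
    print(f"HTTP {response.status_code} in {response.elapsed.total_seconds():.1f}s")
except requests.exceptions.ConnectTimeout:
    print("Connection timed out -- the server may be silently dropping requests.")
except requests.exceptions.ReadTimeout:
    print("Read timed out -- the server accepted the connection but never responded.")
```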


Why does DeepCrawl get blocked?

Your crawl is most likely being blocked for one of the following reasons:

- The hosting company or CDN blocks bots it does not recognise in log file data.
- The server is rate-limiting requests and DeepCrawl is exceeding that limit.
- The server rejects requests that send a Googlebot user-agent but do not originate from a Google IP address.
- The website uses JavaScript-based bot detection to block crawlers that do not execute the page’s scripts.


How can I remove the block and crawl my website?

Our team recommends the following solutions if you suspect that DeepCrawl is being blocked.

Whitelist our IP

Provided you know who manages the site, you can ask the server administrators to whitelist the default IP address that DeepCrawl uses to crawl: 52.5.118.182.

[Screenshot: DeepCrawl IP settings]
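Once the IP has been whitelisted, server admins can confirm the change in the access logs. A rough sketch, assuming a combined-format access log at a hypothetical path:

```python
from collections import Counter

DEEPCRAWL_IP = "52.5.118.182"
LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path -- adjust for your server

status_counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        if line.startswith(DEEPCRAWL_IP):
            # In combined log format the status code follows the quoted
            # request line, e.g. ... "GET / HTTP/1.1" 403 ...
            parts = line.split('" ')
            if len(parts) > 1:
                status_counts[parts[1].split()[0]] += 1

# Mostly 200s means the whitelist is working; 401/403s mean it is not.
print(status_counts)
```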

Change user-agent in project settings

Some websites block requests that send a Googlebot user-agent (DeepCrawl’s default user-agent) but do not originate from a Google IP address. In this scenario, selecting a different user-agent in a crawl’s advanced project settings often allows the crawl to succeed.

[Screenshot: DeepCrawl user-agent settings]
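The check such servers typically run is a reverse-DNS lookup followed by a forward confirmation, which is the method Google itself recommends for verifying Googlebot. A sketch of that verification logic in Python, using only the standard library:

```python
import socket

def looks_like_real_googlebot(ip: str) -> bool:
    """The reverse-DNS check many servers use to verify a Googlebot claim."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP.
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

# DeepCrawl's default IP fails this check despite sending a Googlebot
# user-agent, which is why such servers reject its requests.
print(looks_like_real_googlebot("52.5.118.182"))  # False
```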

Stealth Mode

Use DeepCrawl's 'Stealth Mode' feature, which can be found in a crawl’s advanced project settings. Stealth mode crawls a website slowly using a large pool of IP addresses and user agents. This typically avoids many types of bot detection.

[Screenshot: DeepCrawl stealth mode]
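To illustrate the general idea (not DeepCrawl’s actual implementation), the sketch below rotates user-agents and paces requests irregularly; the user-agent strings and URLs are placeholders:

```python
import random
import time

import requests

# Placeholder user-agent pool; Stealth Mode also rotates across a large
# pool of IP addresses, which a single-machine sketch cannot reproduce.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

urls = ["https://www.example.com/", "https://www.example.com/about"]  # placeholders

for url in urls:
    ua = random.choice(USER_AGENTS)
    response = requests.get(url, headers={"User-Agent": ua}, timeout=10)
    print(url, response.status_code)
    # Slow, irregular pacing avoids rate-based bot detection.
    time.sleep(random.uniform(5, 15))
```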

Enable JavaScript rendering

Certain websites use JavaScript to block crawlers that do not execute the page’s scripts. This type of block can normally be circumvented by enabling our JavaScript Renderer.

[Screenshot: DeepCrawl JavaScript rendering]
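You can often spot this kind of block by inspecting the raw, unrendered HTML for a JavaScript challenge. The markers below are common heuristics rather than a definitive list, and the URL is a placeholder:

```python
import requests

URL = "https://www.example.com/"  # placeholder

html = requests.get(URL, timeout=10).text

# Common (heuristic) signs of a JavaScript challenge in unrendered HTML:
MARKERS = ["Please enable JavaScript", "challenge-form", "<noscript>"]

if any(marker in html for marker in MARKERS):
    print("Page appears to require JavaScript -- try enabling rendering.")
else:
    print("No obvious JavaScript challenge found in the raw HTML.")
```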


Frequently asked questions

Do I need to implement more than one solution?

Although one solution may be enough to unblock DeepCrawl, it is sometimes necessary to combine methods. For example, as well as whitelisting DeepCrawl’s IP, you may also need to change the project’s user-agent.

Why can I still not crawl my website?

If you still can’t crawl your website after trying multiple solutions, we recommend reading the how to fix failed website crawls guide for more tips on debugging failed crawls.

How do I find out if my website is using a CDN?

If you are unsure whether a website is using a CDN, read the following guide on how to identify which CDN (if any) a website is using.

How can I identify DeepCrawl in my log files?

DeepCrawl always identifies itself by including a reference to DeepCrawl within the user-agent string, e.g.:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) https://deepcrawl.com/bot
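For example, a few lines of Python can count the requests DeepCrawl made in an access log; the log path here is hypothetical:

```python
LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path -- adjust for your server

with open(LOG_PATH) as log:
    deepcrawl_hits = sum(1 for line in log if "deepcrawl" in line.lower())

print(f"{deepcrawl_hits} requests identified as DeepCrawl")
```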


Any further questions?

If your crawls are still being blocked, even after implementing the solutions suggested above, then please don’t hesitate to get in touch.

Author

Adam Gent

Search Engine Optimisation (SEO) professional with over 8 years’ experience in the search marketing industry. I have worked with a range of client campaigns over the years, from small and medium-sized enterprises to FTSE 100 global high-street brands.
