If a crawl is failing, there are a few indicators that it is being deliberately blocked by the web server:
- You are seeing a lot of URLs returning a 401 or 403 response code in your reports.
- The crawl succeeds initially but slows down progressively and eventually appears to stop completely, with the affected URLs showing up as connection timeout errors in the reports.
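
The two indicators above can be checked roughly against an exported list of results. The sketch below is only an illustration: it assumes the results are available as a list of status codes in crawl order, with `None` standing for a connection timeout. The function name and all thresholds are hypothetical and are not part of DeepCrawl's reports.

```python
# Status codes that usually mean the server refused the request.
BLOCK_STATUS_CODES = {401, 403}


def block_indicators(statuses, auth_threshold=0.5, tail_fraction=0.25,
                     timeout_threshold=0.8):
    """Return rough flags for the two blocking indicators.

    statuses: status codes in crawl order; None means a connection timeout.
    All thresholds are illustrative assumptions, not recommended values.
    """
    if not statuses:
        return {"auth_errors": False, "late_timeouts": False}

    # Indicator 1: a large share of URLs returned 401 or 403.
    auth_share = sum(1 for s in statuses if s in BLOCK_STATUS_CODES) / len(statuses)

    # Indicator 2: timeouts dominate the end of the crawl but not the start.
    tail_len = max(1, int(len(statuses) * tail_fraction))
    tail = statuses[-tail_len:]
    head = statuses[:-tail_len]
    tail_timeouts = sum(1 for s in tail if s is None) / len(tail)
    head_timeouts = sum(1 for s in head if s is None) / len(head) if head else 0.0

    return {
        "auth_errors": auth_share >= auth_threshold,
        "late_timeouts": (tail_timeouts >= timeout_threshold
                          and head_timeouts < timeout_threshold),
    }
```

For example, `block_indicators([200] * 6 + [None] * 2)` flags `late_timeouts`, because the crawl started normally and ended in timeouts.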
The reason for your crawl being blocked is most likely one of the following:
- DeepCrawl crawls from Amazon Web Services, which some servers block by default because it is also often used by scrapers.
- An automated system on the server detects and blocks suspicious activity.
- A manual block has been implemented by a server administrator, based on manual inspection of server activity, possibly triggered by a high load caused by the crawl, or a large number of crawl errors.
- The crawl used a Googlebot user agent and failed the server's reverse DNS lookup, so it appeared to be a scraper spoofing Googlebot.
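
To illustrate the reverse DNS check mentioned above, here is a minimal sketch of how a server might verify a request that claims to be Googlebot. It is not the implementation of any particular server or of DeepCrawl. The function name is hypothetical, the accepted domain suffixes are an assumption based on Google's published guidance, and the injectable lookup parameters exist only so the logic can be exercised without network access.

```python
import socket

# Assumption: hostnames of genuine Google crawlers end in one of these suffixes.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IPv4 address."""
    # Step 1: reverse lookup, IP address -> hostname.
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False

    # Step 2: the hostname must belong to a Google crawler domain.
    if not hostname.lower().rstrip(".").endswith(GOOGLE_CRAWLER_DOMAINS):
        return False

    # Step 3: forward lookup, hostname -> IPs, must include the original IP.
    try:
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in forward_ips
```

A request sent from an Amazon Web Services address with a Googlebot user agent would fail at step 2, because its hostname does not belong to a Google domain. That is why such a crawl can look like a spoofing scraper.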
We recommend trying the following solutions to blocked crawls:
- Providing you know the site, you can ask the server administrators to whitelist the IP that DeepCrawl uses to crawl: 184.108.40.206
- Some websites will block requests which come from a Googlebot user agent (DeepCrawl's default user agent) but do not originate from a Google IP address. In this scenario, selecting a different user agent often makes the crawl succeed.
- DeepCrawl's 'Stealth Mode' crawls a website slowly using a large pool of IP addresses and user agents. This typically avoids many types of bot detection.
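
Before changing the user agent in the crawl settings, it can help to confirm that the user agent is what triggers the block. The sketch below is a rough diagnostic, not a DeepCrawl feature: it requests the same URL with a Googlebot user agent and with a generic one, then compares the status codes. `GENERIC_UA` and the function names are hypothetical, and the check should only be run against a site you are permitted to crawl.

```python
import urllib.error
import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
GENERIC_UA = "Mozilla/5.0 (compatible; ExampleCrawler/1.0)"  # hypothetical user agent


def build_request(url, user_agent):
    """Create a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url, user_agent, timeout=10):
    """Return the HTTP status code, or None on a connection error or timeout."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent),
                                    timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:  # 4xx and 5xx responses
        return err.code
    except OSError:  # URLError, timeouts, refused connections
        return None


def user_agent_is_blocked(googlebot_status, other_status):
    """True when only the Googlebot user agent is refused with 401 or 403."""
    return (googlebot_status in (401, 403)
            and other_status not in (401, 403, None))
```

For example, `user_agent_is_blocked(status_for(url, GOOGLEBOT_UA), status_for(url, GENERIC_UA))` returning `True` suggests that selecting a different user agent is worth trying. If both requests are refused, the block is more likely based on IP address.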