
How to Fix Your Failed Website Crawls

 

The Site itself

Sometimes, when running a crawl on a site (or a section of a site), you may find that it does not progress past the first level of URLs. When this happens, only the base domain or the ‘Start URLs’ are actually crawled.

This problem has several possible causes. These are listed below, along with their basic solutions:

Symptom | Action
1 disallowed URL returned | Check the robots.txt file
1 indexable URL returning a 200 status code | Check with Fetch as Lumar (issue with JavaScript or iFrame links, or the first URL has no links)
1 URL with a connection error | User agent or IP address blocked
0 URLs returned | Check the Project Settings for an incorrect base domain
1 indexable URL returning a 200 status code | Check the Advanced Settings: Included URLs restriction

 

Robots.txt File

Most often, problems originate in the robots.txt file. The contents of this file may result in the site itself blocking crawlers from accessing its pages.

Most crawlers aim to follow the same rules as the search engine spiders, and this includes adhering to the directions in the robots.txt file.

Therefore, if your site’s robots.txt file contains the following directives, the crawl will not progress beyond the first page:

User-agent: *
Disallow: /

This often happens in the case of crawling a staging environment before it goes live.

You will see that the crawl returns 1 URL, which is shown in the Disallowed URLs report.
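If you want to reproduce this behaviour outside of Lumar, a quick check with Python’s standard-library robots.txt parser will show whether a given user agent may fetch your URLs. This is only a rough sketch: the domain below is a placeholder, so swap in your own site or staging environment.

    from urllib.robotparser import RobotFileParser

    # Placeholder domain for illustration; replace with your own site.
    robots_url = "https://www.example.com/robots.txt"

    parser = RobotFileParser(robots_url)
    parser.read()  # fetches and parses the live robots.txt file

    # With "User-agent: *" / "Disallow: /" in place, every URL beyond the
    # start URL is reported as disallowed, so the crawl stops at one page.
    for url in ["https://www.example.com/", "https://www.example.com/products/"]:
        allowed = parser.can_fetch("*", url)
        print(f"{url} -> {'allowed' if allowed else 'disallowed'}")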
 

Correcting the Issue with Lumar

Using the ‘Advanced Settings’ at the bottom of Step 4 in the crawl setup, it is possible to overwrite the robots.txt file and allow Lumar access to blocked URLs.

If you add the following settings, Lumar will follow the rules set out in this section rather than those in the live file (for example, ‘User-agent: *’ followed by ‘Allow: /’ opens up the whole site):

[Image: How to fix failed crawls in Lumar - Robots.txt file]
 

JavaScript or iFrame Links

When crawling pages on a website, the crawler will either begin the crawl from the base domain, or from the URLs entered in the Start URLs setting. If no links are encountered on the first page it reaches, then the crawl will not be able to continue.

If the content of the start URL is generated dynamically using JavaScript, there may not be any discoverable links in the HTML source that the crawler can follow. The same is true when the links on the start URL are within an iFrame.
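One way to check this yourself is to look at the raw HTML the server returns, before any JavaScript has run. The sketch below uses only Python’s standard library and a placeholder URL; if it finds no anchor links, the links are most likely injected by JavaScript or sit inside an iFrame.

    import urllib.request
    from html.parser import HTMLParser

    class LinkCounter(HTMLParser):
        """Collects href values from <a> tags in the raw, non-rendered HTML."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    # Placeholder start URL; swap in the page your crawl begins from.
    url = "https://www.example.com/"
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", errors="replace")

    counter = LinkCounter()
    counter.feed(html)

    # Zero links here suggests they are generated client-side or live in an
    # iFrame, so a crawler reading the static HTML cannot follow them.
    print(f"{len(counter.links)} links found in the static HTML of {url}")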
 

Correcting the Issue with Lumar

If this is the case for all of the starting URLs in the project, then the crawler will not be able to crawl the site.

However, if only the first URL is affected, it is possible to rectify the issue. Using the ‘Start URLs’ section in the Advanced Settings, you can ensure that the crawl begins on a page whose links are neither inside an iFrame nor generated by JavaScript, bypassing the restriction.

[Image: How to fix failed crawls in Lumar - JavaScript or iFrame links]

This means that the crawler will be able to discover and follow these links, and your crawl will complete successfully.

It is worth bearing in mind that, should the crawler encounter additional pages whose links are generated by JavaScript, it will not be able to continue past them. It is therefore still possible for the web crawl to miss sections of a site.
 

First URL has no links

Sometimes, the issue is simply that there are no links on the first page. This might occur when the page places some type of restriction on visitors, for example an age verification gate on an alcoholic beverage site:

[Image: Guinness website age verification page]

The crawl will likely return 1 indexable URL with a 200 status code, which is also likely to be reported as a unique page.
 

Correcting the Issue with Lumar

Once again, you can use the ‘Start URLs’ setting, and start the crawl from a section of the site that bypasses the restriction.

[Image: How to fix failed crawls in Lumar - JavaScript or iFrame links]
 

User Agent or IP Address Blocked

If your crawl has encountered only one URL, and that URL is returning a connection error, it may be that the crawler is being blocked. This is usually done in one of two ways: by blocking either the User Agent or the IP address.

For instance, your site may be set up to automatically block user agents that are not Google (or other search engines). Alternatively, the site may have implemented a firewall that is configured to block the IP address from which the crawl originates.
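A simple way to test for a user agent block is to request the same URL with two different User-Agent headers and compare the responses. The URL and user agent strings below are placeholders; a 403 or a dropped connection for one user agent but not the other points to filtering on the user agent, while a failure for both is more consistent with an IP-level block.

    import urllib.request
    import urllib.error

    # Placeholder URL and user agent strings, for illustration only.
    url = "https://www.example.com/"
    user_agents = {
        "Googlebot-like": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Generic crawler": "MyCrawler/1.0",
    }

    for label, ua in user_agents.items():
        request = urllib.request.Request(url, headers={"User-Agent": ua})
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                print(f"{label}: HTTP {response.status}")
        except urllib.error.HTTPError as err:
            # e.g. a 403 returned only for the non-Googlebot user agent
            print(f"{label}: HTTP {err.code}")
        except urllib.error.URLError as err:
            # A refused or dropped connection can indicate a firewall block.
            print(f"{label}: connection error ({err.reason})")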
 

Correcting the Issue with Lumar

In the case of the block occurring via the user agent, you can try selecting a different user agent from the available list, or use your own custom user agent, as shown below:

[Image: How to fix failed crawls in Lumar - User agent or IP address blocked]

If it is the IP address that is blocked, find the crawler’s IP address under ‘Crawler IP Settings’ in the ‘Advanced Settings’ and ask your network administrators to whitelist it.

[Image: How to fix failed crawls in Lumar - User agent or IP address blocked]
 

Crawl Settings

It’s possible that the crawl setup is causing only 1 URL to be returned.
 

Incorrect Base Domain

When you enter the base domain into the Project Settings during crawl setup, Lumar will automatically check the domain and alert you if it is either redirecting or not responding.

[Image: How to fix failed crawls in Lumar - Incorrect base domain]

Sometimes, you might choose not to crawl the version that the base domain is redirecting to. In this case, you may find that the crawl returns 1 URL, with either a 301 or 302 status code.

This is because Lumar will only crawl URLs that match the base domain. The site may be on https instead of http, on a different TLD, or on a different domain altogether.

Similarly, your site may be set up to redirect geographically based on IP address, for example to a different ccTLD or subdomain, meaning that the site once again falls outside the scope of the crawl.

Whenever you enter a URL into the base domain section of Project Settings, Lumar will only crawl URLs on that specific domain. It will not follow links on the redirected URL.

It may be that the content is actually hosted on a subdomain, or even a number of subdomains with only the homepage on the base domain. In this instance, those subdomains won’t be crawled unless specifically requested.
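If you are not sure where the base domain ends up, a quick redirect check will show the final URL that the crawl would be measured against. The domain below is a placeholder; compare the printed result with the base domain entered in your Project Settings.

    import urllib.request

    # Placeholder base domain; use the value from your Project Settings.
    base_domain = "http://example.com/"

    # urlopen follows redirects automatically, so geturl() returns the final URL.
    with urllib.request.urlopen(base_domain, timeout=10) as response:
        final_url = response.geturl()

    if final_url.rstrip("/") != base_domain.rstrip("/"):
        # A redirect to a different protocol, subdomain, or TLD means the
        # discovered URLs fall outside the scope of the configured base domain.
        print(f"{base_domain} redirects to {final_url}")
    else:
        print(f"{base_domain} responds without redirecting")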
 

Correcting the Issue with Lumar

In the Project Settings, you can check and confirm that the base domain entered is correct. It’s important to ensure that the correct protocol is selected (either http or https).

[Image: How to fix failed crawls in Lumar - Incorrect base domain]

If your content is on a subdomain, then you will also need to make sure that the ‘crawl subdomains’ box is checked, so that links to those pages can be followed.

[Image: How to fix failed crawls in Lumar - Incorrect base domain]
 

Included URLs

If you have restricted your crawl to one or more specific sections of the site, these restrictions may be what is preventing the crawl from running.

[Image: How to fix failed crawls - Included URLs]

The ‘Include Only URLs’ setting in Lumar can be used to ensure that only certain sections of a site are crawled, and all other areas are ignored. As part of this, you will need to provide a ‘Start URL’ within that section, unless it is linked to from the base domain.

[Image: How to fix failed crawls - Included URLs]

However, if the links leading from the Start URL do not fall within the folder referenced in the ‘Include Only URLs’ setting, they will be outside the scope of the crawl.
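As a quick sanity check before launching the crawl, you can compare the links found on your intended Start URL against the include rule. The sketch below assumes the rule is a simple path prefix and uses made-up links; Lumar’s own matching options may behave differently.

    # Hypothetical include rule and sample links, for illustration only.
    include_prefix = "/blog/"

    links_on_start_url = [
        "/blog/how-to-fix-failed-crawls",
        "/products/widget",
        "/about-us",
    ]

    # Links outside the included section are out of scope, so if nothing
    # survives this filter the crawl will stop after the Start URL.
    in_scope = [link for link in links_on_start_url if link.startswith(include_prefix)]
    print(f"{len(in_scope)} of {len(links_on_start_url)} links fall within {include_prefix}")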
 

Correcting the Issue with Lumar

The best way to deal with this issue is to change the ‘Start URL’ to a page inside the included folder that does have internal links to URLs within that section.

[Image: How to fix failed crawls - Included URLs]

Adam Gent

Product Manager & SEO Professional

Search Engine Optimization (SEO) professional with over 8 years’ experience in the search marketing industry. I have worked with a range of client campaigns over the years, from small and medium-sized enterprises to FTSE 100 global high-street brands.
