
Sometimes, when running a crawl on a site (or a section of a site), you may find that it isn’t progressing past the first level of URLs. When this happens, only the base domain, or “start URLs”, are actually crawled.

This problem has several possible causes, each with its own fix. These are listed along with their basic solutions below:

SYMPTOM                                        ACTION
1 disallowed URL returned                      Check the robots.txt file
1 indexable URL returning a 200 status code    Check with Fetch as DeepCrawl (JavaScript or iFrame links; first URL has no links)
1 URL with a connection error                  User agent or IP address blocked
0 URLs returned                                Check the Project Settings for an incorrect base domain
1 indexable URL returning a 200 status code    Check the Advanced Settings: Included URLs restriction

The Site Itself

Robots.txt File

Most often, problems originate in the robots.txt file. The contents of this file may result in the site itself blocking crawlers from accessing it.

Most crawlers aim to follow the same rules as the search engine spiders, and this includes adhering to the directions in the robots.txt file.

Therefore, if your site’s robots.txt file is set up as follows, the crawler will not progress beyond the first page:
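For example, a robots.txt file along these lines (an illustrative example) blocks all compliant crawlers from every URL on the site:

```
User-agent: *
Disallow: /
```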
This often happens in the case of crawling a staging environment before it goes live.

The crawl will return 1 URL, which is shown in the Disallowed URLs report.

Correcting the Issue with DeepCrawl

Using the Advanced Settings, at the bottom of Step 4 in the crawl setup, it is possible to overwrite the robots.txt file, and allow DeepCrawl access to blocked URLs.

By adding the following settings, DeepCrawl will follow the rules set out in this section rather than those in the live file:
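For instance, a custom robots.txt along these lines would permit the crawl (an illustrative overwrite; adjust the rules to suit your own site):

```
User-agent: *
Allow: /
```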

JavaScript or iFrame Links

When crawling pages on a website, the crawler will either begin the crawl from the base domain, or from the URLs entered in the Start URLs setting. If no links are encountered on the first page it reaches, then the crawl will not be able to continue.

If the content of the start URL is generated dynamically using JavaScript, then there may not be any discoverable links in the HTML for the crawler to follow. The same applies when the links on the start URL are inside an iFrame.
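The crawler’s perspective can be illustrated with a few lines of Python (a simplified sketch of static link extraction, not DeepCrawl’s actual implementation): parsing the raw HTML of a page whose navigation is rendered client-side yields no anchors to follow.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags found in static HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# A page whose navigation is injected by JavaScript: the static HTML
# the crawler fetches contains no <a> tags, so there is nothing to follow.
js_shell = '<html><body><div id="app"></div><script>renderNav()</script></body></html>'
print(extract_links(js_shell))       # -> []

# The same navigation as plain HTML links is discoverable.
static_page = '<html><body><a href="/products">Products</a></body></html>'
print(extract_links(static_page))    # -> ['/products']
```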

Correcting the Issue with DeepCrawl

If this is the case for all the starting URLs in the project, then the crawler will not be able to crawl the site.

However, if only the first URL is affected, it is possible to rectify the issue. Using the Start URLs section in the Advanced Settings, you can ensure that the crawl begins on a page whose links are neither inside an iFrame nor generated by JavaScript.

The crawler will then be able to discover and follow these links, and your crawl will complete successfully.

It is worth bearing in mind that, should the crawler encounter additional pages whose links are JavaScript-generated, it will not be able to continue past them. It is therefore still possible for the crawl to miss sections of a site.

First URL has no links

Sometimes, the issue is simply that there are no links on the first page. This can occur when the page places some kind of restriction on visitors, for example an age restriction on alcoholic beverages.

The crawl will likely return 1 indexable URL with a 200 status code, and it is also likely to be reported as a unique page.

Correcting the Issue with DeepCrawl

Once again, you can use the ‘Start URLs’ setting, and start the crawl from a section of the site that bypasses the restriction.

User Agent or IP Address Blocked

If your crawl has encountered only one URL, and that URL is returning a connection error, it may be that the crawler is being blocked. This is usually done in one of two ways: by blocking either the User Agent or the IP address.

For instance, your site may be set up to automatically block user agents that are not Google (or other search engines). Alternatively, the site may have implemented a firewall, calibrated to block the IP address from which the crawl originates.
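The effect of a user-agent block can be sketched in Python (hypothetical server logic, not any specific firewall product). Requests whose User-Agent does not match an allow list are rejected; here this is modelled as a 403, though a firewall may instead drop the connection entirely, producing the connection error seen in the crawl.

```python
# Hypothetical allow list: only recognised search-engine bots get through.
ALLOWED_AGENTS = ("Googlebot", "Bingbot")

def handle_request(user_agent):
    """Return an HTTP status code based on a User-Agent allow list."""
    if any(bot in user_agent for bot in ALLOWED_AGENTS):
        return 200
    return 403  # all other agents are rejected

print(handle_request("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # -> 200
print(handle_request("deepcrawl"))                                # -> 403
```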

Correcting the Issue with DeepCrawl

In the case of a block on the user agent, you can try selecting a different user agent from the available list, or use your own custom user agent.

If it is the IP address that is blocked, you can use the static IP address provided under Spider IP Settings, and allow access to that address.

Crawl Settings

It’s possible that the crawl setup is causing only 1 URL to be returned.

Incorrect Base Domain

When you enter the base domain into the Project Settings during crawl setup, DeepCrawl will automatically check the domain, and alert you if it is either redirecting or not responding.

Sometimes, you might choose not to crawl the version that the base domain is redirecting to. In this case, you may find that the crawl returns 1 URL, with either a 301 or 302 status code.

This is because DeepCrawl will only crawl URLs that match the base domain. The site may be on https instead of http, or a different TLD, or a different domain altogether.

Similarly, your site may be set to geographically redirect based on IP address, for example to a different ccTLD or subdomain, meaning that the site once again falls outside the scope of the crawl.

Whenever you enter a URL into the base domain section of Project Settings, DeepCrawl will only crawl URLs on that specific domain: it will not follow links on the redirected URL.

It may be that the content is actually hosted on a subdomain, or even a number of subdomains with only the homepage on the base domain. In this instance, those subdomains won’t be crawled unless specifically requested.
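The scope rule can be sketched with Python’s urllib.parse (a simplified model of the matching, assuming scheme and host must both agree unless subdomain crawling is enabled; not DeepCrawl’s actual implementation):

```python
from urllib.parse import urlparse

def in_scope(url, base="http://example.com", crawl_subdomains=False):
    """Check whether a discovered URL falls within the crawl's base domain."""
    b, u = urlparse(base), urlparse(url)
    if u.scheme != b.scheme:
        return False          # e.g. http base but https target
    if u.hostname == b.hostname:
        return True
    if crawl_subdomains and u.hostname and u.hostname.endswith("." + b.hostname):
        return True           # e.g. blog.example.com under example.com
    return False

print(in_scope("http://example.com/page"))           # -> True
print(in_scope("https://example.com/page"))          # -> False (redirect target on https)
print(in_scope("http://blog.example.com/post"))      # -> False
print(in_scope("http://blog.example.com/post", crawl_subdomains=True))  # -> True
```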

Correcting the Issue with DeepCrawl

In the Project Settings, you can check and confirm whether the base domain entered is correct. It’s important to ensure that:
  • The correct protocol is selected (either http or https)
  • The ‘www.’ is appended if necessary
  • The TLD is correct
If your content is on a subdomain, then you will also need to make sure that the ‘crawl subdomains’ box is checked, so that links to those pages can be followed.

Included URLs

If you have restricted your crawl to one or more specific sections of the site, it could be that these restrictions are preventing the crawl from running.

The ‘Included Only’ setting in DeepCrawl can be used to ensure that only certain sections of a site are crawled, and all other areas are ignored. As part of this, you will need to provide a ‘Start URL’ within that section, unless it is linked from the base domain.

However, if the links leading from the Start URL do not fall within the folder referenced in the ‘Included Only’ setting, they will be outside the scope of the crawl.
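The interaction between the Start URL and the ‘Included Only’ restriction can be sketched in Python (hypothetical URLs and a simplified prefix-matching model): links that lead out of the included folder are discarded, so the crawl can stall after the first page.

```python
def crawlable(url, included_prefixes):
    """A URL is in scope only if it starts with an included prefix."""
    return any(url.startswith(p) for p in included_prefixes)

# Hypothetical 'Included Only' rule restricting the crawl to /blog/.
included = ["http://example.com/blog/"]

# Links found on the Start URL that all point outside the included folder:
links_found = ["http://example.com/shop/item1", "http://example.com/about"]
print([u for u in links_found if crawlable(u, included)])       # -> []

# Links that stay within the included folder remain crawlable:
print(crawlable("http://example.com/blog/post-1", included))    # -> True
```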

Correcting the Issue with DeepCrawl

The best way to deal with this issue is to change the ‘Start URL’ to a page within the folder that does have internal links to URLs within that section.