How to Crawl Subdomains in DeepCrawl

Adam Gent

On 2nd September 2019 • 8 min read

DeepCrawl allows users to crawl their entire website, including subdomains. However, URLs on subdomains are only crawled once the relevant advanced settings have been configured.

In this guide, we will take you through how to set up and configure our advanced settings to crawl either all or specific subdomains.
 

Select the Primary Domain

First select the primary domain in step 1 of the project settings.

Primary Domain Settings

By default, the domain (www or non-www) listed as the primary domain defines the domain scope and crawl start URL.
 

Check your domain mappings

Once you have added your primary domain you should use the “Check” domain mapping tool. The “Check” tool quickly checks the HTTP status codes for different variations of the URL in the primary domain.

DeepCrawl Project Settings Step 1

Essentially, this allows users to quickly see whether the URL they entered as the primary domain is the canonical version, or whether there are other duplicate versions of the URL which could also be crawled by DeepCrawl.

If the URL variations are live (200 status code) then we recommend investigating to see if they need to be crawled by DeepCrawl. If they need to be crawled, then follow the next steps.
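
As a rough illustration of what is meant by URL variations, the sketch below lists the HTTP/HTTPS and www/non-www forms of a primary domain. It is not how the “Check” tool is implemented; the function name `domain_variations` and the example.com host are assumptions made for this example.

```python
from urllib.parse import urlsplit


def domain_variations(primary_url):
    """Return the HTTP/HTTPS and www/non-www variations of a primary domain.

    Hypothetical helper for illustration only; it does not call DeepCrawl.
    """
    host = (urlsplit(primary_url).hostname or "").lower()
    # Strip a leading "www." so both forms can be rebuilt from the bare host.
    bare = host[4:] if host.startswith("www.") else host
    return [
        f"{scheme}://{name}/"
        for scheme in ("http", "https")
        for name in (bare, f"www.{bare}")
    ]


variations = domain_variations("https://www.example.com/")
for url in variations:
    print(url)
```

Each of these URLs could then be requested with any HTTP client to compare status codes, which is the kind of comparison the “Check” tool reports.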
 

Allowing DeepCrawl to crawl all subdomains

If you select the ‘Crawl subdomains’ and ‘Crawl both HTTP and HTTPS’ options within the Website crawl source in step 2, DeepCrawl will automatically include all URLs (both HTTP and HTTPS URLs) on any subdomains of your primary domain which are discovered during the crawl.

Website Crawl Source

For example, if this was selected for the primary domain https://www.deepcrawl.com/, then all HTTP and HTTPS URLs found during the crawl on any subdomain of deepcrawl.com could be crawled.

DeepCrawl would not crawl URLs on hostnames which are not under the primary domain.

The only issue with allowing all subdomains to be crawled is that you could end up using URL credits on pages which do not need to be crawled.
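
The scope rule described in this section can be pictured with a minimal sketch. It assumes the root domain is simply the primary host without a leading "www." (a production check would need the Public Suffix List), and the function names and example.com hosts are hypothetical rather than part of DeepCrawl.

```python
from urllib.parse import urlsplit


def root_domain(primary_url):
    """Assumption: the root domain is the primary host minus a leading 'www.'."""
    host = (urlsplit(primary_url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def in_subdomain_scope(url, primary_url):
    """True if url is on the root domain or on any subdomain of it.

    The scheme is ignored, mirroring a crawl of both HTTP and HTTPS URLs.
    """
    root = root_domain(primary_url)
    host = (urlsplit(url).hostname or "").lower()
    # Require a dot before the root so "notexample.com" is not matched.
    return host == root or host.endswith("." + root)


primary = "https://www.example.com/"
print(in_subdomain_scope("http://blog.example.com/post", primary))
print(in_subdomain_scope("https://notexample.com/", primary))
```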

If you want DeepCrawl to only crawl specific subdomain URLs (HTTP or HTTPS) then you will need to use the Secondary Domains feature in step 4 of the project settings.
 

Add secondary domains

To include specific subdomains in your crawl, add them to the 'Secondary Domains' section under Scope within Advanced Settings.

Secondary Domains

You can now enter as many domains or subdomains as you like. For example, your primary domain (https://www.example.com/) might include content which is located on the blog (https://blog.example.com) and documentation (https://docs.example.com) subdomains.

If these subdomains are linked to from the primary domain, then using the secondary domain feature allows our crawler to access the URLs on the subdomains. This makes sure that DeepCrawl can crawl and process unique metrics (e.g. DeepRank) on the entire website, not just the primary domain.
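
To make the difference from crawling all subdomains concrete, here is a small sketch of an allow-list check using the example hosts above. The constant and function names are hypothetical, and the logic is only an assumption about how a list of secondary domains narrows the scope.

```python
from urllib.parse import urlsplit

# Hosts taken from the example above; the variable names are illustrative.
PRIMARY_HOST = "www.example.com"
SECONDARY_HOSTS = {"blog.example.com", "docs.example.com"}


def is_in_crawl_scope(url):
    """True only for the primary host or an explicitly listed secondary host."""
    host = (urlsplit(url).hostname or "").lower()
    return host == PRIMARY_HOST or host in SECONDARY_HOSTS


print(is_in_crawl_scope("https://blog.example.com/post"))
print(is_in_crawl_scope("https://shop.example.com/"))
```

Unlike a rule that admits every subdomain, an unlisted host such as shop.example.com stays out of scope here, so no URL credits would be spent on it.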
 

Restricting crawls on subdomains

To restrict which URLs on subdomains DeepCrawl crawls, you can also use the 'Included Only URLs' and 'Excluded URLs' filters in Advanced Settings (e.g. to exclude parameter URLs on a marketing subdomain).
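
The example of excluding parameter URLs on a marketing subdomain could look something like the sketch below. The regular expressions, the marketing.example.com host, and the rule that an exclusion takes priority over an inclusion are all assumptions for illustration; check the Restricting a Crawl guide for the pattern syntax DeepCrawl actually accepts.

```python
import re

# Hypothetical pattern: any URL with a query string on the marketing subdomain.
EXCLUDED_PATTERNS = [re.compile(r"^https?://marketing\.example\.com/[^?]*\?")]
# An empty list here means "no positive restriction".
INCLUDED_ONLY_PATTERNS = []


def passes_filters(url, included=INCLUDED_ONLY_PATTERNS, excluded=EXCLUDED_PATTERNS):
    """Apply exclusions first (assumed precedence), then any inclusion rules."""
    if any(pattern.search(url) for pattern in excluded):
        return False
    if included:
        return any(pattern.search(url) for pattern in included)
    return True


print(passes_filters("https://marketing.example.com/landing?utm_source=mail"))
print(passes_filters("https://marketing.example.com/landing"))
```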

DeepCrawl Exclude URL Feature

For further information on positive and negative URL restrictions in your crawl set up, check out our Restricting a Crawl guide.
 

Frequently Asked Questions

Do I need to add a separate m. subdomain to the secondary domains feature?

If you want to see if your mobile and desktop websites are technically excellent, we recommend using the Mobile Site feature in the advanced project settings.

For more information on how to use this feature please read our guide on how to crawl a mobile website with DeepCrawl.

How do I find a list of subdomains which were discovered in a crawl?

A list of subdomains can be found in the “Linked Domains” report.

DeepCrawl Table Report

Note: Remember to click the “Change Layout” option to alter the format into a table to view the list of domains.

This will show you all the external domains and subdomains which were discovered to have links pointing to them on crawled URLs.

Use this report to view which subdomains should be added to the crawl to get a better picture of the entire root domain.

How do I find a list of crawled URLs from subdomains in DeepCrawl?

The easiest way to see URLs from subdomains which were crawled is to use a filter.

Go to the ‘All Pages’ report and use the filter to exclude all URLs on the primary domain:

DeepCrawl URL Filter

This can be done for both www and non-www primary domains. This filter will show you all subdomain pages found in the crawl.
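
If you export the crawled URLs, the same idea can be reproduced outside the report. The sketch below is an assumption about how such a filter could work on an exported list; the function name, the example URLs and the example.com hosts are hypothetical.

```python
from urllib.parse import urlsplit


def subdomain_urls(crawled_urls, primary_hosts=("example.com", "www.example.com")):
    """Keep only URLs whose host is not one of the primary-domain hosts."""
    return [
        url
        for url in crawled_urls
        if (urlsplit(url).hostname or "").lower() not in primary_hosts
    ]


crawled = [
    "https://www.example.com/",
    "https://blog.example.com/post",
    "http://docs.example.com/guide",
    "https://example.com/about",
]
for url in subdomain_urls(crawled):
    print(url)
```

Listing both the www and non-www hosts as primary hosts covers either form of primary domain.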

How do I crawl images if they are on a subdomain?

Enter the specific image subdomain into the Secondary Domain feature in step 4 or tick the “Crawl sub-domains” checkbox (this will allow all subdomains to be crawled under the primary domain).

DeepCrawl Secondary Domain

Also, make sure that the image resources are ticked under the “Resource Inclusions” settings.

DeepCrawl Resource Inclusions

Will links found on the subdomains be counted as internal?

Yes. Links to any subdomain which is included in the Secondary Domains or crawled as part of the primary domain will count as internal links.

Are subdomains counted in the DeepRank score?

Yes. Any internal links pointing towards an internal HTML page will be counted in the DeepRank score of a page.

Will subdomains count towards the number of credits I can use?

Yes. When DeepCrawl fetches any URL resource (HTML, CSS, JavaScript, etc.) it uses a URL credit.
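
To get a feel for how credits are spread across subdomains, a list of fetched URLs can be tallied per host. This is a minimal sketch that assumes one credit per fetched URL, as stated above; the function name and the example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def credits_by_host(fetched_urls):
    """Count fetched URLs per host, assuming one URL credit per fetch."""
    return Counter((urlsplit(url).hostname or "").lower() for url in fetched_urls)


fetched = [
    "https://www.example.com/",
    "https://www.example.com/style.css",
    "https://blog.example.com/post",
    "https://blog.example.com/app.js",
    "https://blog.example.com/post-2",
]
for host, credits in credits_by_host(fetched).most_common():
    print(host, credits)
```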
 

Any questions about crawling subdomains?

If you have any further questions about crawling subdomains with DeepCrawl, don’t hesitate to get in touch.

Author

Adam Gent

Search Engine Optimisation (SEO) professional with over 8 years’ experience in the search marketing industry. I have worked with a range of client campaigns over the years, from small and medium-sized enterprises to FTSE 100 global high-street brands.
