How to Crawl a Website with Deepcrawl

Sam Marsden
Sam Marsden

On 18th October 2021 • 9 min read

Running frequent and targeted crawls of your website is a key part of improving its technical health and improving rankings in organic search.

In this guide, you’ll learn how to crawl a website efficiently and effectively with Deepcrawl.

The six steps to crawling a website include:

1. Understanding the domain structure
2. Configuring the URL sources
3. Running a test crawl
4. Adding crawl restrictions
5. Testing your changes
6. Running your crawl
 

Step 1: Understanding the Domain Structure

Before starting a crawl, it’s a good idea to get a better understanding of your site’s domain structure:

Check the www/non-www and http/https configuration of the domain when you add the domain.

You can also enable JavaScript rendering to analyze the rendered version of the pages on your site.

Using Deepcrawl SEO platform to understand your website's domain structure

 
 

Step 2: Configuring the URL sources

There are seven types of URL sources you can include in your Deepcrawl projects. Consider running a crawl with as many URL sources as possible, to supplement your linked URLs with XML Sitemap and Google Analytics, and other data.

  1. Web crawl: Crawl only the site by following its links to deeper levels. 

If you are not sure about sub-domains, check the “Crawl Subdomains” option and they will automatically be discovered if they are linked.

You can also choose to crawl both HTTP and HTTPS pages, to get visibility over pages served over a non-secure protocol.

  1. Sitemaps: Crawl a set of sitemaps, and the URLs in those sitemaps. Links on these pages will not be followed or crawled.
  2. Backlinks: Upload backlink source data, and crawl the URLs, to discover additional URLs with backlinks on your site. The backlink data will be available in various reports.

If our Majestic Integration is part of your package, use it to automatically bring in data such as backlink count and backlink domain count.

  1. Google Search Console: Use our Google Search Console integration to enrich your reports with data such as search impressions, positions on a page, devices used etc. and to discover additional pages on your site which may not be linked. 
  2. Google Analytics: Similarly, you can use our Google Analytics integration or upload analytics source data, and crawl the URLs, to discover additional landing pages on your site which may not be linked. The analytics data will be available in various reports.
  3. Log Files: Upload log file summary data from log file analyzer tools, such as Splunk and Logz.io to get a view of how bots interact with your site. You also have the option to upload log file data manually.
  4. URL Lists: Crawl a fixed list of URLs. Links on these pages will not be followed or crawled.

 

Using the Deepcrawl SEO platform to configure URL sources for a crawl

 
 

Step 3: Running a Test Crawl

Start with a small “Web Crawl,” to look for signs that the site is uncrawlable.

Before starting the crawl, ensure that you have set the “Crawl Limit” to a low quantity. This will make your first checks more efficient, as you won’t have to wait very long to see the results.

Using Deepcrawl SEO platform to run a test crawl on your website

Problems to watch for include:

When you run the test crawl, also watch for performance issues caused by the crawler.

If you see connection errors, or multiple 502/503 type errors, you may need to reduce the crawl rate.
Using Deepcrawl technical SEO platform to test crawl speed
If you have a robust hosting solution, you may be able to crawl the site at a faster rate.

The crawl rate can be increased at times when the site load is reduced – 4 a.m. for example.

For more information about crawl troubleshooting, you can also refer to the article, How to Fix Your Failed Website Crawls.

 
 

Step 4: Adding Crawl Restrictions

One you’ve confirmed that your site is crawlable, you can launch your crawl. 

Ideally, a website should be crawled in full (including every linked URL on the site). However, very large websites, or sites with many architectural problems, may not be able to be fully crawled immediately. It may be necessary to restrict the crawl to certain sections of the site, or limit specific URL patterns. Such restrictions can be added within the “Advanced Settings” tab.

Filter URLs and URL Paths

Use the “Included/Excluded” URL fields to limit the crawl to specific areas of interest. 

Adding restrictions ensures you are not wasting time (or credits) crawling URLs that are not important to you.

How to add crawl restrictions in Deepcrawl

Add Crawl Limits for Groups of Pages

Use the “Page Grouping” feature, under “Advanced Settings,” to restrict the number of URLs crawled for groups of pages based on their URL patterns.

How to group pages in crawl restrictions in Deepcrawl advanced settings

Here, you can add a name.

In the “Page URL Match” column you can add a regular expression.

Add a maximum number of URLs to crawl in the “Crawl Limit” column.

URLs matching the designated path are counted. When the limits have been reached, all further matching URLs go into the “Page Group Restrictions” report and are not crawled.

Use Robots Overwrite Settings

Deepcrawl’s “Robots Overwrite” feature allows you to identify additional URLs that can be excluded using a custom robots.txt file – allowing you to test the impact of pushing a new file to a live environment.

Upload the alternative version of your robots file under “Advanced Settings” and select “Use Robots Overwrite” when starting the crawl: 

How to change robots overwrite settings in Deepcrawl SEO platform

Remove Parameters

If you have excluded any parameters from search engine crawls with URL parameter tools like Google Search Console, enter these in the “Remove Parameters” field under “Advanced Settings.”

How to remove URL parameters in Deepcrawl advanced settings
 

Step 5: Testing Your Changes

Run one more test crawl to ensure your configuration is correct and you’re ready to run a full crawl.

 
 

Step 6: Running your Crawl

Ensure you’ve increased the “Crawl Limit” before running a more in-depth crawl.

Set “Scheduling” for your crawls and track your progress.

How to run schedule crawls and compare results to previous crawls in Deepcrawl SEO platform

 
 

Handy Tips

Settings for Specific Requirements

If you have a test/sandbox site you can run a “Comparison Crawl” by adding your test site domain and authentication details in “Advanced Settings.”

For more about the Test vs Live feature, check out our guide to Comparing a Test Website to a Live Website.

To crawl an AJAX-style website, with an escaped fragment solution, use the “URL Rewrite” function to modify all linked URLs to the escaped fragment format.

Analyze Outbound Links

Sites with a large quantity of external links may want to ensure that users are not directed to dead links.

To check this, select “Crawl External Links” under “Advanced Settings,” this will add an HTTP status code next to external links within your report.

How to set URL restrictions in Deepcrawl for technical SEO crawls

Change User Agent

See your site through a variety of crawlers’ eyes (Googlebot Smartphone, Googlebot etc.) by changing the user agent in “Advanced Settings.”

Add a custom user agent to determine how your website responds.

How to change the user agent in Deepcrawl technical SEO crawl settings

 

Remember, the more you experiment and crawl, the closer you get to becoming an expert crawler.

 

 

Author

Sam Marsden
Sam Marsden

Sam Marsden is Deepcrawl's Former SEO & Content Manager. Sam speaks regularly at marketing conferences, like SMX and BrightonSEO, and is a contributor to industry publications such as Search Engine Journal and State of Digital.

Choose a better way to grow

With tools that will help you realize your website’s true potential, and support to help you get there, growing your enterprise business online has never been so simple.

Book a Demo