Running frequent and targeted crawls of your website is a key part of improving it’s technical health and improving rankings in organic search. In this guide, you’ll learn how to a crawl a website efficiently and effectively with DeepCrawl. The six steps to crawling a website include:
- Configuring the URL sources
- Understanding the domain structure
- Running a test crawl
- Adding crawl restrictions
- Testing your changes
- Running your crawl
Step 1: Configuring the URL sources
There are six types of URL sources you can include in your DeepCrawl projects.
Including each one strategically, is the key to an efficient, and comprehensive crawl:
- Web crawl: Crawl only the site by following its links to deeper levels.
- Sitemaps: Crawl a set of sitemaps, and the URLs in those sitemaps. Links on these pages will not be followed or crawled.
- Analytics: Upload analytics source data, and crawl the URLs, to discover additional landing pages on your site which may not be linked. The analytics data will be available in various reports.
- Backlinks: Upload backlink source data, and crawl the URLs, to discover additional URLs with backlinks on your site. The backlink data will be available in various reports.
- URL lists: Crawl a fixed list of URLs. Links on these pages will not be followed or crawled.
- Log files: Upload log file summary data from log file analyser tools, such as Splunk and Logz.io.
Ideally, a website should be crawled in full (including every linked URL on the site). However, very large websites, or sites with many architectural problems, may not be able to be fully crawled immediately. It may be necessary to restrict the crawl to certain sections of the site, or limit specific URL patterns (we’ll cover how to do this below).
Step 2: Understanding the Domain Structure
Before starting a crawl, it’s a good idea to get a better understanding of your site’s domain structure:
Check the www/non-www and http/https configuration of the domain when you add the domain.
Identify whether the site is using sub-domains.
If you are not sure about sub-domains, check the DeepCrawl “Crawl Subdomains” option and they will automatically be discovered if they are linked.
Step 3: Running a Test Crawl
Start with a small “Web Crawl,” to look for signs that the site is uncrawlable.
Before starting the crawl, ensure that you have set the “Crawl Limit” to a low quantity. This will make your first checks more efficient, as you won’t have to wait very long to see the results.
Problems to watch for include:
- A high number of URLs returning error codes, such as 401 access denied
URLs returned that are not of the correct subdomain – check that the base domain is correct under “Project Settings”.
- Very low number of URLs found.
- A large number of failed URLs (502, 504, etc).
- A large number of canonicalized URLs.
- A large number of duplicate pages.
- A significant increase in the number of pages found at each level.
To save time, and check for obvious problems immediately, download the URLs during the crawl:
Step 4: Adding Crawl Restrictions
Next, reduce the size of the crawl by identifying anything that can be excluded. Adding restrictions ensures you are not wasting time (or credits) crawling URLs that are not important to you. All the following restrictions can be added within the “Advanced Settings” tab.
If you have excluded any parameters from search engine crawls with URL parameter tools like Google Search Console, enter these in the “Remove Parameters” field under “Advanced Settings.”
Add Custom Robots.txt Settings
DeepCrawl’s “Robots Overwrite” feature allows you to identify additional URLs that can be excluded using a custom robots.txt file – allowing you to test the impact of pushing a new file to a live environment.
Upload the alternative version of your robots file under “Advanced Settings” and select “Use Robots Override” when starting the crawl:
Filter URLs and URL Paths
Use the “Included/Excluded” URL fields under “Advanced Settings” to limit the crawl to specific areas of interest.
Add Crawl Limits for Groups of Pages
Use the “Page Grouping” feature, under “Advanced Settings,” to restrict the number of URLs crawled for groups of pages based on their URL patterns.
Here, you can add a name.
In the “Page URL Match” column you can add a regular expression.
Add a maximum number of URLs to crawl in the “Crawl Limit” column.
URLs matching the designated path are counted. When the limits have been reached, all further matching URLs go into the “Page Group Restrictions” report and are not crawled.
Step 5: Testing Your Changes
Run test “Web Crawls” to ensure your configuration is correct and you’re ready to run a full crawl.
Step 6: Running your Crawl
Ensure you’ve increased the “Crawl Limit” before running a more in-depth crawl.
Consider running a crawl with as many URL sources as possible, to supplement your linked URLs with XML Sitemap and Google Analytics, and other data.
If you have specified a subdomain of www within the “Base Domain” setting, subdomains such as blog or default, will not be crawled.
To include subdomains select “Crawl Subdomains” within the “Project Settings” tab.
Set “Scheduling” for your crawls and track your progress.
Settings for Specific Requirements
If you have a test/sandbox site you can run a “Comparison Crawl” by adding your test site domain and authentication details in “Advanced Settings.”
For more about the Test vs Live feature, check out our guide to Comparing a Test Website to a Live Website.
To crawl an AJAX-style website, with an escaped fragment solution, use the “URL Rewrite” function to modify all linked URLs to the escaped fragment format.
Read more about our testing features – Testing Development Changes Before Putting Them Live.
Changing Crawl Rate
Watch for performance issues caused by the crawler while running a crawl.
If you see connection errors, or multiple 502/503 type errors, you may need to reduce the crawl rate under “Advanced Settings.”
If you have a robust hosting solution, you may be able to crawl the site at a faster rate.
The crawl rate can be increased at times when the site load is reduced – 4 a.m. for example.
Head to “Advanced Settings” > “Crawl Rate” > “Add Rate Restriction.”
Analyze Outbound Links
Sites with a large quantity of external links, may want to ensure that users are not directed to dead links.
To check this, select “Crawl External Links” under “Project Settings,” adding an HTTP status code next to external links within your report.
Read more on outbound link audits to learn about analyzing and cleaning up external links.
Change User Agent
See your site through a variety of crawlers’ eyes (Facebook/Bingbot etc.) by changing the user agent in “Advanced Settings.”
Add a custom user agent to determine how your website responds.
After The Crawl
Reset your “Project Settings” after the crawl, so you can continue to crawl with ‘real-world’ settings applied.
Remember, the more you experiment and crawl, the closer you get to becoming an expert crawler.
Start your journey with DeepCrawl
If you’re interested in running a crawl with DeepCrawl, discover our range of flexible plans or if you want to find out more about our platform simply drop us a message and we’ll get back to you asap.