Whether your site has a million URLs or only a few thousand, you may not always want to crawl them all. There are several reasons that this could be the case. You might:
Want to save credits, or reduce the time spent crawling;
Be aware that your site may have many low-value URLs, which may not contribute towards better SEO. In this case, crawling the entire site might only reveal more of the same issues, making a full crawl redundant;
Or have a huge number of total site URLs, and wish to work with a smaller, more manageable segment of data.
You can use the ‘Crawl Limits’ options on step 3 of the crawl setup to limit the crawl to a certain number of URLs or to a certain depth of the site.
Crawling by Level
A web crawl works by following links to discover new pages at deeper levels of the site. The first level defaults to whichever URL you have specified as the Base Domain. However, you can change the starting point by adding a different URL (or URLs) into the ‘Start URLs’ section within the Advanced Settings.
Any links found on the first page (or pages) are followed and considered to be level 2. Any new pages discovered from these pages are then considered to be level 3, and so on.
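The level numbering described above can be illustrated with a short breadth-first traversal. This is only a sketch, not the tool's actual implementation: the `assign_levels` function, the in-memory `site` link map and the example.com URLs are all hypothetical, and a real crawler would fetch pages over HTTP rather than read links from a dictionary.

```python
from collections import deque


def assign_levels(links, start_urls):
    """Return {url: level}, treating the start URLs as level 1.

    `links` is a hypothetical in-memory map of page URL -> list of linked URLs.
    """
    levels = {url: 1 for url in start_urls}
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            # A page keeps the level at which it was first discovered.
            if target not in levels:
                levels[target] = levels[url] + 1
                queue.append(target)
    return levels


# Hypothetical site structure used only for illustration.
site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
}
levels = assign_levels(site, ["https://example.com/"])
for url, level in sorted(levels.items(), key=lambda item: item[1]):
    print(level, url)
```

In this sketch the start URL is level 1, the two pages it links to are level 2, and the page first discovered from those is level 3. Passing more than one start URL places all of them at level 1.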
How this impacts Crawl Restrictions
When using the ‘Crawl Limits’ section, you will find that there are seven standard options, from 10 right up to 3,000,000 URLs.
You can limit your crawl by total number of URLs. In this case, instead of crawling the site one full level at a time, the crawler stops once it has crawled the number of URLs that you have entered, even if that is partway through a level.
Any additional URLs on that level that were not crawled will be shown as 'Crawl Limit Restricted URLs' in the Uncrawled URLs section of the report dashboard.
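As a rough illustration of a URL-count limit, the sketch below stops a breadth-first crawl once a set number of URLs has been crawled and reports whatever was discovered but left uncrawled. The `crawl_with_limit` function, the `site` link map and the URLs are hypothetical, and treating every queued URL as "restricted" is an assumption made for the example, not a description of how the tool builds its report.

```python
from collections import deque


def crawl_with_limit(links, start_urls, max_urls):
    """Crawl breadth-first until `max_urls` pages have been crawled.

    Returns (crawled, restricted): the URLs crawled, in order, and the
    URLs that were discovered but not crawled because the limit was hit.
    """
    crawled = []
    seen = set(start_urls)
    queue = deque(start_urls)
    while queue and len(crawled) < max_urls:
        url = queue.popleft()
        crawled.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # Anything still queued was found but never crawled.
    restricted = list(queue)
    return crawled, restricted


# Hypothetical site: the home page links to four level 2 pages.
site = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
        "https://example.com/d",
    ],
}
crawled, restricted = crawl_with_limit(site, ["https://example.com/"], max_urls=3)
print("Crawled:", crawled)
print("Restricted:", restricted)
```

With a limit of 3, the sketch crawls the home page and two of the four level 2 pages; the remaining two level 2 pages end up in `restricted`, mirroring the partially crawled level described above.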
Using Custom Crawl Limits
It is also possible to use the ‘Custom’ setting in order to crawl a specific number of URLs.
For example, if you enter 252,222, the tool will crawl exactly that number of URLs, provided the site has at least that many.
Pause or Finalize
Within this section, you can also choose what happens when the limits are reached: be notified by email ‘If the limits were not enough’ to complete the crawl, or ‘Finish anyway’. If you choose to finish, the crawl will automatically complete as normal.