What does DeepCrawl do?
DeepCrawl is a cloud-based web crawler that you control.
You can set it to crawl your website, staging environment, external sites, analytics data, backlinks, sitemaps and URL lists, with a host of flexible crawl types.
DeepCrawl helps you analyze your website architecture and understand and monitor technical issues, to improve your SEO performance.
You can use DeepCrawl for:
- Technical Auditing
- Site Redevelopment/Migrations
- Website Change Management
- Link Auditing
- Competitor Intelligence
- Landing Page Analysis
- Website Architecture Optimization
- Website Development Testing
- Competitor Analysis
How is DeepCrawl different from other services?
DeepCrawl has been designed by experienced SEOs and used extensively in the field to solve real problems.
The level of detail available is more extensive than most other crawlers and the data is presented in a more digestible, actionable format.
Because DeepCrawl runs as a cloud-based service, it can run much larger crawls than software-based crawlers that run on your local computer. Crawls are also unaffected by the power of your local machine or by other processes running on it.
There is a very high level of customization and control available for more experienced users, allowing crawls to be tailored to suit a specific project.
How secure is my data?
All data is stored using Amazon Web Services, which has been architected to be one of the most secure cloud computing environments available. The crawl data is stored in a database on EC2 servers until the crawl is archived or deleted. The report data and backups are archived in S3. We use a VPN and security groups to prevent unauthorized access to the data.
Can I run multiple crawls at the same time?
Yes, you can run up to 10 crawls simultaneously.
Will DeepCrawl activity affect the stats in my analytics package?
Some older analytics packages use log file data stored on the web server. This data can be affected by any crawling activity, including Google or Bing and therefore DeepCrawl too.
Do I need to add tracking tags or authenticate a site before I can use DeepCrawl?
No, you don’t need to do anything under normal circumstances to crawl a public website. No additional tracking tags or authentication processes are normally required.
It’s important to let your IT manager, or the person responsible for hosting your website, know about the crawl in advance so that it isn’t blocked.
If you want to crawl a private website on your own network e.g. a test website or staging environment, then you will need to allow access to DeepCrawl by identifying the user agent or IP address and allowing access to your network, and/or any basic authentication or DNS configurations needed.
Pricing & Payments
What are Active Projects?
Active projects are those which have had a crawl run in the current billing period. If you have hit your limit, you will only be able to run crawls on the projects which are active, until the next billing period.
The number of inactive projects you can have in your account is unlimited, which means you don’t have to worry about deleting anything.
Is there a limit on the number of websites I can crawl?
We do not limit the number of different domains you can crawl, but we do have a limit on the number of ‘Active’ projects in your account, depending on your package.
Active projects are those which have had a crawl run in the current billing period.
Can I access my reports if I cancel?
Your data will be available until the account expires. To continue using your data, please export it before your account expires.
How do I change my credit or debit card details?
You can change your payment details on the Subscription page (app.deepcrawl.com/subscription/) by clicking “Payment Information” and then “Update Payment method”.
How do I cancel my monthly plan?
To cancel your DeepCrawl subscription, go to the subscription page in the app (https://app.deepcrawl.com/subscription), click on settings (the cog at the top right) and then select the “Remove Subscription” option in the dropdown. Finally, follow the onscreen instructions to cancel. Once cancelled, your Subscription status will change to “Cancelling” and you will still have access to your account until your Subscription expires.
How do I reactivate an old account?
Simply log in to your account and go to your Subscription area. Then click on ‘Reactivate’, located next to your latest package icon.
How do I downgrade my monthly plan?
To downgrade your plan, visit the subscription page (app.deepcrawl.com/subscription), click “Show all plans” and then select the appropriate subscription level.
How do I upgrade my monthly plan?
To upgrade your plan, visit the Subscription page (app.deepcrawl.com/subscription) and click on either a higher off-the-shelf plan or select “Upgrade Now” under “Corporate” to get a Custom quote.
Where can I find my invoices?
All payment-related actions can be found under Subscription within the application. To find your invoices, click the ‘Payment Details & Invoices’ button on the subscription screen.
Can I pay via invoice?
Invoice payment is only available for Corporate plans.
Which currency can I pay in?
You can pay in US Dollars, Euros or British Pounds. Our Pricing page displays these currencies based on your location, so simply select the package you want and follow the steps to purchase. You will then be billed in the currency of your choice.
Is there a minimum contract term?
What are the limits for custom extractions?
Each project can have up to 30 separate custom extractions, with up to 20 matches and 64KB of data per extraction.
What is the maximum file size for URL lists, XML Sitemaps, analytics data or backlinks data uploads?
We accept file uploads of up to 100MB for your URL lists, XML Sitemaps, analytics data and backlinks data.
How many credits does the average Universal Crawl consume?
This depends entirely on the website being crawled, and whether any crawl limitations or restrictions have been applied in the Advanced Settings.
Can I pause a live crawl?
You can manually pause a crawl at any point during the ‘Crawling’ phase. This can then be resumed at a later time, but will automatically finalize after 72 hours.
A crawl will pause automatically under certain circumstances (i.e. the necessary options have been selected pre-crawl) if it reaches the set limit or runs out of credits before reaching the limit. In any case, the crawl will remain paused for 72 hours before finalizing automatically.
You can also alter the crawl speed, depth and URL limit of the crawl without needing to pause at all.
In the test site section, does the test site authentication work with IIS?
IIS can support basic authentication and normally works.
For setup steps, see Microsoft’s guide ‘Configure Basic Authentication (IIS 7)’.
Other types of password solution using cookies may be implemented, and these won’t work with DeepCrawl as we do not store cookies.
Yes, DeepCrawl detects and crawls CSS and JS files to check the HTTP status, and reports on broken or disallowed files.
You can change this setting in Advanced Settings > Scope > Crawl Restrictions.
Does DeepCrawl crawl and report on PDF documents for download on my site?
PDF documents are detected if they are linked internally and reported in a list.
If you implement the ‘Check Non-HTML File types’ setting in Advanced Settings, DeepCrawl will check the HTTP status of these links.
Does DeepCrawl detect image Alt tags on my site?
DeepCrawl currently looks at the alt text for linked images, which is displayed in the internal linking data reports.
It is possible to use custom extractions in the advanced settings, to identify empty alt tags on unlinked images.
Does DeepCrawl detect H1/H2 etc tags on my site?
DeepCrawl detects and extracts H1, H2 and H3 tags by default and creates reports on multiple H1s and missing H1s.
DeepCrawl does not detect H4, H5 etc. However, it can be done using DeepCrawl custom extraction. Check out our custom extraction guide to find out how to do it.
Can I get DeepCrawl to obey or ignore my robots.txt file when it crawls my site?
DeepCrawl will obey the robots.txt live on your site, based on the user agent you have selected for the crawl.
You can also use the DeepCrawl Robots Overwrite feature to ignore your current robots.txt file during a crawl, and use the alternative version you have specified.
If DeepCrawl is specifically disallowed in a robots.txt file then we will always respect this (a stealth crawl may allow you to run a successful crawl of the site in this case).
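To illustrate how robots.txt rules can differ by user agent, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URLs are hypothetical, and Python's parser is only an approximation: it matches on the product token (e.g. "Googlebot") and may not interpret every rule exactly as DeepCrawl or a search engine would.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given user agent token may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The same URL can be blocked for one user agent and open to another.
googlebot_private = is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/private/page")
bingbot_private = is_allowed(ROBOTS_TXT, "bingbot", "https://example.com/private/page")
bingbot_tmp = is_allowed(ROBOTS_TXT, "bingbot", "https://example.com/tmp/file")
```

In this sketch the Googlebot group blocks /private/, while any other user agent falls back to the wildcard group, which is why the user agent selected for a crawl can change which URLs are reachable.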
Will DeepCrawl slow down my site when it’s crawling?
Most sites never experience a site slow down whilst using DeepCrawl.
Sometimes sites can experience a slow down if their server capacity is not able to handle user demand or there is an increase in user demand with DeepCrawl running at the same time.
If this is the case, you can control the maximum speed of the crawler to prevent any site performance slow down. You can also optimize your crawl activity further, by increasing your crawl rate during known quiet periods e.g. 1am-5am.
This can all be set with the Crawl Rate restriction settings.
Can I set my crawl to run at certain times or automatically?
With DeepCrawl, you can set your crawl to run at certain times, at certain speeds (URLs per second) and even set up schedules for your crawls e.g. weekly, daily, constant (24 hours) & more.
This can all be set under phase 3 of crawl setup > Crawl Rate Restrictions.
For example, you may want to only run your crawls within a 1am – 5am time window.
To do this, you would restrict the crawl to 0 URLs per second from 5am until 1am. The crawl will then not request any URLs during that period and will only crawl during the 1am-5am slot, for example to avoid your peak traffic hours.
You would select ‘Add Restriction’ and then select 5am to 1am with a crawl rate of 0.
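The logic of this restriction can be sketched in a few lines of Python. This is not DeepCrawl's implementation: the `RateRestriction` class, the `crawl_rate` function and the whole-hour boundaries (start inclusive, end exclusive) are assumptions made for illustration. The default of 3 URLs per second is the default crawl speed mentioned elsewhere in this FAQ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateRestriction:
    """Hypothetical model of one crawl rate restriction."""
    start_hour: int          # inclusive, 0-23
    end_hour: int            # exclusive, 0-23; the window may wrap past midnight
    urls_per_second: float

    def covers(self, hour: int) -> bool:
        if self.start_hour <= self.end_hour:
            return self.start_hour <= hour < self.end_hour
        # Window wraps past midnight, e.g. 5am until 1am the next day.
        return hour >= self.start_hour or hour < self.end_hour


def crawl_rate(hour: int, restrictions: list, default_rate: float) -> float:
    """Return the rate of the first restriction covering the hour, else the default."""
    for restriction in restrictions:
        if restriction.covers(hour):
            return restriction.urls_per_second
    return default_rate


# One restriction: 0 URLs per second from 5am until 1am.
restrictions = [RateRestriction(start_hour=5, end_hour=1, urls_per_second=0)]

rate_at_3am = crawl_rate(3, restrictions, default_rate=3.0)    # inside the 1am-5am window
rate_at_noon = crawl_rate(12, restrictions, default_rate=3.0)  # restricted period
```

A single restriction that wraps past midnight is enough here, because every hour outside the 1am-5am window falls inside the 5am to 1am period.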
When I use a mobile bot to crawl my website, what changes from a normal Googlebot crawl?
How can I tell if DeepCrawl is crawling my site?
DeepCrawl will always identify itself by including ‘DeepCrawl’ within the user agent string.
See the comprehensive list of user agent strings further down this page.
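If you want to check your own server logs for DeepCrawl activity, a case-insensitive match on the user agent is usually enough. The sketch below assumes you have already parsed your access log into (IP, user agent) pairs. The `log_entries` data and the IP addresses are made up for illustration.

```python
def is_deepcrawl_request(user_agent: str) -> bool:
    """Return True if the user agent string mentions DeepCrawl (case-insensitive)."""
    return "deepcrawl" in user_agent.lower()


# Hypothetical (ip, user_agent) pairs standing in for parsed access log entries.
log_entries = [
    ("203.0.113.10", "Mozilla/5.0 (compatible; Googlebot/2.1; "
                     "+https://www.google.com/bot.html) https://deepcrawl.com/bot"),
    ("203.0.113.20", "Mozilla/5.0 (compatible; Googlebot/2.1; "
                     "+http://www.google.com/bot.html)"),
    ("203.0.113.30", "deepcrawl https://deepcrawl.com/bot"),
]

# Keep only the requests whose user agent identifies DeepCrawl.
deepcrawl_hits = [ip for ip, user_agent in log_entries if is_deepcrawl_request(user_agent)]
```

The second entry looks like Googlebot but carries no DeepCrawl marker, so it is not counted as DeepCrawl traffic.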
What IP address will DeepCrawl requests come from?
By default, requests from the DeepCrawl crawler come from the IP address 22.214.171.124
What user agent does DeepCrawl use to crawl?
DeepCrawl offers a wide range of user agents to use for a crawl including the most common search engines, desktop browsers and mobile devices.
You can also add your own custom user agents.
By default, we crawl as Googlebot, and can be identified by the following string:
Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html) https://deepcrawl.com/bot
Here’s a comprehensive list covering available User Agents and their full strings:
Applebot: [“Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1) https://deepcrawl.com/bot”]
Baidu: [“Mozilla/5.0 (compatible; Baiduspider/2.0; +https://www.baidu.com/search/spider.html) https://deepcrawl.com/bot”]
Bingbot: [“Mozilla/5.0 (compatible; bingbot/2.0; +https://www.bing.com/bingbot.htm) https://deepcrawl.com/bot”]
Bingbot Mobile: [“Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 BingPreview/1.0b https://deepcrawl.com/bot”]
Chrome: [“Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.16 Safari/534.24 https://deepcrawl.com/bot”]
Chrome Mobile: [“Mozilla/5.0 (Linux; Android 7.0; SM-G892A Build/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/60.0.3112.107 Mobile Safari/537.36 https://deepcrawl.com/bot”]
DeepCrawl: [“deepcrawl https://deepcrawl.com/bot”]
Facebook: [“facebookexternalhit/1.1 https://deepcrawl.com/bot”]
Firefox: [“Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 https://deepcrawl.com/bot”]
Google Web Preview: [“Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13 https://deepcrawl.com/bot”]
Googlebot: [“Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html) https://deepcrawl.com/bot”]
Googlebot (legacy): [“Mozilla/5.0 (compatible; Googlebot/2.1; https://deepcrawl.com/bot)”]
Googlebot Smartphone: [“Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +https://www.google.com/bot.html) https://deepcrawl.com/bot”]
Googlebot-Image: [“Googlebot-Image/1.0 https://deepcrawl.com/bot”]
Googlebot-Mobile Feature phone: [“SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +https://www.google.com/bot.html) https://deepcrawl.com/bot”]
Googlebot-News: [“Googlebot-News https://deepcrawl.com/bot”]
Googlebot-Video: [“Googlebot-Video/1.0 https://deepcrawl.com/bot”]
Internet Explorer 6: [“Mozilla/5.0 (Windows; U; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727) https://deepcrawl.com/bot”]
Internet Explorer 8: [“Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0) https://deepcrawl.com/bot”]
iPhone: [“Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3 https://deepcrawl.com/bot”]
iPhone X: [“Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1 https://deepcrawl.com/bot”]
Yandex: [“Mozilla/5.0 (compatible; YandexBot/3.0; +https://yandex.com/bots) https://deepcrawl.com/bot”]
How does DeepCrawl handle international character encoding?
DeepCrawl can correctly handle characters in any language, including the content length calculations. URLs with non-Latin characters will be displayed in an unencoded format in the interface and downloads.
Why does my 5xx report show 503 errors that aren’t visible when I check them manually?
This can sometimes be caused by the default crawl speed of 3 URLs per second being too fast for your site’s servers. Pages are then recorded by DeepCrawl as 503 errors even though they render correctly for the average user. In your Crawl Settings, you can reduce the max crawl speed to 1 URL per second. The crawl will take longer to complete, but 503 errors are less likely to appear in your 5xx reports.
What is the date range of the information shown in Google Analytics?
How is the Duplicate Body Content report different to the Duplicate Pages report?
The Duplicate Body content report shows URLs where DeepCrawl has looked at the text in the body of the page only, whereas results in the Duplicate Pages report are from DeepCrawl analysing the full page including HTML tags. The duplicate body content report can sometimes pick up pages which are similar but have different templates.
Why can I see pages in the Duplicate Pages report that aren’t duplicates?
Duplication is a subjective measure. We have tuned our algorithm to pick up very similar pages as well as identical pages because most people want to see these. Sometimes it picks up false positives which can be ignored. The duplication sensitivity settings can be adjusted in the Report Settings if you want to remove some of the similar pages.
How does DeepCrawl detect duplicate pages?
An exact duplicate page is the easiest to detect, but isn’t very useful, as it misses a lot of ‘similar’ pages.
The DeepCrawl algorithm is tuned to allow a small amount of variation. The algorithm finds pages that are almost identical. We ignore very small differences, because web pages often contain small pieces of dynamic content, such as dates.
We classify duplication within our algorithm as:
- Identical Title
- Close to identical Body Content
Duplicate Body Content
- Close to identical Body Content
- Identical Title
We report the most authoritative page (based on its DeepRank score) as a Primary Duplicate and list it under the Primary Pages section. The page(s) that DeepCrawl considers to be nearly identical (based on the above criteria) and hold less authority will be listed as the duplicates.
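The two criteria listed above (identical title, close to identical body content) can be illustrated with a short Python sketch. This is not DeepCrawl's algorithm: it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, and the 0.95 threshold, the `is_near_duplicate` function and the sample pages are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def is_near_duplicate(page_a: dict, page_b: dict, threshold: float = 0.95) -> bool:
    """Treat two pages as duplicates if titles match and bodies are almost identical."""
    if page_a["title"] != page_b["title"]:
        return False
    # ratio() is 1.0 for identical strings and falls towards 0.0 as they diverge.
    similarity = SequenceMatcher(None, page_a["body"], page_b["body"], autojunk=False).ratio()
    return similarity >= threshold


# Hypothetical pages: a and b differ only in a dynamic date.
page_a = {"title": "Spring Sale",
          "body": "Our spring sale is now on. Last updated: 2020-01-01. Free delivery on all orders."}
page_b = {"title": "Spring Sale",
          "body": "Our spring sale is now on. Last updated: 2020-01-02. Free delivery on all orders."}
page_c = {"title": "Spring Sale",
          "body": "Contact our support team by phone or email for help with returns."}

same_except_date = is_near_duplicate(page_a, page_b)
different_body = is_near_duplicate(page_a, page_c)
```

Allowing a similarity slightly below 1.0 is what lets small pieces of dynamic content, such as the date in this example, be ignored.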
In addition, the following page types get excluded from Duplicate reports:
Very occasionally there are false positives, but in the majority of cases the algorithm correctly identifies duplicate pages. DeepCrawl is constantly being fine tuned, so please let us know if you experience a false positive and send us an example to firstname.lastname@example.org.
If you’d like to find out more about how to identify and handle duplicate pages, read our blog post on how URL duplication could be harming your website and how to stop it.
What is DeepRank?
DeepRank is a measurement of internal link weight calculated in a similar way to Google’s basic PageRank algorithm. DeepCrawl stores every internal link and starts by giving each link the same value. It then iterates through all the found links a number of times, to calculate the DeepRank for each page, which is the sum of all link values pointing to the page. With each iteration the values move towards their final value.
It is a signal of authority, and can help to indicate the most important URLs in the current report, or within the entire crawl.
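As a rough illustration of this kind of iterative calculation, here is a simplified PageRank-style sketch in Python. It is not the DeepRank implementation: the damping factor, the iteration count, the equal starting value per page and the tiny link graph are all assumptions, and pages without outgoing links are not handled specially.

```python
def link_rank(links: list, iterations: int = 50, damping: float = 0.85) -> dict:
    """Simplified PageRank-style score over (source, target) internal links."""
    pages = sorted({page for link in links for page in link})
    rank = {page: 1.0 / len(pages) for page in pages}

    # Each page splits its current value equally across its outgoing links.
    out_degree = {page: 0 for page in pages}
    for source, _ in links:
        out_degree[source] += 1

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for source, target in links:
            new_rank[target] += damping * rank[source] / out_degree[source]
        rank = new_rank  # values move closer to their final value each pass
    return rank


# Hypothetical internal links: (linking page, linked page).
links = [
    ("home", "a"), ("home", "b"),
    ("a", "home"), ("a", "b"),
    ("b", "home"),
]
rank = link_rank(links)
```

In this small graph "home" receives link value from both other pages and ends up with the highest score, followed by "b" and then "a".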
How are issues and changes prioritized?
Every report is assigned a weight, to represent the importance of the issue and its potential impact. Reports are also given a sign, either positive, negative, or neutral. The list of issues is filtered to negative reports, and ordered by the number of items in the report, multiplied by the weight. This is why the issues are rarely displayed in numerical order. The changes are ordered by the number of added or removed issues in the report, multiplied by their weight.
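The ordering described above can be sketched as follows. The report names, weights and counts are invented for illustration, and the `Report` class is a hypothetical model rather than DeepCrawl's own data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    name: str
    sign: int      # -1 negative, 0 neutral, +1 positive
    weight: float  # importance of the issue
    items: int     # number of URLs in the report


def prioritized_issues(reports: list) -> list:
    """Keep negative reports and order them by items multiplied by weight."""
    negative = [report for report in reports if report.sign < 0]
    return sorted(negative, key=lambda report: report.items * report.weight, reverse=True)


# Hypothetical reports with made-up weights.
reports = [
    Report("Missing H1 Tags", sign=-1, weight=2, items=200),   # score 400
    Report("Broken Pages", sign=-1, weight=10, items=50),      # score 500
    Report("Unique Pages", sign=+1, weight=5, items=1000),     # positive, filtered out
]

issue_order = [report.name for report in prioritized_issues(reports)]
```

Here the report with 50 items is listed above the report with 200 items because its weight is higher, which is the effect that makes the issue list look out of numerical order.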
How do you report changes in report contents?
In addition to calculating the URLs which are relevant to a report, we also calculate the changes in URLs between crawls. If a URL appears in a report and wasn’t in that report in the previous crawl, it will be included in the ‘Added report’. If the URL was included in the previous crawl, and is present in the current crawl, but is no longer in that specific report, then it is reported in the ‘Removed report’. If the URL was in the previous crawl, but is not included in any report in the current crawl, it is included in the ‘Missing report’ (e.g. the URL may have been unlinked since we last crawled, or may now fall outside of the scope of the crawl).
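These three change types map naturally onto set operations. The sketch below is one possible reading of the rules, not DeepCrawl's code: the function name and sample URLs are hypothetical, and "Missing" is interpreted per report as URLs that were in this report in the previous crawl and do not appear in the current crawl at all.

```python
def report_changes(previous_report: set, current_report: set, current_crawl: set) -> dict:
    """Classify URL changes for one report between two crawls."""
    no_longer_in_report = previous_report - current_report
    return {
        # In the report now, but not in the previous crawl's version of it.
        "added": current_report - previous_report,
        # Still crawled, but no longer in this specific report.
        "removed": no_longer_in_report & current_crawl,
        # Previously in the report, now absent from the crawl entirely.
        "missing": no_longer_in_report - current_crawl,
    }


# Hypothetical URL sets for a single report.
previous_report = {"/a", "/b", "/c"}
current_report = {"/a", "/d"}
current_crawl = {"/a", "/b", "/d", "/e"}

changes = report_changes(previous_report, current_report, current_crawl)
```

In this example "/d" is added, "/b" is removed (still crawled, but no longer in the report) and "/c" is missing (no longer found in the crawl).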
Do shared report links expire?
You can choose the expiration time-frame from a dropdown of options when sharing the report. These range from 24 hours to 6 months, with our default set as 1 month. Please bear in mind that the online reports are only available in the interface for the most recent crawl.
How long are the reports available?
At the moment, we keep crawl data archived for the lifespan of the client’s account.
What happens to my data when I cancel my account?
Upon account expiry, your account will fall dormant in case you wish to reactivate it at any time. Should you wish to have all of your data permanently deleted, you will need to request this specifically. Please speak to your Account Manager or the Customer Success team at email@example.com.
Does DeepCrawl back up reports and crawl data?
Crawl data, including all tables used to display reports, is backed up in Amazon S3 storage, which is Write Once Read Many, and is therefore highly reliable. All user and account data is backed up every hour.
Can I view reports before a crawl is finished?
No. Reports are not available until a crawl has been finalised. This is because the majority of the calculations DeepCrawl performs, such as duplication detection and internal linking analysis, require a complete set of page URLs before they can begin.
What is Automation Hub?
Engineering teams often release code without understanding the full impact it can have on the overall SEO performance of the website.
This creates risk for sudden losses of traffic, rankings and in turn overall revenue.
Deepcrawl’s SEO Automation Hub gives development teams the ability to test their code for SEO impact before pushing to production with a smart, automated and frictionless tool that allows for better collaboration between development & SEO/Marketing teams.
What CI/CD tools does Automation Hub integrate with?
The Automation Hub can connect to all major CI/CD tools.
We provide two ways of integrating: via the API or through a shell script. We provide full step-by-step instructions for both scenarios, as well as highly accurate API documentation.
How do I integrate with my CI/CD tool?
The Automation Hub can connect to all major CI/CD tools, either via the API or through a shell script, and instructions are provided within the setup process.
What are Test Suites?
A Test Suite is a group of tests you wish to run across a specific domain.
You can create multiple Test Suites for each domain should you wish to group them based on different outcomes, for example site speed, stability or missing pages.
How many Test Suites can I create?
There are no limits to the number of test suites you can create.
Can you copy Test Suites?
The Automation Hub allows you to easily duplicate Test Suites to save you valuable time.
This means you don’t have to re-enter configuration details for each staging environment. The duplicate will not copy across any URL lists and will automatically be set to a web crawl.
How many URLs does the Automation Hub crawl?
You can choose how many URLs to crawl up to a maximum of 10,000.
What do you mean by Fail or Warning?
The Automation Hub can run over 160 different tests on your domain.
Each test can be set with a severity level of Fail or Warning. A Fail will stop your build and a Warning will send a notification either via Email or Slack but allow the build to continue.
You can apply a threshold against each test and can choose between a percentage or an absolute number.
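The interaction between severity and thresholds can be sketched like this. It is an illustration only: the `Check` class, the test names and the rule that a threshold is breached when the value is strictly greater than it are assumptions, not the Automation Hub's documented behaviour.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    FAIL = "fail"        # a breached threshold stops the build
    WARNING = "warning"  # a breached threshold only triggers a notification


@dataclass(frozen=True)
class Check:
    """Hypothetical result of one test in a Test Suite."""
    name: str
    severity: Severity
    affected_urls: int
    threshold: float
    threshold_is_percentage: bool


def breached(check: Check, total_urls: int) -> bool:
    """Compare affected URLs to the threshold, as a percentage or an absolute number."""
    if check.threshold_is_percentage:
        if total_urls == 0:
            return False
        return 100.0 * check.affected_urls / total_urls > check.threshold
    return check.affected_urls > check.threshold


def evaluate(checks: list, total_urls: int) -> tuple:
    """Return (failures, warnings); the build should stop if failures is non-empty."""
    failures = [c.name for c in checks if c.severity is Severity.FAIL and breached(c, total_urls)]
    warnings = [c.name for c in checks if c.severity is Severity.WARNING and breached(c, total_urls)]
    return failures, warnings


checks = [
    Check("Broken Pages", Severity.FAIL, affected_urls=12, threshold=1.0, threshold_is_percentage=True),
    Check("Missing Titles", Severity.WARNING, affected_urls=30, threshold=25, threshold_is_percentage=False),
    Check("Thin Pages", Severity.FAIL, affected_urls=5, threshold=10, threshold_is_percentage=False),
]
failures, warnings = evaluate(checks, total_urls=1000)
build_continues = not failures
```

With 1,000 crawled URLs, 12 broken pages is 1.2%, which breaches the 1% Fail threshold and stops the build, while the Warning only produces a notification.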
How do notifications work?
Notifications can be sent at the end of each test via Email or Slack or you can integrate with your Jira backlog and create a customizable ticket for each failed test.
What authentication options are available?
As well as password authentication, you can use custom DNS and/or a whitelisted IP.
Can the Automation Hub crawl sites that are behind a firewall?
Yes, we provide a selection of authentication and whitelisting options, such as user/password authentication and custom DNS setup.
Do you have an API?
Yes, the DeepCrawl API is available for all users.
The API key and instructions are available in API Access.
Usage is under fair usage policy, but if you have very specific requirements, feel free to run them by us – firstname.lastname@example.org
You can find our current API documentation here:
What technology does DeepCrawl use?
The service is run entirely within the Amazon Web Services cloud computing platform.
Do you have whitelabel options?
Yes, the interface can be white-labeled with your own logo.