DeepCrawl is a cloud-based web crawler that you control.
You can set it to crawl your website, staging environment, external sites, analytics data, backlinks, sitemaps and URL lists, with a host of flexible crawl types.
DeepCrawl helps you analyze your website architecture and understand and monitor technical issues, to improve your SEO performance.
You can use DeepCrawl for:
- Technical Auditing
- Site Redevelopment/Migrations
- Website Change Management
- Link Auditing
- Competitor Intelligence
- Landing Page Analysis
- Website Architecture Optimization
- Website Development Testing
- Competitor Analysis
DeepCrawl has been designed by experienced SEOs and used extensively in the field to solve real problems.
The level of detail available is more extensive than most other crawlers and the data is presented in a more digestible, actionable format.
Because DeepCrawl runs as a cloud-based service, it can run much larger crawls than software-based crawlers that run on your local computer. Crawls are also unaffected by the power of your local machine or by other processes running on it.
There is a very high level of customization and control available for more experienced users, allowing crawls to be tailored to suit a specific project.
Some older analytics packages use log file data stored on the web server. This data can be affected by any crawling activity, including crawls by Google or Bing, and therefore by DeepCrawl too.
Yes, you can run up to 10 crawls simultaneously.
All data is stored using Amazon Web Services, which has been architected to be one of the most secure cloud computing environments available.
The crawl data is stored in a database on EC2 servers until the crawl is archived or deleted. The report data and backups are archived in S3.
We use a VPN and security groups to prevent unauthorized access to the data.
Pricing & Payments
Our plans are available on a monthly basis. You can also purchase a one-off add-on for your monthly plan in-platform should you run out of credits and need more to complete a project, as well as purchase recurring project and/or URL add-ons.
Our Starter and Consultant packages include no minimum contract term, so you can pay on a month-by-month basis. However, with our Corporate package there is a minimum commitment of 12 months.
The Add-on credits are valid for 1 month from the date of purchase.
You can pay in US Dollars, Euros or British Pounds. Select your preferred currency at the top right hand side of the pricing page, then select the package you want, click buy and follow the steps. You will then be billed in the currency of your choice.
Please contact us on firstname.lastname@example.org and specify the package that you are interested in purchasing.
All payment-related actions can be found under Subscription within the application. To find your invoices, click the ‘Payment Details & Invoices’ button on the subscription screen.
The simplest way to do this is to go to Subscription and click the ‘Upgrade’ button under the Credits section. If you have any trouble with this please contact your Account Manager or email our Customer Success team at email@example.com.
The simplest way to do this is to go to Subscription and click the ‘Downgrade’ button under the Credits section. If you have any trouble with this please contact your Account Manager or email our Customer Success team at firstname.lastname@example.org.
Simply log in to your account and go to your Subscription area. Then click on ‘Reactivate’, located next to your latest package icon.
Log into your account, go to Subscription and click ‘Buy Credits’ under the Credits section. These add-on credits will last for 30 days from the date of purchase.
If you pay via PayPal, the simplest way to cancel your existing subscription is via these instructions in your PayPal account.
If you pay via credit or debit card, most high street banks allow you to cancel direct debits via your online banking. Alternatively, call your bank or drop us an email at email@example.com and we’ll cancel it from our side.
Your remaining credit allocation will be available until the expiry date. We must warn you that you will only have access to the DeepCrawl interface until this date. If you wish to continue to use your data moving forward, please use the export functions before your account expires.
If you have any trouble with this please contact your Account Manager or email our Customer Success team at firstname.lastname@example.org.
If you are paying via PayPal, log in to your PayPal account, click on MyPayPal, select Wallet, and from there you can choose the card details you wish to amend.
If you are paying directly via credit or debit card, you can contact your bank and amend the card details for your direct debit or standing order from there. Alternatively, you can change your details from within the platform via your Subscription area, or email email@example.com and we’ll do it for you.
Your data will be available until the account expires. To continue using your data, please export it before your account expires.
We do not limit the number of different domains you can crawl, but we do have a limit on the number of ‘Active’ projects in your account, depending on your package.
Active projects are those which have had a crawl run in the current billing period. If you have hit your limit, you will only be able to run crawls on the projects which are active, until the next billing period.
If you need to increase this, you can purchase more Active Projects for your account with add-ons.
The number of inactive projects you can have in your account is unlimited, which means you don’t have to worry about deleting anything.
Yes, the DeepCrawl API is available for all users.
The API key and instructions are available under API Access.
Usage is subject to a fair usage policy, but if you have very specific requirements, feel free to run them by us at firstname.lastname@example.org.
You can find our current API documentation here:
Yes, the interface can be white-labeled with your own logo.
The service is run entirely within the Amazon Web Services cloud computing platform.
DeepCrawl offers a wide range of user agents to use for a crawl including the most common search engines, desktop browsers and mobile devices.
You can also add your own custom user agents.
By default, we crawl as Googlebot, and can be identified by the following string:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) https://deepcrawl.com/bot
Here’s a comprehensive list covering available User Agents and their full strings:
Applebot: [“Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1) https://deepcrawl.com/bot”]
Baidu: [“Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) https://deepcrawl.com/bot”]
Bingbot: [“Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) https://deepcrawl.com/bot”]
Bingbot Mobile: [“Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 BingPreview/1.0b https://deepcrawl.com/bot”]
Chrome: [“Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.16 Safari/534.24 https://deepcrawl.com/bot”]
Chrome Mobile: [“Mozilla/5.0 (Linux; Android 7.0; SM-G892A Build/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/60.0.3112.107 Mobile Safari/537.36 https://deepcrawl.com/bot”]
DeepCrawl: [“deepcrawl https://deepcrawl.com/bot”]
Facebook: [“facebookexternalhit/1.1 https://deepcrawl.com/bot”]
Firefox: [“Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 https://deepcrawl.com/bot”]
Google Web Preview: [“Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13 https://deepcrawl.com/bot”]
Googlebot: [“Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) https://deepcrawl.com/bot”]
Googlebot (legacy): [“Mozilla/5.0 (compatible; Googlebot/2.1; https://deepcrawl.com/bot)”]
Googlebot Smartphone: [“Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) https://deepcrawl.com/bot”]
Googlebot-Image: [“Googlebot-Image/1.0 https://deepcrawl.com/bot”]
Googlebot-Mobile Feature phone: [“SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html) https://deepcrawl.com/bot”]
Googlebot-News: [“Googlebot-News https://deepcrawl.com/bot”]
Googlebot-Video: [“Googlebot-Video/1.0 https://deepcrawl.com/bot”]
Internet Explorer 6: [“Mozilla/5.0 (Windows; U; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727) https://deepcrawl.com/bot”]
Internet Explorer 8: [“Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0) https://deepcrawl.com/bot”]
iPhone: [“Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3 https://deepcrawl.com/bot”]
iPhone X: [“Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1 https://deepcrawl.com/bot”]
Yandex: [“Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) https://deepcrawl.com/bot”]
By default, requests from the DeepCrawl crawler come from the IP address 184.108.40.206
If you need to whitelist us and want a private IP address, you may set up a custom proxy in your account settings.
DeepCrawl will always identify itself by including ‘DeepCrawl’ within the user agent string.
See above for a comprehensive list of user agent strings.
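If you want to recognize these requests in your own logs or server code, a case-insensitive substring check is enough. The following Python sketch is an illustration only; the function name is hypothetical, and it assumes you already have the raw User-Agent header as a string.

```python
def is_deepcrawl_request(user_agent: str) -> bool:
    """Return True if the user agent string mentions DeepCrawl.

    The listed strings end in 'https://deepcrawl.com/bot', so the
    comparison is done in lowercase.
    """
    return "deepcrawl" in user_agent.lower()


# The default string from the list above.
default_ua = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html) https://deepcrawl.com/bot"
)

# A genuine Googlebot string without the DeepCrawl suffix, for contrast.
plain_googlebot = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)

print(is_deepcrawl_request(default_ua))
print(is_deepcrawl_request(plain_googlebot))
```

The first call reports a match and the second does not, which is how you can tell a DeepCrawl crawl apart from the search engine it is imitating.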
With DeepCrawl, you can set your crawl to run at certain times, at certain speeds (URLs per second) and even set up schedules for your crawls e.g. weekly, daily, constant (24 hours) & more.
This can all be set under phase 3 of crawl setup > Crawl Rate Restrictions.
For example, you may want to only run your crawls within a 1am – 5am time window.
To do this, you would restrict the crawl to 0 URLs per second from 5am until 1am. No URLs are then crawled outside the 1am-5am slot, which can help you avoid your peak traffic hours.
You would select ‘Add Restriction’ and then select 5am to 1am with a crawl rate of 0.
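The logic of that restriction can be sketched in Python. This is not DeepCrawl's implementation; the `Restriction` class and `crawl_rate` function are hypothetical names, hours are whole numbers on a 24-hour clock, and the unrestricted rate of 3 URLs per second is used purely as an example value.

```python
from dataclasses import dataclass


@dataclass
class Restriction:
    start_hour: int  # inclusive, 0-23
    end_hour: int    # exclusive, 0-23; the window may wrap past midnight
    rate: float      # URLs per second allowed inside the window

    def covers(self, hour: int) -> bool:
        """Return True if the given hour falls inside this window."""
        if self.start_hour <= self.end_hour:
            return self.start_hour <= hour < self.end_hour
        # Window wraps past midnight, e.g. 5am until 1am the next day.
        return hour >= self.start_hour or hour < self.end_hour


def crawl_rate(hour: int, restrictions: list, default_rate: float) -> float:
    """Return the rate that applies at the given hour."""
    for restriction in restrictions:
        if restriction.covers(hour):
            return restriction.rate
    return default_rate


# One restriction: 0 URLs per second from 5am until 1am.
restrictions = [Restriction(start_hour=5, end_hour=1, rate=0)]

for hour in (0, 1, 3, 5, 12):
    print(hour, crawl_rate(hour, restrictions, default_rate=3.0))
```

Only hours 1 to 4 fall outside the restriction, so the crawl runs at the default rate between 1am and 5am and is stopped for the rest of the day.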
Most sites never experience a slowdown whilst using DeepCrawl.
A site can slow down if its server capacity cannot handle user demand, or if user demand increases while DeepCrawl is running.
If this is the case, you can control the maximum speed of the crawler to prevent any slowdown in site performance. You can also optimize your crawl activity further by increasing your crawl rate during known quiet periods, e.g. 1am-5am.
This can all be set with the Crawl Rate restriction settings.
DeepCrawl will obey the robots.txt live on your site, based on the user agent you have selected for the crawl.
You can also use the DeepCrawl Robots Overwrite feature to ignore your current robots.txt file during a crawl, and use the alternative version you have specified.
If DeepCrawl is specifically disallowed in a robots.txt file then we will always respect this (a stealth crawl may allow you to run a successful crawl of the site in this case).
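To preview how a robots.txt file treats a given user agent before you crawl, you can use Python's standard `urllib.robotparser` module. This sketch only illustrates the general robots.txt rules; the sample file, the `example.com` URLs and the `deepcrawl` user agent token are assumptions made for the example, not values confirmed by DeepCrawl.

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt: everyone is kept out of /private/,
# and a crawler identifying as "deepcrawl" is blocked entirely.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: deepcrawl
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

checks = [
    ("Googlebot", "https://example.com/page"),
    ("Googlebot", "https://example.com/private/report"),
    ("deepcrawl", "https://example.com/page"),
]

for agent, url in checks:
    print(agent, url, parser.can_fetch(agent, url))
```

Here Googlebot falls under the wildcard group, so only `/private/` is off limits to it, while the explicitly named crawler is disallowed everywhere.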
PDF documents are detected if they are linked internally and reported in a list.
If you enable the ‘Check Non-HTML File types’ setting in Advanced Settings, DeepCrawl will check the HTTP status of these links.
Yes, DeepCrawl detects and crawls CSS and JS files to check the HTTP status, and reports on broken or disallowed files.
You can change this setting in Advanced Settings > Scope > Crawl Restrictions.
IIS supports basic authentication, and this normally works with DeepCrawl.
Configure Basic Authentication (IIS 7)
Other types of password solution using cookies may be implemented, and these won’t work with DeepCrawl as we do not store cookies.
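For reference, HTTP basic authentication sends the credentials in an `Authorization` header on every request, so no cookies are involved. The Python sketch below shows how such a header is built; the function name and the credentials are placeholders, and this is not DeepCrawl's own code.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header used by HTTP basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials for illustration only.
headers = basic_auth_header("user", "pass")
print(headers)
```

Because the header is self-contained, a crawler that stores no state between requests can still access a site protected this way.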
You can manually pause a crawl at any point during the ‘Crawling’ phase. This can then be resumed at a later time, but will automatically finalize after 72 hours.
A crawl will pause automatically, provided the necessary options were selected before the crawl, if it reaches the set limit or runs out of credits before reaching the limit. In either case, the crawl will remain paused for 72 hours before finalizing automatically.
You can also alter the crawl speed, depth and URL limit of the crawl without needing to pause at all.
This depends entirely on the website being crawled, and whether any crawl limitations or restrictions have been applied in the Advanced Settings.
Each project can have up to 30 separate custom extractions, with up to 20 matches and 64KB of data per extraction.
No. Reports are not available until a crawl has been finalized.
This is because the majority of the calculations DeepCrawl performs, such as duplication detection and internal linking analysis, require a complete set of page URLs before they can begin.
Crawl data, including all tables used to display reports, is backed up in Amazon S3 storage, which is Write Once Read Many and therefore highly reliable. All user and account data is backed up every hour.
Upon account expiry, your account will fall dormant in case you wish to reactivate it at any time. Should you wish to have all of your data permanently deleted, you will need to request this specifically. Please speak to your Account Manager or the Customer Success team at email@example.com.
At the moment, we keep crawl data archived for the lifespan of the client’s account.
In addition to calculating the URLs which are relevant to a report, we also calculate the changes in URLs between crawls. If a URL appears in a report and wasn’t in that report in the previous crawl, it will be included in the ‘Added report’. If the URL was included in the previous crawl, and is present in the current crawl, but is no longer in that specific report, then it is reported in the ‘Removed report’. If the URL was in the previous crawl, but is not included in any report in the current crawl, it is included in the ‘Missing report’ (e.g. the URL may have been unlinked since we last crawled, or may now fall outside of the scope of the crawl).
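These three cases can be expressed as set operations. The Python sketch below is one reading of the description rather than DeepCrawl's actual code; the function name is hypothetical, and it assumes `curr_crawl` holds every URL that appears in any report of the current crawl.

```python
def compare_report(prev_report: set, curr_report: set, curr_crawl: set):
    """Split the URL changes for one report into added, removed and missing."""
    # In the report now, but not in it last time.
    added = curr_report - prev_report
    # Was in the report, still crawled, but no longer in this report.
    removed = (prev_report - curr_report) & curr_crawl
    # Was in the report, and no longer appears anywhere in the current crawl.
    missing = prev_report - curr_crawl
    return added, removed, missing


prev_report = {"/a", "/b", "/c"}
curr_report = {"/a", "/d"}
curr_crawl = {"/a", "/b", "/d", "/e"}

added, removed, missing = compare_report(prev_report, curr_report, curr_crawl)
print("Added:", added)
print("Removed:", removed)
print("Missing:", missing)
```

In this example `/d` is added, `/b` is removed because it was crawled again but dropped out of the report, and `/c` is missing because it was not found in the current crawl at all.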
Every report is assigned a weight, to represent the importance of the issue and its potential impact. Reports are also given a sign, either positive, negative, or neutral. The list of issues is filtered to negative reports, and ordered by the number of items in the report, multiplied by the weight. This is why the issues are rarely displayed in numerical order. The changes are ordered by the number of added or removed issues in the report, multiplied by their weight.
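The ordering rule can be illustrated with a short Python sketch. The report names, weights and counts below are invented for the example and do not reflect DeepCrawl's real values; the `Report` class and `prioritised_issues` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Report:
    name: str
    sign: int      # -1 negative, 0 neutral, +1 positive
    weight: float  # importance of the issue
    count: int     # number of items in the report


def prioritised_issues(reports: list) -> list:
    """Keep negative reports and sort them by count * weight, highest first."""
    negative = [r for r in reports if r.sign < 0]
    return sorted(negative, key=lambda r: r.count * r.weight, reverse=True)


reports = [
    Report("Missing titles", sign=-1, weight=2, count=40),   # score 80
    Report("Broken pages", sign=-1, weight=20, count=5),     # score 100
    Report("Indexable pages", sign=+1, weight=1, count=900), # filtered out
]

for report in prioritised_issues(reports):
    print(report.name, report.count, report.count * report.weight)
```

The report with 5 items is listed ahead of the one with 40 items because its weight is higher, which is why the item counts in the issues list do not appear in numerical order.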
DeepRank is a measurement of internal link weight calculated in a similar way to Google’s basic PageRank algorithm. DeepCrawl stores every internal link and starts by giving each link the same value. It then iterates through all the found links a number of times, to calculate the DeepRank for each page, which is the sum of all link values pointing to the page. With each iteration the values move towards their final value.
It is a signal of authority, and can help to indicate the most important URLs in the current report, or within the entire crawl.
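The general idea of an iterative link-weight calculation can be sketched with the basic PageRank scheme. This is a textbook illustration, not the DeepRank algorithm itself; the damping factor, the number of iterations, the handling of pages without outgoing links and the function name `link_rank` are all assumptions.

```python
def link_rank(links: list, iterations: int = 20, damping: float = 0.85) -> dict:
    """Iteratively score pages from a list of (source, target) internal links."""
    pages = sorted({page for link in links for page in link})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # every page starts equal

    out_degree = {page: 0 for page in pages}
    for source, _ in links:
        out_degree[source] += 1

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        # Each link passes on an equal share of its source page's score.
        for source, target in links:
            new_rank[target] += damping * rank[source] / out_degree[source]
        # Pages with no outgoing links spread their score over all pages.
        dangling = sum(rank[p] for p in pages if out_degree[p] == 0)
        for page in pages:
            new_rank[page] += damping * dangling / n
        rank = new_rank
    return rank


links = [("/", "/a"), ("/", "/b"), ("/a", "/"), ("/b", "/"), ("/b", "/a")]
rank = link_rank(links)
for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this small site the home page receives links from both other pages and ends up with the highest score, followed by `/a` and then `/b`. The scores change less with each pass, which matches the description of values moving towards their final value.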
An exact duplicate page is the easiest to detect, but isn’t very useful, as it misses a lot of ‘similar’ pages.
The DeepCrawl algorithm is tuned to allow a small amount of variation. The algorithm finds pages that are almost identical. We ignore very small differences, because web pages often contain small pieces of dynamic content, such as dates.
We classify duplication within our algorithm as:
Duplicate Pages
- Identical Title
- Close to identical Body Content
Duplicate Body Content
- Close to identical Body Content
- Identical Title
We report the most authoritative page (based on its DeepRank score) as a Primary Duplicate and list it under the Primary Pages section. The page(s) that DeepCrawl considers to be nearly identical (based on the above criteria) and hold less authority will be listed as the duplicates.
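The two ideas above, tolerating small differences and choosing the most authoritative page as the primary, can be sketched in Python. This uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; DeepCrawl's actual algorithm and thresholds are not described here, so the 0.95 threshold, the function names and the DeepRank values are all assumptions.

```python
from difflib import SequenceMatcher


def is_near_duplicate(body_a: str, body_b: str, threshold: float = 0.95) -> bool:
    """Treat two bodies as duplicates if their similarity ratio is high enough."""
    return SequenceMatcher(None, body_a, body_b).ratio() >= threshold


def primary_duplicate(pages: list) -> str:
    """Given (url, deeprank) pairs, return the URL with the highest DeepRank."""
    return max(pages, key=lambda page: page[1])[0]


page_a = "Welcome to our store. Prices last updated on 2021-01-01."
page_b = "Welcome to our store. Prices last updated on 2021-01-02."
page_c = "Contact our support team by phone or email."

print(is_near_duplicate(page_a, page_b))  # differ only in the date
print(is_near_duplicate(page_a, page_c))  # unrelated content
print(primary_duplicate([("/store", 4.2), ("/store?ref=home", 7.9)]))
```

The first two pages differ by a single character in a date, so they count as duplicates, while the contact page does not. Raising or lowering the threshold changes how much variation is tolerated.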
In addition, the following page types get excluded from Duplicate reports:
Very occasionally there are false positives, but in the majority of cases the algorithm correctly identifies duplicate pages. DeepCrawl is constantly being fine-tuned, so please let us know if you experience a false positive and send us an example to firstname.lastname@example.org.
If you’d like to find out more about how to identify and handle duplicate pages, read our blog post on how URL duplication could be harming your website and how to stop it.
Duplication is a subjective measure.
We have tuned our algorithm to pick up very similar pages as well as identical pages because most people want to see these. Sometimes it picks up false positives which can be ignored.
The duplication sensitivity settings can be adjusted in the Report Settings if you want to remove some of the similar pages.
The Duplicate Body content report shows URLs where DeepCrawl has looked at the text in the body of the page only, whereas results in the Duplicate Pages report are from DeepCrawl analysing the full page including HTML tags.
The duplicate body content report can sometimes pick up pages which are similar but have different templates.
This can sometimes happen when the default crawl speed of 3 URLs per second is too fast for your site's servers. DeepCrawl then records pages as 503 errors even though they render correctly for the average user.
In your Crawl Settings, you can reduce the max crawl speed to 1 URL per second to reduce the possibility of 503 errors being reported.
The crawl will take longer to complete, but your 5xx reports are less likely to contain 503 errors again.
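As a rough guide to that trade-off, crawl time scales inversely with crawl speed. The Python sketch below is a simplified estimate that assumes the crawler sustains the maximum rate throughout; the URL count is an example figure and the function name is hypothetical.

```python
def crawl_duration_hours(url_count: int, urls_per_second: float) -> float:
    """Estimate crawl time in hours at a constant crawl rate."""
    return url_count / urls_per_second / 3600


url_count = 108_000  # example site size

print(crawl_duration_hours(url_count, 3))  # at 3 URLs per second
print(crawl_duration_hours(url_count, 1))  # at 1 URL per second
```

For this example the estimate rises from 10 hours to 30 hours, so dropping from 3 URLs per second to 1 makes the crawl take about three times as long.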
DeepCrawl can correctly handle characters in any language, including the content length calculations.
URLs with non-Latin characters will be displayed in an unencoded format in the interface and downloads.
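If you need to compare these unencoded URLs with percent-encoded ones from another source, such as server logs, Python's standard `urllib.parse` module converts between the two forms. This is a general illustration using an `example.com` URL, not a description of how DeepCrawl processes URLs internally.

```python
from urllib.parse import quote, unquote

encoded_url = "https://example.com/%E6%97%A5%E6%9C%AC%E8%AA%9E"

# Decode percent-encoded UTF-8 bytes into readable characters.
decoded_url = unquote(encoded_url)
print(decoded_url)

# Encode again, leaving the scheme and path separators untouched.
print(quote(decoded_url, safe=":/"))
```

The decoded form is the one you would see in the interface and downloads, and re-encoding it gives back the original percent-encoded URL.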
This is usually a problem with emails being blocked as potential spam. Please check your spam folder. If possible, try whitelisting the DeepCrawl email address.
This is usually caused by a project configuration issue.
Check that the domain you are trying to crawl is working properly, and the crawler is not disallowed in the site’s robots.txt file.
You can try changing the IP settings or user agent which can sometimes resolve the problem.
It may be the case that DeepCrawl is being blocked by the site server, and in this case, whitelisting a static IP address will allow us to run a successful crawl.