General

DeepCrawl is a cloud-based web crawler that you control.

You can set it to crawl your website, staging environment, external sites, analytics data, backlinks, sitemaps and URL lists, with a host of flexible crawl types.

DeepCrawl helps you analyze your website architecture and understand and monitor technical issues, to improve your SEO performance.

You can use DeepCrawl for:

  • Technical Auditing
  • Site Redevelopment/Migrations
  • Website Change Management
  • Link Auditing
  • Competitor Intelligence
  • Landing Page Analysis
  • Website Architecture Optimization
  • Website Development Testing

DeepCrawl has been designed by experienced SEOs and used extensively in the field to solve real problems.

The level of detail available is more extensive than most other crawlers and the data is presented in a more digestible, actionable format.

Because DeepCrawl is run as a cloud-based service, it can run much larger crawls than software-based crawlers which run on your local computer. Crawls are also unaffected by the power of your local machine, or by other processes your machine is running.

There is a very high level of customization and control available for more experienced users, allowing crawls to be tailored to suit a specific project.

No, you don’t need to do anything under normal circumstances to crawl a public website. No additional tracking tags or authentication processes are normally required.

It’s important to advise your IT manager, or whoever is responsible for hosting your website, before you crawl, to avoid your crawl being blocked.

If you want to crawl a private website on your own network, e.g. a test website or staging environment, you will need to allow DeepCrawl access by whitelisting our user agent or IP address on your network, and apply any basic authentication or DNS configuration needed.
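
As an illustration, here is a minimal nginx sketch that lets a whitelisted crawler IP through while keeping basic authentication for everyone else (203.0.113.10 is a documentation placeholder; use the static IP shown in your DeepCrawl Account area):

    # Staging site: require basic auth, but allow a whitelisted
    # crawler IP through without credentials.
    location / {
        satisfy any;
        allow 203.0.113.10;   # placeholder - your chosen static crawl IP
        deny all;

        auth_basic "Staging";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }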

The majority of analytics packages such as Google Analytics, Webtrends or Omniture use a JavaScript tracking tag which runs inside the user’s browser. DeepCrawl does not run any JavaScript from pages that are crawled and will not affect your analytics data.

Some older analytics packages use log file data stored on the web server. This data can be affected by any crawling activity, including Google or Bing and therefore DeepCrawl too.

Yes, you can run up to 10 crawls simultaneously.

All data is stored using Amazon Web Services.

The AWS cloud infrastructure has been architected to be one of the most secure cloud computing environments available.

Pricing & Payments

Our plans are available on a monthly basis. You can also purchase a one-off add-on for your monthly plan in-platform should you run out of credits and need more to complete a project, as well as purchase recurring project and/or URL add-ons.

No, you can pay on a month-by-month basis.

The Add-on credits are valid for 1 month from the date of purchase.

You can pay in US Dollars, Euros or British Pounds. Select your preferred currency at the top right-hand side of the pricing page, then select the package you want, click buy and follow the steps. You will then be billed in the currency of your choice.

Please contact us on accounts@deepcrawl.com and specify the package that you are interested in purchasing.

All payment-related actions can be found under Subscription within the application. To find your invoices, click on the ‘Payment Details & Invoices’ button on the subscription screen.

The simplest way to do this is to go to Subscription and click the ‘Upgrade’ button under the Credits section. If you have any trouble with this please contact your Account Manager or email our Customer Success team at success@deepcrawl.com.

The simplest way to do this is to go to Subscription and click the ‘Downgrade’ button under the Credits section. If you have any trouble with this please contact your Account Manager or email our Customer Success team at success@deepcrawl.com.

Simply log in to your account and go to your Subscription area. Then click on ‘Reactivate’, located next to your latest package icon.

Log into your account, go to Subscription and click ‘Buy Credits’ under the Credits section. These add-on credits will last for 30 days from the date of purchase.

If you pay via PayPal, the simplest way to cancel your existing subscription is from within your PayPal account.

If you pay via credit or debit card, most high street banks allow you to cancel direct debits via your online banking. Alternatively, call your bank or drop us an email at orders@deepcrawl.com and we’ll cancel it from our side.

Your remaining credit allocation will be available until the expiry date. Please note that you will only have access to the DeepCrawl interface until this date, so if you wish to continue to use your data moving forward, please use the export functions before your account expires.

If you have any trouble with this please contact your Account Manager or email our Customer Success team at success@deepcrawl.com.

If you are paying via PayPal, log in to your PayPal account, click on MyPayPal, select Wallet, and from there you can choose the card details you wish to amend.

If you are paying directly via credit or debit card, you can contact your bank and amend the card details for your direct debit or standing order from there. Alternatively, you can change your details from within the platform via your Subscription area, or email sales@deepcrawl.com and we’ll do it for you.

Your data will be available until the account expires. To continue using your data, please export it before your account expires.

We do not limit the number of different domains you can crawl, but we do have a limit on the number of ‘Active’ projects in your account, depending on your package.

Active projects are those which have had a crawl run in the current billing period. If you have hit your limit, you will only be able to run crawls on the projects which are active until the next billing period.

If you need to increase this, you can purchase more Active Projects for your account with add-ons.

The number of inactive projects you can have in your account is unlimited, which means you don’t have to worry about deleting anything.

API

Yes, the DeepCrawl API is available for all users.

The API key and instructions are available in the API Access area.

Usage is subject to a fair usage policy, but if you have very specific requirements, feel free to run them by us at support@deepcrawl.co.uk

You can find our current API documentation here:

API Documentation

Yes, the interface can be fully white-labeled with your own logo and custom color scheme.

We use the Ruby programming language to process data and generate HTML and JavaScript for the interface.

The service is run entirely within the Amazon Web Services cloud computing platform.

Crawling

DeepCrawl offers a wide range of user agents to use for a crawl including the most common search engines, desktop browsers and mobile devices.

You can also add your own custom user agents.

By default, we crawl as Googlebot, and can be identified by the following string:

Mozilla/5.0 (compatible; Googlebot/2.1; https://deepcrawl.com/bot)

Here’s a comprehensive list covering the available user agents and their full strings:

  • Applebot: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko, https://deepcrawl.com/bot) Version/8.0.2 Safari/600.2.5 (Applebot/0.1)
  • Bingbot: Mozilla/5.0 (compatible; bingbot/2.0; +https://deepcrawl.com/bot)
  • Bingbot Mobile: Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 BingPreview/1.0b https://deepcrawl.com/bot
  • Chrome: Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/534.24 (KHTML, like Gecko https://deepcrawl.com/bot) Chrome/11.0.696.16 Safari/534.24
  • DeepCrawl: deepcrawl (https://deepcrawl.com/bot)
  • Facebook: facebookexternalhit/1.1 (https://deepcrawl.com/bot)
  • Firefox: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1.7 https://deepcrawl.com/bot) Gecko/20091221 Firefox/3.5.7
  • Google Web Preview: Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview https://deepcrawl.com/bot) Version/3.1 Safari/525.13
  • Googlebot: Mozilla/5.0 (compatible; Googlebot/2.1; https://deepcrawl.com/bot)
  • Googlebot-Image: Googlebot-Image/1.0 (https://deepcrawl.com/bot)
  • Googlebot-Mobile (feature phone): Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +https://deepcrawl.com/bot)
  • Googlebot-News: Googlebot-News (https://deepcrawl.com/bot)
  • Googlebot Smartphone: Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +https://deepcrawl.com/bot)
  • Googlebot-Video: Googlebot-Video/1.0 (https://deepcrawl.com/bot)
  • Internet Explorer 8: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0 https://deepcrawl.com/bot)
  • Internet Explorer 6: Mozilla/5.0 (Windows; U; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727 https://deepcrawl.com/bot)
  • iPhone: Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3 https://deepcrawl.com/bot

Requests from the DeepCrawl crawler can come from a variety of different IP addresses. We crawl from AWS IPs and use a Dynamic IP by default.

Our static IPs can be viewed in the Account area of DeepCrawl. Once the Static IP option is selected, that IP will be used for all of your crawls.

DeepCrawl will always identify itself by including ‘DeepCrawl’ within the user agent string.

See above for a comprehensive list of user agent strings.

Only the user agent we use to make the HTTP requests changes. If the server is set up to respond differently to a mobile user agent, then DeepCrawl will receive this information and report it.
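
For example, you can reproduce this check from the command line with curl, sending the same request with two of the user agent strings listed above and comparing the response headers (example.com is a placeholder):

    # Desktop vs mobile user agent: compare what the server returns for each.
    curl -I -A "Mozilla/5.0 (compatible; Googlebot/2.1; https://deepcrawl.com/bot)" https://example.com/
    curl -I -A "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +https://deepcrawl.com/bot)" https://example.com/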

With DeepCrawl, you can set your crawl to run at certain times, at certain speeds (URLs per second) and even set up schedules for your crawls e.g. weekly, daily, constant (24 hours) & more.

This can all be set under Phase 3 of the crawl setup > Crawl Rate Restrictions.

For example, you may want your crawls to run only within a 1am-5am window, to avoid your peak traffic hours. To do this, select ‘Add Restriction’ and add a restriction from 5am to 1am with a crawl rate of 0 URLs per second; no URLs will then be crawled outside the 1am-5am slot.

Most sites never experience a slowdown whilst using DeepCrawl.

Sometimes a site can slow down if its server capacity cannot handle user demand, or if user demand increases while DeepCrawl is running at the same time.

If this is the case, you can limit the maximum speed of the crawler to prevent any slowdown in site performance. You can also optimize your crawl activity further by increasing your crawl rate during known quiet periods, e.g. 1am-5am.

This can all be set with the Crawl Rate restriction settings.

DeepCrawl will obey the robots.txt live on your site, based on the user agent you have selected for the crawl.

You can also use the DeepCrawl Robots Overwrite feature to ignore your current robots.txt file during a crawl, and use the alternative version you have specified.

If DeepCrawl is specifically disallowed in a robots.txt file then we will always respect this (a stealth crawl may allow you to run a successful crawl of the site in this case).
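
For illustration, assuming the crawl runs with the default DeepCrawl user agent token, a robots.txt that disallowed it entirely would look like this:

    User-agent: deepcrawl
    Disallow: /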

DeepCrawl detects and extracts H1, H2 and H3 tags by default and creates reports on multiple H1s and missing H1s.

DeepCrawl does not detect H4, H5 or other heading levels by default. However, these can be captured using DeepCrawl custom extraction. Check out our custom extraction guide to find out how to do it.
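
As a rough sketch, a custom extraction targeting H4 headings could use a regex along these lines (a hypothetical pattern; see the custom extraction guide for the exact syntax DeepCrawl expects):

    <h4[^>]*>([^<]*)</h4>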

DeepCrawl currently looks at the alt text for linked images, which is displayed in the internal linking data reports.

It is possible to use custom extractions in the advanced settings, to identify empty alt tags on unlinked images.

PDF documents are detected if they are linked internally and reported in a list.

If you enable the ‘Check Non-HTML File types’ setting in Advanced Settings, DeepCrawl will check the HTTP status of these links.

Yes, DeepCrawl detects and crawls CSS and JS files to check the HTTP status, and reports on broken or disallowed files.

You can change this setting in Advanced Settings > Scope > Crawl Restrictions.

Note, however, that DeepCrawl does not parse or render the content of CSS and JavaScript files; it only checks their HTTP status. The content of non-HTML files is not analyzed.

IIS supports basic authentication, and this normally works with DeepCrawl. See the guide: Configure Basic Authentication (IIS 7)

Other password solutions may be implemented using cookies, and these won’t work with DeepCrawl, as we do not store cookies.

You can manually pause a crawl at any point during the ‘Crawling’ phase. This can then be resumed at a later time, but will automatically finalize after 72 hours.

A crawl will pause automatically under certain circumstances (i.e. the necessary options have been selected pre-crawl) if it reaches the set limit or runs out of credits before reaching the limit. In any case, the crawl will remain paused for 72 hours before finalizing automatically.

You can also alter the crawl speed, depth and URL limit of the crawl without needing to pause at all.

This depends entirely on the website being crawled, and whether any crawl limitations or restrictions have been applied in the Advanced Settings.

Each project can have up to 30 separate custom extractions, with up to 20 matches and 64KB of data per extraction.

Reports

No. Reports are not available until a crawl has been finalized.

This is because the majority of the calculations DeepCrawl performs, such as duplication detection and internal linking analysis, require a complete set of page URLs before they can begin.

Crawl data, including all tables used to display reports, is backed up in Amazon S3 storage, which is Write Once Read Many and therefore highly reliable. All user and account data is backed up every hour.

Upon account expiry, your account will fall dormant in case you wish to reactivate it at any time. Should you wish to have all of your data permanently deleted, you will need to request this specifically. Please speak to your Account Manager or the Customer Success team at success@deepcrawl.com.

At the moment, we keep crawl data archived for the lifespan of the client’s account.

In addition to calculating the URLs which are relevant to a report, we also calculate the changes in URLs between crawls:

  • Added report: the URL appears in a report but wasn’t in that report in the previous crawl.
  • Removed report: the URL was included in the previous crawl and is present in the current crawl, but is no longer in that specific report.
  • Missing report: the URL was in the previous crawl but is not included in any report in the current crawl (e.g. the URL may have been unlinked since we last crawled, or may now fall outside of the scope of the crawl).
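
A minimal sketch of this logic in Python, using plain sets as stand-ins for crawl data (the URLs are hypothetical):

    # URLs in one report for the previous and current crawl,
    # plus every URL found in the current crawl overall.
    prev_report = {"/a", "/b", "/c"}
    curr_report = {"/a", "/d"}
    curr_crawl = {"/a", "/b", "/d", "/x"}

    added = curr_report - prev_report                   # {"/d"}: new to this report
    removed = (prev_report & curr_crawl) - curr_report  # {"/b"}: still crawled, left the report
    missing = prev_report - curr_crawl                  # {"/c"}: gone from the crawl entirely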

Every report is assigned a weight, representing the importance of the issue and its potential impact, and a sign: positive, negative, or neutral. The list of issues is filtered to negative reports and ordered by the number of items in the report multiplied by the weight, which is why issues are rarely displayed in numerical order. The changes are ordered by the number of added or removed issues in the report, multiplied by their weight.

DeepRank is a measurement of internal link weight calculated in a similar way to Google’s basic PageRank algorithm. DeepCrawl stores every internal link and starts by giving each link the same value. It then iterates through all the found links a number of times, to calculate the DeepRank for each page, which is the sum of all link values pointing to the page. With each iteration the values move towards their final value.

It is a signal of authority, and can help to indicate the most important URLs in the current report, or within the entire crawl.
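
As a minimal illustration of the iterative calculation described above, here is a simplified, PageRank-style loop over a toy link graph in Python (illustrative only, not DeepCrawl’s actual implementation):

    # Toy link graph: each page maps to the pages it links to.
    links = {
        "/": ["/a", "/b"],
        "/a": ["/b"],
        "/b": ["/"],
    }

    # Start every page with the same value, then iterate:
    # each page's new score is the sum of the values flowing in
    # from the pages that link to it.
    rank = {page: 1.0 for page in links}
    for _ in range(20):
        new_rank = {page: 0.0 for page in links}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank

    print(rank)  # values move towards stable scores with each iteration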

An exact duplicate page is the easiest to detect, but isn’t very useful, as it misses a lot of ‘similar’ pages.

The DeepCrawl algorithm is tuned to allow a small amount of variation. The algorithm finds pages that are almost identical. We ignore very small differences, because web pages often contain small pieces of dynamic content, such as dates.

We classify duplication within our algorithm as:

Duplicate Pages

  1. Identical Title
  2. Close to identical Body Content

Duplicate Body Content

  1. Close to identical Body Content

Duplicate Titles

  1. Identical Title

We report the most authoritative page (based on its DeepRank score) as a Primary Duplicate and list it under the Primary Pages section. The page(s) that DeepCrawl considers to be nearly identical (based on the above criteria) and hold less authority will be listed as the duplicates.
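
As a rough sketch of the ‘close to identical’ idea, using Python’s standard difflib for a similarity ratio (the threshold here is illustrative, not DeepCrawl’s):

    import difflib

    def is_near_duplicate(body_a, body_b, threshold=0.95):
        """Treat two page bodies as duplicates when almost all of the
        content matches, tolerating small dynamic fragments like dates."""
        ratio = difflib.SequenceMatcher(None, body_a, body_b).ratio()
        return ratio >= threshold

    a = "Welcome to our store. Updated 1 May. Free shipping over $50."
    b = "Welcome to our store. Updated 2 May. Free shipping over $50."
    print(is_near_duplicate(a, b))  # True: only the date differs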

In addition, the following page types get excluded from Duplicate reports:

  1. Disallowed
  2. Noindex
  3. Canonicalized

Very occasionally there are false positives, but in the majority of cases the algorithm correctly identifies duplicate pages. DeepCrawl is constantly being fine-tuned, so please let us know if you experience a false positive and send us an example to support@deepcrawl.com.

If you’d like to find out more about how to identify and handle duplicate pages, read our blog post on how URL duplication could be harming your website and how to stop it.

Duplication is a subjective measure.

We have tuned our algorithm to pick up very similar pages as well as identical pages because most people want to see these. Sometimes it picks up false positives which can be ignored.

The duplication sensitivity settings can be adjusted in the Report Settings if you want to remove some of the similar pages.

The Duplicate Body Content report shows URLs where DeepCrawl has looked at the text in the body of the page only, whereas results in the Duplicate Pages report come from DeepCrawl analyzing the full page, including HTML tags.

The duplicate body content report can sometimes pick up pages which are similar but have different templates.

The Analytics data is from the past 30 days. Alternatively, you can manually upload up to 6 months’ worth of Analytics data when setting up your crawls, by following the CSV format provided.

This can sometimes be caused by the default crawl speed of 3 URLs per second being too fast for your site’s servers, so pages are recorded by DeepCrawl as a 503 error even though they render fine for the average user.

In your Crawl Settings, you can reduce the max crawl speed to 1 URL per second to reduce the possibility of 503 errors being reported.

This means it will take longer to complete your crawl, but reduces the possibility of 503 errors appearing in your 5xx reports again.

DeepCrawl can correctly handle characters in any language, including the content length calculations.

URLs with non-Latin characters will be displayed in an unencoded format in the interface and downloads.
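
To illustrate what ‘unencoded’ means here, a short Python example (the URL is a placeholder):

    from urllib.parse import unquote

    encoded = "https://example.com/%E3%83%9A%E3%83%BC%E3%82%B8"
    print(unquote(encoded))  # https://example.com/ページ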

Problems

This is usually a problem with emails being blocked as potential spam. Please check your spam folder and, if possible, try whitelisting the DeepCrawl email address: donotreply@deepcrawl.co.uk

This is usually caused by a project configuration issue.

Check that the domain you are trying to crawl is working properly, and the crawler is not disallowed in the site’s robots.txt file.

You can try changing the IP settings or user agent which can sometimes resolve the problem.

Use the Fetch as DeepCrawl tool to see the response that DeepCrawl is getting, as this may help reveal the issue and is useful for spot-checking URLs:
https://tools.deepcrawl.co.uk/fetch-as-deepcrawl/

It may be the case that DeepCrawl is being blocked by the site server, and in this case, whitelisting a static IP address will allow us to run a successful crawl.

Affiliates

Nothing! Except your desire to be an affiliate. You don’t even need a website (although having your own website helps).

You can promote our product by writing reviews and blog posts; advertising on your site or on search engines; posting in internet forums; posting a link on social media; recording videos; or simply emailing your friends and people you know with your affiliate link to our product.

All you need to do is send visitors to our site via a special link (called an ‘affiliate link’). Then, if they buy anything from us within 60 days using your special link, you will get 20% of the sale value upon signup, as well as a 10% monthly recurring commission for 12 months. Please remember to add the rel="nofollow" attribute to any affiliate links on your website.

DeepCrawl’s affiliate programme is powered by Post Affiliate Pro, an advanced affiliate tracking software. Post Affiliate Pro uses a combination of cookies and IP address to track referrals for the best possible reliability.

When the visitor follows your affiliate link to our site, our affiliate system registers this referral and places a cookie on their computer.

When the visitor pays for the product, the affiliate system checks for the cookie (if it’s not found, it checks for the IP address of the referral) and credits your account with 20% of the purchase value. This process is entirely automatic. All your referrals will be properly tracked.

Post Affiliate Pro is used by thousands of internet merchants and affiliates worldwide.

We currently support bank transfers. The minimum payout value is £100, so this will be transferred to you once you have reached this threshold. You may increase this minimum payout value via your affiliate account dashboard. We do not support bank checks at the moment.
Payments are issued in GBP and are paid once a month, always on the 15th.

Setting up an account is very easy and completely free.

All you need to do is fill in the signup form. After a review by our affiliate manager, you will receive an email with your password and other information.

As our affiliate, you will have your own control panel where you can see detailed statistics of traffic and sales you referred, news and training materials, and a choice of banners and text links.

A few simple steps:

  • Fill in the signup form.
  • Receive your password and other info by email.
  • Log in to your own affiliate panel and choose from various banners, text links, reviews and other promotional materials.
  • Place some of these banners/links onto the home page of your website or as many pages as you want.
  • Receive 20% commission from every first sale, then 10% monthly recurring payment for 12 months.
