DeepCrawl supports uploading a wide range of file types as crawl sources. This page runs through the valid file types for each source, and where to find them.

Base URL

All source uploads allow you to set a 'Base URL'. This is used only when we encounter a relative URL in your upload - for instance, if a URL in your upload starts with "/", we will prepend the Base URL (e.g. http://www.example.com).
If you do not set a Base URL, we will use the project's primary domain instead.
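
For illustration, the resolution behaviour matches standard URL joining. A minimal Python sketch (the Base URL and paths here are placeholders):

from urllib.parse import urljoin

BASE_URL = "http://www.example.com"  # a hypothetical Base URL

# Relative URLs starting with "/" are resolved against the Base URL;
# absolute URLs in the upload are left untouched.
print(urljoin(BASE_URL, "/category/page.html"))
# -> http://www.example.com/category/page.html
print(urljoin(BASE_URL, "http://other.example.com/page"))
# -> http://other.example.com/page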

Have a file which is not supported?

DeepCrawl can accept any UTF-8 CSV, even one in a format not mentioned on this page. When you upload it to your project, DeepCrawl will ask you which columns contain the required data points.

Note that DeepCrawl can only accept data which is aggregated to a URL level - i.e. a file where each row contains a URL and the number of backlinks to that URL is fine, but a raw backlink export containing details of every single link is not supported.
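
If all you have is a raw link-level export, a short script can aggregate it to URL level first. This is a minimal Python sketch, assuming a CSV with a "Target URL" column naming the linked page - the column and file names are placeholders:

import csv
from collections import Counter

# Count how many backlink rows point at each target URL.
counts = Counter()
with open("raw_backlinks.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        counts[row["Target URL"]] += 1  # hypothetical column name

# Write one row per URL - the aggregated shape DeepCrawl expects.
with open("aggregated_backlinks.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "backlink_count"])
    for url, count in counts.most_common():
        writer.writerow([url, count])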

Analytics

Google Analytics

Go to Behaviour > Site Content > Landing Pages
Scroll to the bottom of the report, and select "Show Rows: 5000"
At the top of the page, choose Export > CSV
The CSV exported by Google Analytics is not currently supported as-is; you must reformat the data to match our default Analytics format, found at https://www.deepcrawl.com/wp-content/uploads/2017/09/ExampleAnalyticsUpload-3.csv
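
If you are comfortable with a script, the reshaping can be done in a few lines of Python. This is a hedged sketch: the "Landing Page" and "Sessions" column names are assumptions about your export, and the output headers are placeholders - check the example file above for the exact layout DeepCrawl expects.

import csv

# Read the Google Analytics export; GA CSVs often begin with comment
# lines starting with '#', so skip those before parsing the table.
with open("ga_landing_pages.csv", newline="", encoding="utf-8") as f:
    lines = [line for line in f if not line.startswith("#")]

with open("analytics_upload.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "sessions"])  # placeholder headers - match the example file
    for row in csv.DictReader(lines):
        if row.get("Landing Page"):  # skip totals and blank rows
            writer.writerow([row["Landing Page"], row["Sessions"]])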

AdWords URLs

AdWords destination URLs can be imported into DeepCrawl's Analytics metrics to help you ensure that you are sending users to relevant pages, and that those pages are not broken or orphaned.

In AdWords, load the "Reports" screen
Choose "Predefined Reports" > "Basic" > "Final URL"
Download this report as a CSV, and upload it to DeepCrawl's Analytics tab in your project's settings.

Backlinks

Google Search Console

Go to Google Search Console and find the 'All Linked Pages' report.
Select "Search Traffic" > "Links To Your Site", then choose the "More >" link under "Your most linked content".

Download the CSV by clicking "Download this table"

Majestic

Find your website in Majestic, choose the “Pages” report, and export this data to a CSV using the "Export" button.

Ahrefs

In Ahrefs, choose the Pages > Best by links report, and export this data to a CSV using the 'Export' button. Download the CSV "For Open Office, Libre & other (UTF-8)" and upload this to DeepCrawl.

Open Site Explorer

In Open Site Explorer, choose the Top Pages report and export the data to CSV using the "Request CSV" link.

Default Format

If you do not have access to any of the above data sources, you can reformat your data to match our default format: https://www.deepcrawl.com/wp-content/uploads/2017/09/ExampleBacklinksUpload-1.csv

Log File Summaries

DeepCrawl supports a range of exports from your favourite log file analyser. We are unable to process raw log files; uploads must be summaries of the number of requests at a URL level.
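
For example, a raw access log can be summarised into this shape with a few lines of Python. This is a hedged sketch assuming the common combined log format; the log path, output columns, and the crude Googlebot check are placeholders rather than a definitive implementation:

import csv
import re
from collections import Counter

# Captures the request path and the user agent from a combined-format log line.
LINE = re.compile(r'"(?:GET|POST|HEAD) (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

counts = Counter()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        m = LINE.search(line)
        if m and "Googlebot" in m.group(2):  # simplistic bot check, for illustration only
            counts[m.group(1)] += 1

# One row per URL with a request count - a URL-level summary, not a raw log.
with open("log_summary.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "googlebot_requests"])
    for url, n in counts.most_common():
        writer.writerow([url, n])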

Screaming Frog Log Analyser

Note that the "Screaming Frog Web Crawler" does not process log files. We support exports from the "Screaming Frog Log File Analyser".

In the Screaming Frog Log File Analyser, open the URLs tab and export the data.

Splunk

Run the following queries to export summary statistics. You will normally need to edit these to match your setup: "host" should be the domain you're exporting data for, "useragent" is the user agent field, and "uri" is the URL field. Please contact support if you need assistance with this.

Googlebot:
host="[YOUR DOMAIN]" | stats count(eval((useragent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" OR useragent="Googlebot-Image/1.0" OR useragent="Googlebot/2.1 (+http://www.google.com/bot.html)") )) as "google_desktop_requests",  count(eval((useragent = "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" AND NOT useragent="*deepcrawl*") )) as "google_smartphone_requests" by uri | rename uri to "splunk_uri" | where google_desktop_requests > 0 OR google_smartphone_requests > 0

Bingbot:
host="[YOUR DOMAIN]" | stats count(eval((useragent = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"))) as "bing_desktop_requests",  count(eval((useragent = "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" OR useragent = "Mozilla/5.0 (Windows Phone 8.1; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 530) like Gecko (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"))) as "bing_smartphone_requests" by uri | rename uri to "splunk_uri" | where bing_desktop_requests > 0 OR bing_smartphone_requests > 0

Logz.io

In logz.io, open the Kibana Visualize screen and create a query using the Metric aggregation "Count", with the buckets: Split Rows > Aggregation: Terms, Field: request, Order By: "metric: Count", Order: Descending, Size: 200.
Use the following queries, and export using the "Export Raw" link. Upload this file to DeepCrawl.
By default, logz.io will only export the top 200 pages using this method. You should ask your logz.io account manager to increase this limit.

Googlebot Desktop Crawler:
agent:/.*Mozilla\/5.0 \(compatible\; Googlebot\/2.1\; \+http:\/\/www.google.com\/bot.html\).*/

Googlebot Mobile Crawler:
agent:/.*Mozilla\/5.0 \(Linux; Android 6.0.1; Nexus 5X Build\/MMB29P\) AppleWebKit\/537.36 \(KHTML, like Gecko\) Chrome\/41.0.2272.96 Mobile Safari\/537.36 \(compatible; Googlebot\/2.1; \+http:\/\/www.google.com\/bot.html\).*/

Default Format

If you do not have access to any of the above data sources, you can reformat your data to match our default format: https://www.deepcrawl.com/wp-content/uploads/2017/09/ExampleLogSummaryUpload-1.csv