How To Manually Upload Files in DeepCrawl

Adam Gent

On 23rd September 2019 • 14 min read

DeepCrawl supports manually uploading a wide range of file types to our crawl sources. This page runs through the valid file types for each source and where to find them.

 

Supported crawl sources

The following crawl sources, which can be added in step 2 of a project’s advanced settings, support manually uploading data in DeepCrawl: Sitemaps, Analytics, Backlinks, Log Files and URL Lists.

A full breakdown of how to manually upload files for each source can be found below.

 

Sitemaps

A sitemap file can be uploaded to DeepCrawl by dragging and dropping a file (up to 100MB file size).

DeepCrawl sitemap upload

The sitemap source supports the following file types:

 

Analytics

Our Google Analytics integration allows users to connect their Google Analytics profile to DeepCrawl, which automatically pulls in analytics data. However, there may be times when you want to manually upload analytics data from other tools (or from Google Analytics).

A file can be uploaded to DeepCrawl by dragging and dropping it into the Analytics Source (up to 100MB file size).

DeepCrawl Google Analytics upload

When manually uploading a file please be aware that it needs to follow a specific format for our system to automatically recognise supported metrics.

Google Analytics excel format

Download an example file with the supported format here.

If an uploaded file is not in the supported format, it will not be processed automatically. Instead, the upload settings will appear.

DeepCrawl Google Analytics upload

You will need to manually match up the following metrics with the data you have uploaded:

By default, any relative URLs in the uploaded file will use the primary domain in the project settings. If you wish to override the primary domain when uploading data, then use the “Base Domain” in the upload settings.
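As a rough illustration of what resolving relative URLs against a domain looks like, here is a minimal Python sketch. It is not DeepCrawl's own logic, which is not documented on this page; the function name `resolve_url` and the example domains are hypothetical.

```python
from urllib.parse import urljoin


def resolve_url(url: str, base_domain: str) -> str:
    """Return an absolute URL, resolving a relative path against base_domain."""
    # urljoin prefixes relative paths and leaves absolute URLs untouched.
    return urljoin(base_domain, url)


# Hypothetical domains, used only for illustration.
PRIMARY_DOMAIN = "https://www.example.com"
BASE_DOMAIN_OVERRIDE = "https://shop.example.com"

if __name__ == "__main__":
    print(resolve_url("/landing-page", PRIMARY_DOMAIN))
    print(resolve_url("/landing-page", BASE_DOMAIN_OVERRIDE))
    print(resolve_url("https://blog.example.com/post", BASE_DOMAIN_OVERRIDE))
```

In this sketch, the same relative path ends up on a different host depending on which domain is used, while a URL that is already absolute is left unchanged.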

DeepCrawl Google Analytics columns update

Note: If the Base Domain is a subdomain or separate domain that is not included in the secondary domain settings, DeepCrawl will not crawl the landing pages.

Once you have manually matched the data with the metrics in the upload settings, hit “save changes”. The file will be processed by DeepCrawl and, if supported, will be marked as processed.

DeepCrawl Google Analytics processed

If your data exceeds the 100MB limit per file, split it across several files and repeat this process for each file uploaded. Once everything is processed you can run a crawl with the analytics crawl source enabled.
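If a single export is larger than the 100MB limit, one option is to split it into smaller CSV files before uploading. The sketch below is one possible approach, not an official DeepCrawl tool: the function name `split_csv` and the output file naming are hypothetical, and it assumes a UTF-8 CSV with a single header row that should be repeated in every part.

```python
import csv
import io
from pathlib import Path

MAX_BYTES = 100 * 1024 * 1024  # 100MB per-file limit mentioned above


def _encode(row: list[str]) -> bytes:
    """Serialise one CSV row to UTF-8 bytes so its size can be measured."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue().encode("utf-8")


def split_csv(source: Path, out_dir: Path, max_bytes: int = MAX_BYTES) -> list[Path]:
    """Split source into parts of at most max_bytes, repeating the header row.

    A single row larger than the limit is still written to its own part.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    out, size = None, 0
    with source.open(newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = _encode(next(reader))
        for row in reader:
            line = _encode(row)
            # Start a new part when the next row would exceed the limit.
            if out is None or size + len(line) > max_bytes:
                if out is not None:
                    out.close()
                path = out_dir / f"{source.stem}_part{len(parts) + 1}.csv"
                parts.append(path)
                out = path.open("wb")
                out.write(header)
                size = len(header)
            out.write(line)
            size += len(line)
    if out is not None:
        out.close()
    return parts
```

Calling `split_csv(Path("analytics.csv"), Path("parts"))` would write `analytics_part1.csv`, `analytics_part2.csv` and so on into the `parts` folder, each of which can then be uploaded separately.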

Please see below for the steps that need to be taken to upload Google Analytics and AdWords data into DeepCrawl.

Google Analytics

In Google Analytics go to Behaviour > Landing Pages. If you want to upload pages with organic search traffic, then make sure you are using the Organic Traffic segment.

DeepCrawl Google Analytics upload

Change the number of rows to 5000 to obtain the maximum number of rows on a single page.

Google Analytics row numbers

Note: If you have a lot of URLs then you will need to download multiple pages from Google Analytics.

Export the data in a CSV file format.

Google Analytics CSV Export

Upload the exported CSV file to DeepCrawl in the Analytics crawl source. When the file is uploaded the Upload Settings will be loaded. As already discussed, you will need to assign columns with supported metrics before submitting.

DeepCrawl Google Analytics Upload

Once the metrics are assigned, click “Save Changes”. If there are no errors, then the Google Analytics file will be marked as “Processed” in the status.

DeepCrawl Google Analytics Upload Processed

AdWords

AdWords destination URLs can be imported into DeepCrawl's Analytics metrics to help you ensure that you are sending users to relevant pages, and that they're not broken or orphaned.

In AdWords, load the "Reports" screen. Choose "Predefined Reports" > "Basic" > "Final URL".

Google Adwords Analytics

Export the Final URL report as a CSV. Open the file, remove the first two rows, and make sure all metric names are on the top row (this makes the file easier to upload).
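If you prefer not to edit the export by hand, the row removal can be scripted. The following is a minimal sketch under stated assumptions: the function name `strip_leading_rows` is hypothetical, the export is assumed to be UTF-8 encoded, and the two rows to remove are assumed to occupy exactly the first two lines of the file.

```python
from pathlib import Path


def strip_leading_rows(source: Path, target: Path, rows_to_skip: int = 2) -> None:
    """Copy source to target, leaving out its first rows_to_skip lines."""
    # newline="" keeps the original line endings unchanged.
    with source.open(encoding="utf-8", newline="") as src, \
            target.open("w", encoding="utf-8", newline="") as dst:
        for index, line in enumerate(src):
            if index >= rows_to_skip:
                dst.write(line)
```

For example, `strip_leading_rows(Path("final-url-report.csv"), Path("final-url-clean.csv"))` would write a copy whose first line is the row of column headings. If your export uses a different encoding, change the `encoding` arguments accordingly.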

Google Adwords CSV Data

Upload this to DeepCrawl's Analytics tab in your Project's settings. Once the file is uploaded, the Upload Settings will appear. As already discussed, you will need to assign columns with supported metrics before submitting.

Uploading Adwords CSV data

Note: Use Clicks as the metric for the Sessions column.

Once the metrics are assigned, click “Save Changes”. If there are no errors, then the Google AdWords file will be marked as “Processed” in the status.

Uploading Adwords processed data

 

Backlinks

Backlink data can be uploaded to DeepCrawl by dragging and dropping a CSV file (up to 100MB).

DeepCrawl upload backlink data

When manually uploading a backlink .csv file, be aware that it needs to follow a specific format for our system to automatically recognise metrics to pull into reports.

Backlink data CSV example

Download an example file with the supported format here.

If an uploaded file is not in the supported format, it will not be processed automatically. Instead, the upload settings will appear.

Editing backlink data in DeepCrawl

You will need to manually match up the following metrics with the data you have uploaded:

By default, any relative URLs in the uploaded file will use the primary domain in the project settings. If you wish to override the primary domain when uploading data, then use the “Base Domain” in the upload settings.

Backlink editor columns

Note: If the Base Domain is a subdomain or separate domain that is not included in the secondary domain settings, DeepCrawl will not crawl the landing pages.

Once you have manually matched the data with the metrics in the upload settings, hit “save changes”. The file will be processed by DeepCrawl and, if supported, will be marked as processed.

Backlink data processed in DeepCrawl

If your data exceeds the 100MB limit per file, split it across several files and repeat this process for each file uploaded. Once everything is processed you can run a crawl with the backlinks crawl source enabled.

Google Search Console

In Google Search Console go to the “Top linked pages – externally” report and click download to get a list of your top target pages with backlinks and linking sites.

GSC download link data

Majestic

Find your website in Majestic, choose the “Pages” report, and export this data to a CSV using the "Export" button.

Download majestic data

Ahrefs

In Ahrefs, choose the Pages > Best by links report, and export this data to a CSV using the 'Export' button. Download the CSV "For Open Office, Libre & other (UTF-8)" and upload this to DeepCrawl.

Download ahrefs data

Open Site Explorer

In Open Site Explorer, choose the Top Pages report and export the data to CSV using the "Request CSV" link.

Download Open Site Explorer data

 

Log Files

DeepCrawl supports a range of exports from your favourite log file analyser. As we are unable to process raw log files, these exports must be summaries of the number of requests at a URL level.
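To show what a URL-level summary means in practice, here is a minimal Python sketch that counts Googlebot requests per URL from raw access log lines. Everything specific in it is an assumption: the logs are assumed to be in the common "combined" format, the desktop/smartphone split is a simple heuristic based on the word "Mobile" in the user agent, the function names are hypothetical, and the output column names are illustrative rather than a confirmed DeepCrawl schema (check the example file for the supported format). It also does not verify that requests really come from Google.

```python
import csv
import re
from collections import Counter

# Captures the request path and user agent from a combined-format log line
# (the log format is an assumption; adjust the pattern to match your server).
LOG_LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<url>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarise_googlebot(lines):
    """Count Googlebot desktop and smartphone requests per URL."""
    desktop, smartphone = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match["agent"]:
            continue
        # Heuristic: treat user agents containing "Mobile" as smartphone.
        bucket = smartphone if "Mobile" in match["agent"] else desktop
        bucket[match["url"]] += 1
    return desktop, smartphone


def write_summary(desktop, smartphone, path):
    """Write one row per URL; the column names are illustrative only."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "google_desktop_requests", "google_smartphone_requests"])
        for url in sorted(set(desktop) | set(smartphone)):
            writer.writerow([url, desktop[url], smartphone[url]])
```

A typical use would be to open the raw log file, pass the file object to `summarise_googlebot`, and then call `write_summary` to produce a CSV with one row per URL.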

Log file data can be uploaded to DeepCrawl by dragging and dropping a CSV file (up to 100MB).

Log File data upload

When manually uploading a log file, be aware that it needs to follow a specific format for our system to automatically recognise metrics to pull into reports.

Log File data csv upload

Download an example file with the supported format here.

If an uploaded file is not in the supported format, it will not be processed automatically. Instead, the upload settings will appear.

You will need to manually match up the following metrics with the data you have uploaded:

Screaming Frog Log File Analyser

Note: The "Screaming Frog Web Crawler" does not process log files. We support exports from the "Screaming Frog Log File Analyser".

In the Screaming Frog Log File Analyser, open the URLs tab, and export the data.

Screaming Frog log analyser

Splunk

Run the following queries to export summary statistics. You will normally need to edit these to match your setup: "host" should be the domain you're exporting data for, "useragent" is the user agent field, and "uri" is the URL field.

Googlebot:

host="[YOUR DOMAIN]" | stats count(eval((useragent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" OR useragent="Googlebot-Image/1.0" OR useragent="Googlebot/2.1 (+http://www.google.com/bot.html)") )) as "google_desktop_requests", count(eval((useragent = "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" AND NOT useragent="*deepcrawl*") )) as "google_smartphone_requests" by uri | rename uri AS "splunk_uri" | where google_desktop_requests > 0 OR google_smartphone_requests > 0

Bingbot:

host="[YOUR DOMAIN]" | stats count(eval((useragent = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"))) as "bing_desktop_requests", count(eval((useragent = "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" OR useragent = "Mozilla/5.0 (Windows Phone 8.1; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 530) like Gecko (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"))) as "bing_smartphone_requests" by uri | rename uri AS "splunk_uri" | where bing_desktop_requests > 0 OR bing_smartphone_requests > 0

Please contact our customer support team or your assigned Customer Success Manager for assistance in manually uploading data from Splunk.

Logz.io

In Logz.io, open the Kibana Visualize tab and create a query using the Metric aggregation "count" and the following bucket settings: Split Rows > Aggregation: Terms, Field: request, Order By: "metric: Count", Order: Descending, Size: 200.

Logz io data download

Use the following queries, export the results using the "Export Raw" link, then upload the file to DeepCrawl. By default, Logz.io will only export the top 200 pages using this method, so you should ask your Logz.io account manager to increase this limit.

Googlebot Desktop Crawler:

agent:/.*Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html).*/

Googlebot Mobile Crawler:

agent:/.*Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html).*/

 

URL Lists

DeepCrawl supports uploading of URL lists to be crawled as a source in the project settings.

Upload URL list data

The URL list source supports .txt or .csv file types up to 100MB. Any text files (.txt) uploaded should have each URL on a new line.
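If your URLs come from another tool or script, a text file in this shape is straightforward to generate. The sketch below is one hypothetical way to do it: `write_url_list` is not a DeepCrawl function, and the de-duplication and size check are optional conveniences rather than documented requirements.

```python
from pathlib import Path

MAX_BYTES = 100 * 1024 * 1024  # 100MB upload limit mentioned above


def write_url_list(urls, target: Path, max_bytes: int = MAX_BYTES) -> int:
    """Write unique, non-empty URLs to target, one per line; return the count."""
    # dict.fromkeys removes duplicates while keeping the original order.
    unique = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    with target.open("w", encoding="utf-8", newline="\n") as f:
        for url in unique:
            f.write(url + "\n")
    if target.stat().st_size > max_bytes:
        raise ValueError(f"{target} exceeds the {max_bytes}-byte upload limit")
    return len(unique)
```

For example, `write_url_list(["https://www.example.com/a", "/b"], Path("urls.txt"))` writes two lines to `urls.txt`; the example URLs are placeholders.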

URL list text file

Any comma-separated values (.csv) files uploaded should have URLs listed on each new row in the first column.

URL list csv file

As with all crawl sources, any relative URLs in the supported files will inherit the project’s Primary Domain. Alternatively, you can set a different base domain for each upload in the Upload Settings.

Once uploaded, the file will be marked as processed in the URL Lists crawl source.

URL list processed

 

Any questions?

If you have any further questions about manually uploading files for different crawl sources in DeepCrawl then don’t hesitate to get in touch.

Author

Adam Gent

Search Engine Optimisation (SEO) professional with over 8 years’ experience in the search marketing industry. I have worked with a range of client campaigns over the years, from small and medium-sized enterprises to FTSE 100 global high-street brands.
