BigQuery SQL Cheat Sheet

Alec Bertram
Alec Bertram

On 8th January 2021 • 8 min read

Our BigQuery connector gives you easy access to run SQL queries across all of the crawl data in your account. We’ve structured the data to be flexible, and have listed some easy SQL queries on this page to allow you to quickly find the data you need.

Data format

Each DeepCrawl project is a BigQuery dataset which uses its ID as its name (such as project_123456), each crawl is a table inside this dataset – each table is named crawl_urls_[DATE]-[crawl ID] such as ‘crawl_urls_20201225_654321‘.

The crawl data itself contains all of the data that we know about each URL in the crawl. 

Permissions

To ensure that only the right people have access to the right data, we share access with a specific group of users on a per-project basis. This way, you can share a given project with a single client, and no other clients will be able to access that data.

Finding your Project ID

You can find your project ID in the URL of a crawl report – the format of our URLs is https://app.deepcrawl.com/accounts/:account_id/projects/:project_id/crawls/:crawl_id, so in the URL https://app.deepcrawl.com/accounts/123/projects/456/crawls/789, the project ID is 456

Data Projects & Billing Projects 

Unless you request otherwise, your data will be saved into the BigQuery project “deepcrawl-bi” and shared with you. While you will have the permission to access this data, this project does not come with any querying capabilities. The querying capability will come from a separate Billing Project that will be provided to you (this is specific to your organization, so you should not share it). Therefore, when using data in the deepcrawl-bi project with any tool, you must specify the separate Billing project. 

If you require DeepCrawl data to be saved to your own BigQuery project or within your own GVPC, please discuss this with your CSM. 
 

DeepCrawl SQL Cheatsheet

We’ve structured our data in a way that will allow you to get all of the insights that you need. Rather than manually selecting a given crawl, you can use the below queries to logically choose the table you want to work with (such as “latest crawl”, “all crawls from the last 30 days”). The connector comes with some default data views which internally use these queries, so in some cases, it is easier to use those.

Default views

The following views are available within all datasets in BigQuery:

All of the below examples assume that your project ID is 1234

Get all data from the latest crawl

Use the view _crawl_urls_latest or the SQL query:

SELECT * 
FROM 'deepcrawl-bi.project_1234.crawl_urls_*' 
WHERE 
_TABLE_SUFFIX = (SELECT MAX(_TABLE_SUFFIX) from 'deepcrawl-bi.project_1234.crawl_urls_*') 

Get all 404 pages from the latest crawl in a given project

SELECT url, page_title, http_status_code, search_console_total_clicks 
FROM 'deepcrawl-bi.project_1234.crawl_urls_*' 
WHERE 
http_status_code = 404 AND
_TABLE_SUFFIX = (SELECT MAX(_TABLE_SUFFIX) from 'deepcrawl-bi.project_1234.crawl_urls_*') 

Get all data from the previous crawl

Use the view _crawl_urls_before_latest or the SQL query:

SELECT * 
FROM 'deepcrawl-bi.project_1234.crawl_urls_*'
WHERE 
 _TABLE_SUFFIX = (SELECT DISTINCT _TABLE_SUFFIX FROM 'deepcrawl-bi.project_1234.crawl_urls_*' 
ORDER BY _TABLE_SUFFIX DESC LIMIT 1 OFFSET 1)

Get all data from all crawls in the past 30 days

Use the view _crawl_urls_30d or the SQL query:

SELECT *
FROM 'deepcrawl-bi.project_1234.crawl_urls_*'
WHERE
_TABLE_SUFFIX > CAST(DATETIME_ADD(CURRENT_DATETIME(), INTERVAL -30 DAY) AS STRING)

Get the number of 404 pages for each crawl in the past 30 days

SELECT
CAST(crawl_id AS STRING) as crawl_id,
CAST(MAX(crawl_datetime) AS TIMESTAMP) as datetime,
COUNT(DISTINCT url) as count_urls,
FROM 'deepcrawl-bi.project_1234.crawl_urls_*'
WHERE
http_status_code = 404 AND
CAST(crawl_datetime AS TIMESTAMP) > CAST(DATETIME_ADD(CURRENT_DATETIME(), 
INTERVAL -30 DAY) AS TIMESTAMP)
GROUP BY _TABLE_SUFFIX, crawl_id
ORDER BY _TABLE_SUFFIX DESC

Find URLs where the page title has changed since the last crawl

SELECT latest.url, latest.page_title 'current_page_title', previous.page_title 'previous_page_title'
FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' latest, 'deepcrawl-bi.project_1234._crawl_urls_before_latest' 
previous 
WHERE latest.url_digest=previous.url_digest AND latest.page_title != previous.page_title

Find the number of URLs crawled per day for the past 30 days

SELECT COUNT(rownum), SUBSTR(crawl_datetime,0,10) 
FROM 'deepcrawl-bi.project_1234._crawl_urls_30d' 

GROUP BY SUBSTR(crawl_datetime,0,10)

Create a Venn Diagram of where URLs were found (in Data Studio)

SELECT REGEXP_REPLACE(CONCAT(IF(found_in_web_crawl=TRUE,"DeepCrawl, ",""),IF(found_in_sitemap=TRUE,"Sitemap, ",""),IF(found_in_google_analytics=TRUE,"Analytics, ",""),
IF(found_in_log_summary=TRUE,"Google, ",""),IF(found_in_google_search_console=TRUE,"Search Console, ",""),IF(found_in_list=TRUE,"List","")), ', $', "") 'dimension',
 COUNT(rownum) 'metric'  FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE 
SAFE_ADD(SAFE_ADD(SAFE_ADD(SAFE_ADD(if(found_in_web_crawl=TRUE,1,0),if(found_in_sitemap=TRUE,1,0)),if(found_in_log_summary=TRUE,1,0)),if(found_in_google_search_console=TRUE,1,0)),if(found_in_list=TRUE,1,0)) > 1  
GROUP BY dimension
UNION ALL
SELECT "DeepCrawl" 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE found_in_web_crawl=true GROUP BY dimension
UNION ALL
SELECT "Sitemap" 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE found_in_sitemap=true GROUP BY dimension
UNION ALL
SELECT "Google" 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE found_in_log_summary=true GROUP BY dimension
UNION ALL
SELECT "Search Console" 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE found_in_google_search_console=true GROUP BY dimension
UNION ALL
SELECT "List" 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE found_in_list=true GROUP BY dimension
UNION ALL
SELECT "Analytics" 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE found_in_google_analytics=true GROUP BY dimension

This will structure data in the format required by the Venn Diagram community visualization.

Author

Alec Bertram
Alec Bertram

Alec is the Head of Product at DeepCrawl. He has 10 years' of experience in SEO, and works to make sure we're building the most valuable things we can for our users.

Choose a better way to grow

With tools that will help you realize your website’s true potential, and support to help you get there, growing your enterprise business online has never been so simple.

Book a Demo