Our BigQuery connector gives you easy access to run SQL queries across all of the crawl data in your account. We’ve structured the data to be flexible, and have listed some easy SQL queries on this page to allow you to quickly find the data you need. Data format Each DeepCrawl project is a BigQuery dataset which uses its ID as its name (such as project_123456), each crawl is a table inside this dataset – each table is named crawl_urls_[DATE]-[crawl ID] such as ‘crawl_urls_20201225_654321‘. The crawl data itself contains all of the data that we know about each URL in the crawl. Permissions To ensure that only the right people have access to the right data, we share access with a specific group of users on a per-project basis. This way, you can share a given project with a single client, and no other clients will be able to access that data. Finding your Project ID You can find your project ID in the URL of a crawl report – the format of our URLs is https://app.deepcrawl.com/accounts/:account_id/projects/:project_id/crawls/:crawl_id, so in the URL https://app.deepcrawl.com/accounts/123/projects/456/crawls/789, the project ID is 456 Data Projects & Billing Projects Unless you request otherwise, your data will be saved into the BigQuery project “deepcrawl-bi” and shared with you. While you will have the permission to access this data, this project does not come with any querying capabilities. The querying capability will come from a separate Billing Project that will be provided to you (this is specific to your organization, so you should not share it). Therefore, when using data in the deepcrawl-bi project with any tool, you must specify the separate Billing project. If you require DeepCrawl data to be saved to your own BigQuery project or within your own GVPC, please discuss this with your CSM. DeepCrawl SQL Cheatsheet We’ve structured our data in a way that will allow you to get all of the insights that you need. Rather than manually selecting a given crawl, you can use the below queries to logically choose the table you want to work with (such as “latest crawl”, “all crawls from the last 30 days”). The connector comes with some default data views which internally use these queries, so in some cases, it is easier to use those. Default views The following views are available within all datasets in BigQuery: Latest Crawl: _crawl_urls_latest Previous Crawl: _crawl_urls_before_latest All data in last 30 days: _crawl_urls_30d All of the below examples assume that your project ID is 1234 Get all data from the latest crawl Use the view _crawl_urls_latest or the SQL query: SELECT * FROM 'deepcrawl-bi.project_1234.crawl_urls_*' WHERE _TABLE_SUFFIX = (SELECT MAX(_TABLE_SUFFIX) from 'deepcrawl-bi.project_1234.crawl_urls_*') Get all 404 pages from the latest crawl in a given project SELECT url, page_title, http_status_code, search_console_total_clicks FROM 'deepcrawl-bi.project_1234.crawl_urls_*' WHERE http_status_code = 404 AND _TABLE_SUFFIX = (SELECT MAX(_TABLE_SUFFIX) from 'deepcrawl-bi.project_1234.crawl_urls_*') Get all data from the previous crawl Use the view _crawl_urls_before_latest or the SQL query: SELECT * FROM 'deepcrawl-bi.project_1234.crawl_urls_*' WHERE _TABLE_SUFFIX = (SELECT DISTINCT _TABLE_SUFFIX FROM 'deepcrawl-bi.project_1234.crawl_urls_*' ORDER BY _TABLE_SUFFIX DESC LIMIT 1 OFFSET 1) Get all data from all crawls in the past 30 days Use the view _crawl_urls_30d or the SQL query: SELECT * FROM 'deepcrawl-bi.project_1234.crawl_urls_*' WHERE _TABLE_SUFFIX > CAST(DATETIME_ADD(CURRENT_DATETIME(), INTERVAL -30 DAY) AS STRING) Get the number of 404 pages for each crawl in the past 30 days SELECT CAST(crawl_id AS STRING) as crawl_id, CAST(MAX(crawl_datetime) AS TIMESTAMP) as datetime, COUNT(DISTINCT url) as count_urls, FROM 'deepcrawl-bi.project_1234.crawl_urls_*' WHERE http_status_code = 404 AND CAST(crawl_datetime AS TIMESTAMP) > CAST(DATETIME_ADD(CURRENT_DATETIME(), INTERVAL -30 DAY) AS TIMESTAMP) GROUP BY _TABLE_SUFFIX, crawl_id ORDER BY _TABLE_SUFFIX DESC Find URLs where the page title has changed since the last crawl SELECT latest.url, latest.page_title 'current_page_title', previous.page_title 'previous_page_title' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' latest, 'deepcrawl-bi.project_1234._crawl_urls_before_latest' previous WHERE latest.url_digest=previous.url_digest AND latest.page_title != previous.page_title Find the number of URLs crawled per day for the past 30 days SELECT COUNT(rownum), SUBSTR(crawl_datetime,0,10) FROM 'deepcrawl-bi.project_1234._crawl_urls_30d' GROUP BY SUBSTR(crawl_datetime,0,10) Create a Venn Diagram of where URLs were found (in Data Studio) SELECT REGEXP_REPLACE(CONCAT(IF(found_in_web_crawl=TRUE,"DeepCrawl, ",""),IF(found_in_sitemap=TRUE,"Sitemap, ",""),IF(found_in_google_analytics=TRUE,"Analytics, ",""), IF(found_in_log_summary=TRUE,"Google, ",""),IF(found_in_google_search_console=TRUE,"Search Console, ",""),IF(found_in_list=TRUE,"List","")), ', $', "") 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE SAFE_ADD(SAFE_ADD(SAFE_ADD(SAFE_ADD(if(found_in_web_crawl=TRUE,1,0),if(found_in_sitemap=TRUE,1,0)),if(found_in_log_summary=TRUE,1,0)),if(found_in_google_search_console=TRUE,1,0)),if(found_in_list=TRUE,1,0)) > 1 GROUP BY dimension UNION ALL SELECT "DeepCrawl" 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE found_in_web_crawl=true GROUP BY dimension UNION ALL SELECT "Sitemap" 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE found_in_sitemap=true GROUP BY dimension UNION ALL SELECT "Google" 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE found_in_log_summary=true GROUP BY dimension UNION ALL SELECT "Search Console" 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE found_in_google_search_console=true GROUP BY dimension UNION ALL SELECT "List" 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE found_in_list=true GROUP BY dimension UNION ALL SELECT "Analytics" 'dimension', COUNT(rownum) 'metric' FROM 'deepcrawl-bi.project_1234._crawl_urls_latest' WHERE found_in_google_analytics=true GROUP BY dimension This will structure data in the format required by the Venn Diagram community visualization.