At DeepCrawl we track over 250 metrics to help our users understand their website. In this guide, we explain how our systems work and how we store the information from your crawl data.
If you just want to jump straight to exploring our reports and metrics, skip to Explore DeepCrawl Reports below.
What are metrics in DeepCrawl?
A metric is a piece of information about a page, link, or sitemap that we have either extracted from a URL or calculated in our system (e.g. DeepRank).
Here are some examples of metrics we store around a URL:
- Title tag
- Meta robots tag
- HTTP Header
There are different levels of metrics which we calculate within the DeepCrawl system.
For example, Meta Noindex is a low-level true-or-false metric which tells you whether a page has the noindex meta tag. Indexable is a high-level metric which has to take several other metrics into account to be accurate (such as noindex tags, headers, canonicalisation, etc.). Once all of these metrics have been calculated, our system can identify whether a page is indexable or non-indexable.
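The relationship between low-level and high-level metrics can be sketched in a few lines of Python. This is only an illustration: the field names, the `PageMetrics` class, and the exact rules in `is_indexable` are assumptions made for the example, not DeepCrawl's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageMetrics:
    # Hypothetical low-level metrics; names are illustrative only.
    url: str
    status_code: int
    meta_noindex: bool
    header_noindex: bool
    canonical_url: Optional[str] = None


def is_indexable(page: PageMetrics) -> bool:
    """Combine several low-level metrics into one high-level verdict."""
    if page.status_code != 200:
        return False
    if page.meta_noindex or page.header_noindex:
        return False
    # Assumed rule: a canonical pointing at another URL makes the page non-indexable.
    if page.canonical_url is not None and page.canonical_url != page.url:
        return False
    return True
```

Each input is a simple fact about one page, while the result depends on all of them together, which is why a high-level metric can only be calculated after the low-level ones.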
For all pages fetched and processed in our system, we collect more than 300 metrics which include everything from a page's title to the number of Search Console impressions.
What are reports in DeepCrawl?
A report in DeepCrawl is a combination of different metrics. While a metric is an individual piece of information about a page, a report takes many metrics and their values into account.
For example, the Page Title metric is the title that we extracted from your page, but the Short Titles report is a list of URLs which have a short title and are indexable.
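One way to picture this is to treat a report as a filter over per-URL metrics. The sketch below is hypothetical: the dictionary keys and the `SHORT_TITLE_MAX_LENGTH` threshold are assumptions chosen for the example and do not reflect the values DeepCrawl uses.

```python
# Hypothetical threshold; the real cut-off for a "short" title is not stated here.
SHORT_TITLE_MAX_LENGTH = 10


def short_titles_report(pages):
    """Return the URLs that are indexable and have a short page title."""
    return [
        page["url"]
        for page in pages
        if page["indexable"] and len(page["page_title"]) < SHORT_TITLE_MAX_LENGTH
    ]


pages = [
    {"url": "/a", "page_title": "Home", "indexable": True},
    {"url": "/b", "page_title": "Home", "indexable": False},
    {"url": "/c", "page_title": "A long and descriptive page title", "indexable": True},
]
short_urls = short_titles_report(pages)
```

Here `/b` has a short title but is left out because it is not indexable, which shows how a report depends on more than one metric.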
Examples of reports in DeepCrawl:
- Noindex pages: Pages which have a meta robots or X-Robots-Tag noindex.
- Canonicalized pages: Pages whose canonical tag is not self-referencing.
- Primary pages: Indexable pages which are unique or the primary of a set of duplicates.
What are DeepCrawl’s datasources?
During our crawls, we collect information about URLs, links between those URLs, and sitemaps. Because these three types of data are so different from each other, we store them in separate datasources.
Pages and URLs
This datasource contains each URL and all metrics related to each URL. For example:
- Indexable pages
- Non-200 pages
- 301 Redirects
Links
This datasource contains each link and its related metrics. For example:
- Source URL
- Target URL
- Orphaned pages
It also contains links which have issues, such as broken links, links between protocols, and a few other cases.
We do not currently store every single link and its source that we see during a crawl as this is typically terabytes of data. If you are interested in all links between pages, look at Unique Links.
Unique Links
This datasource contains every unique link that we saw during the crawl. For example:
- Anchor text
- Target page data
- Primary sources
If your website has a navigation link to the homepage on every page of the website, then we will save that link once along with a count of the times we saw that link.
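The idea of saving a link once with a count can be sketched as follows. This is a minimal illustration, and it assumes that a link is "unique" by its target URL and anchor text; the key DeepCrawl actually uses is not stated in this guide, and the field names are hypothetical.

```python
from collections import Counter


def count_unique_links(links):
    """Collapse repeated links into one row each, with an instance count."""
    # Assumption: uniqueness is defined by (target URL, anchor text).
    counts = Counter((link["target_url"], link["anchor_text"]) for link in links)
    return [
        {"target_url": target, "anchor_text": anchor, "instances": n}
        for (target, anchor), n in counts.items()
    ]


links = [
    {"source_url": "/about", "target_url": "/", "anchor_text": "Home"},
    {"source_url": "/contact", "target_url": "/", "anchor_text": "Home"},
    {"source_url": "/", "target_url": "/about", "anchor_text": "About"},
]
unique_links = count_unique_links(links)
```

The two navigation links to the homepage collapse into a single row with a count of 2, so storage grows with the number of distinct links rather than with the number of pages.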
Sitemaps
This datasource includes information about the sitemaps we processed during the crawl. For example:
- Broken/disallowed sitemaps
- URL count in sitemaps
- Sitemap type
Explore DeepCrawl Reports
Click on the links below to understand the metrics behind specific reports. All reports have been grouped into datasources.
URLs and Pages | Links | Unique Links | Sitemaps
Using Reports and Metrics with the API
You can query URLs in the API using reports. There are two concepts to understand:
- Reports contain aggregate information: total is the count of URLs which match that report query, added is the number of new URLs in that report since the last crawl, and so on. This is available by calling /accounts/:account_id/projects/:project_id/crawls/:crawl_id/reports/:report_code_basic
- Report rows are the raw data of each URL (and relevant metrics) within that report. These can be accessed in the API using /accounts/:account_id/projects/:project_id/crawls/:crawl_id/reports/:report_code_basic/report_rows
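As a rough sketch, the two endpoint paths above can be built with a pair of small helpers. Only the path templates come from this guide; the function names and the example IDs and report code are hypothetical, and the API host, authentication, and response format are deliberately left out because they are not described here.

```python
def report_path(account_id, project_id, crawl_id, report_code_basic):
    """Path for a report's aggregate information (total, added, etc.)."""
    return (
        f"/accounts/{account_id}/projects/{project_id}"
        f"/crawls/{crawl_id}/reports/{report_code_basic}"
    )


def report_rows_path(account_id, project_id, crawl_id, report_code_basic):
    """Path for the raw rows (URLs and their metrics) within a report."""
    return report_path(account_id, project_id, crawl_id, report_code_basic) + "/report_rows"


# Hypothetical IDs and report code, for illustration only.
summary = report_path(1, 2, 3, "example_report_basic")
rows = report_rows_path(1, 2, 3, "example_report_basic")
```

Because the rows path is the report path with `/report_rows` appended, building it from `report_path` keeps the two in step.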