Rendering has arrived in DeepCrawl

We've just released our page rendering service, meaning that we can crawl your websites and execute JavaScript in the same way that a search engine might. With a rendered crawl, we can analyse content and links that were injected or modified by scripts on your pages, and we've also added a couple of special features to radically improve what you can do with a crawl.

Activating JavaScript Rendering

To use rendering, you will first need to purchase the rendering add-on for your subscription. Once it is activated, simply tick the "Use JavaScript Rendering" box on Step 1 of the crawl setup.

Blocking analytics, advertising, and other tracking scripts from being rendered

Because our renderer uses an off-the-shelf version of Google Chrome to render pages, it will execute many analytics, advertising, and other tracking scripts during a crawl. This can inflate your reported traffic, as each rendered page registers a hit with those services.

By default, we block execution of some common analytics and advertising scripts such as Google Analytics, AdSense, and DoubleClick, but it is important that you block any other scripts which you do not want to be executed. You can do this in the Advanced Settings of your project by adding URLs or URL patterns to the "Custom rejections" setting.
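For example, if your pages load a chat widget or a heatmapping script that you don't want executed during the crawl, you could add patterns along these lines (the domains here are placeholders - substitute the scripts your own pages load):
*://widgets.examplechat.com/*
*://cdn.exampleheatmaps.com/tracker.js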

You should not rely on robots.txt files to block scripts - all important files should be blocked within the renderer settings.

Injecting Custom JavaScript

To allow for additional analysis and page manipulation, we've added a feature that lets you inject any JavaScript libraries or code that you like.

With this functionality, you could:
- Test changes to Javascript and content before they get made on your live website
- Inject and use testing suites - for instance, to run WCAG accessibility tests, confirm mobile-friendliness, or check for specific combinations of page elements
- Remove elements which may impact your crawl

To use this, simply add up to 10 JavaScript URLs to the "External JavaScript resources" setting, and/or JavaScript code to the "Custom JavaScript" input.
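For example, to cover the element-removal use case above, a one-line custom script can strip overlays that might interfere with analysis (the class names below are placeholders for whatever your site uses):
document.querySelectorAll('.cookie-banner, .newsletter-popup').forEach(function(e){ e.remove(); });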

If you want to complete some analysis, and see or extract the results in DeepCrawl, you should write the output to an element on the page, then extract it with a Custom Extraction.
For instance, create a new span and add the output of the analysis to it:
// 'wcag_test' is assumed to hold the results object produced by your injected test suite
var newSpan = document.createElement("span");
newSpan.innerText = "WCAGOUTPUT:" + JSON.stringify(wcag_test) + ":WCAGOUTPUT";
document.body.appendChild(newSpan);
Then extract it with a Custom Extraction using this regex:
<span>WCAGOUTPUT:(.*?):WCAGOUTPUT</span>
Using this method, you could complete almost any advanced custom analysis or extraction.
For instance, to determine a page's template based on several factors, you could inspect it via CSS selectors and output a value that DeepCrawl can extract:

// Classify the page template from the body class and the number of price elements
var body_css = document.body.className;
// A class selector is used here, since the checks below expect multiple possible matches
var price_element = document.querySelectorAll(".price_tag");
var page_type = "unknown";
if (body_css == "product_css" && price_element.length == 1) {
  page_type = "product";
} else if (body_css == "product_css" && price_element.length > 1) {
  page_type = "category";
} else if (body_css == "generic" && price_element.length > 0) {
  page_type = "homepage";
} else if (body_css == "generic" && price_element.length == 0) {
  page_type = "help_section";
}
// Write the result into the DOM so a Custom Extraction can pick it up
var newSpan = document.createElement("span");
newSpan.innerText = "PAGETYPEOUTPUT:" + page_type + ":PAGETYPEOUTPUT";
document.body.appendChild(newSpan);
Then extract it with a Custom Extraction using this regex:
<span>PAGETYPEOUTPUT:(.*?):PAGETYPEOUTPUT</span>

Rendering FAQs

Is rendered crawling slower than regular crawling?

Often, yes - there are several factors which can make rendered crawling slower than regular crawling:

- Rendered crawling takes longer because each page must download all of its resources, parse and execute them, and finish loading completely. We've observed that pages with a fetch time of under a second can take several seconds to render, depending on their complexity.
- Rendered crawling can put a higher load on your server, as each page typically makes several additional requests for resources and data. At a high crawl rate, this can slow down websites which do not have adequate infrastructure to handle the extra load.

What is the rendering timeout?

We will wait up to 10 seconds for a page to finish rendering - once 10 seconds have passed, we will take whatever has loaded in the page and use that for analysis. This typically happens when a page has an analytics heartbeat script which keeps the page 'alive' indefinitely.
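As an illustration, a heartbeat along these lines (the endpoint is hypothetical) generates constant network activity, so the page never appears 'finished' and rendering runs to the timeout:
// Fires a tracking ping every second, so the page never settles
setInterval(function () {
  navigator.sendBeacon("/analytics/heartbeat");
}, 1000);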

What is the rendering engine? Is it the same as Google's?

At the time of writing, we use Google Chrome v67 for rendering. This is similar to Google's rendering service, which is based on Chrome v41. We are unable to use the same version as Google, as several features which are critical to our service were not added until v59, and further improvements which make it easier for us to crawl arrived in v65. For security, and to keep up to date with web technologies, we will endeavour to upgrade our rendering engine whenever a stable update is available.

If your website uses cutting-edge JavaScript functionality, the version discrepancy may cause differences between the rendered pages that DeepCrawl sees and what Google sees. We are working on a way to list incompatible features that your pages are using.
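Until then, defensive feature detection can reduce the discrepancy. As a minimal sketch: the fetch API is missing from Chrome v41, so a page that guards it with a classic XMLHttpRequest fallback will load content in both engines:
// Chrome v41 has no window.fetch, so fall back to XMLHttpRequest
function loadJSON(url, callback) {
  if (typeof window.fetch === "function") {
    fetch(url).then(function (res) { return res.json(); }).then(callback);
  } else {
    var xhr = new XMLHttpRequest();
    xhr.open("GET", url);
    xhr.onload = function () { callback(JSON.parse(xhr.responseText)); };
    xhr.send();
  }
}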

Some DeepCrawl metrics break when I use DoubleClick Floodlight (or other scripts) with Google Tag Manager on my site

Rather than injecting external resources into pages (which we could block), Google Tag Manager occasionally creates iframes inside the page's <head>, which cannot be blocked by our renderer (even if the iframe's src URL is blocked, the <iframe> element is still created).

Because <iframe> is not a valid tag inside the page's <head>, our crawler (and search engine crawlers) can consider the <head> to have ended at that point. This means that any subsequent elements which must be in the <head>, such as canonical tags, rel=alternate, meta noindex, and others, are considered to be in the page's body, and so are ignored.

A universal way to fix this is to edit your Google Tag Manager configuration to prevent the DoubleClick Floodlight tag being added to the page when a search crawler's user agent visits the page. 

As a quick fix, you can use DeepCrawl's "Custom JavaScript" functionality to remove the iframe from the head during the render - this will not fix any issues that other crawlers may have, but it will hide the problem from your DeepCrawl analysis.
Add the following script to the "Custom JavaScript" section of your project's Advanced Settings:
// Remove any iframes injected into the <head> before the DOM snapshot is taken
document.head.querySelectorAll('iframe').forEach(function(e){ e.remove(); });

Static IPs, Geo IPs, and Custom DNS will not work with rendering

Because of the way that Chrome handles proxies and DNS, we are unable to perform a rendered crawl from a specified static IP in the way that we can for regular crawling. Instead, all requests from the rendered crawler will come from the address '52.5.118.182' - if you need to whitelist our crawler to allow crawling, you should add this IP address to your whitelist.

Your project's existing static IP may still be used to fetch some non-page resources, such as JavaScript, image, and CSS files.

Custom DNS settings do not currently work with rendering. Please contact your account manager for more information about the restrictions of DNS and rendering.

How does custom script injection work?

You can inject up to 10 external custom scripts - these will be injected into the page as it loads. The page and these additional resources must be fully loaded and rendered within the 10-second page timeout.

The "Custom JavaScript" code is executed after the page has been rendered, and is allowed to run for up to 5 seconds. All injected scripts should be synchronous - the renderer will not wait for a callback from an asynchronous function, and await will not work.
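As a minimal sketch of the difference (the /api/check endpoint is a placeholder): synchronous DOM writes land in the snapshot, while asynchronous results may arrive after it has been taken:
// Captured: runs synchronously, so the element exists when the snapshot is taken
var marker = document.createElement("span");
marker.innerText = "STATUS:pending:STATUS";
document.body.appendChild(marker);
// Not guaranteed: the renderer will not wait for this callback before snapshotting
fetch("/api/check").then(function (res) {
  marker.innerText = "STATUS:" + res.status + ":STATUS";
});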

If you use custom script injection, we will take a snapshot of the page's DOM after all scripts have been executed, meaning that DeepCrawl reports will reflect any changes that your custom scripts make to the page.

Does DeepCrawl detect Javascript redirects?

Yes, DeepCrawl detects and reports on redirects which were created with JavaScript (e.g. window.location = "/newpage").

However, we are unable to detect page state changes: a state change makes it look like the URL has changed when no new page has actually been loaded. If your website utilises state changes, you can make DeepCrawl detect these by turning them into a proper page change, by adding a custom script along the following lines to your project settings (this assumes your site stores the page's initial URL in history.state.startingURL):
// If the URL has changed since the page loaded, force a real navigation to the new URL
if (window.history.state && window.history.state.startingURL != window.location.href) {
  window.location = window.location.href;
}

What Analytics and Ad scripts are blocked by default?

When you activate the "Block analytics" and "Block ads" options, we will prevent the following scripts and domains from being executed by the renderer. You can add other scripts to be blocked in the Advanced Settings. 

Analytics
*://*.google-analytics.com/analytics.js
*://*.google-analytics.com/ga.js
*://*.google-analytics.com/urchin.js
*://*stats.g.doubleclick.net/dc.js
*://connect.facebook.net/*/sdk.js
*://platform.twitter.com/widgets.js
*://*.coremetrics.com/*.js
*://sb.scorecardresearch.com/beacon.js
quantserve.com
service.maxymiser.net
cdn.mxpnl.com
statse.webtrendslive.com
*://s.webtrends.com/js/*

Ads
*://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js
*://*.2mdn.net/ads/*
*://static.criteo.net/js/ld/*.js
*://widgets.outbrain.com/outbrain.js
*://*.g.doubleclick.net/*
*://c.amazon-adsystem.com/aax2/apstag.js
*://cdn.taboola.com/libtrc/dailymail-uk/loader.js
*://ib.adnxs.com/*
*://*.moatads.com/*/moatad.js
track.adform.net

What's next for JS rendering?

The ability to render pages during a crawl is only the first step. We see JavaScript rendering eventually becoming a flagship feature of DeepCrawl, so over the coming months we'll be testing and adding several new rendered analyses (think Lighthouse, mobile-friendliness, AMP validation), plus alerts when your pages are coded in a way that may impede even the best rendered crawlers.

If you have requests or ideas about what we should do, please get in touch.