DeepCrawl can now crawl JavaScript websites using our page rendering service (PRS) feature. This release makes it possible to analyse the technical health of JavaScript websites and Progressive Web Apps (PWAs).

Page Rendering Service (PRS)

DeepCrawl can use the page rendering service to execute JavaScript just like modern search engines. The page rendering service allows DeepCrawl to discover links and content which are rendered client-side using modern JavaScript libraries or frameworks.

The page rendering service uses the most up-to-date version of Google Chrome for rendering JavaScript. For security, and to ensure that we keep pace with web technologies, we update our rendering engine whenever a new stable release of Google Chrome is available.

For all the latest features that Google Chrome supports, we recommend referring to chromestatus.com or using the compare function on caniuse.com.

How is the PRS different from how Google renders JavaScript?

At the time of writing this guide, Google’s web rendering service (WRS) uses Chrome 41 to render web pages. DeepCrawl is unable to use Chrome 41 because the features which allow us to run Google Chrome as part of our crawling service (headless Chrome) were not available until Chrome 59.

As DeepCrawl uses a later version of Chrome than Google, there will be discrepancies between the web platform features and capabilities that DeepCrawl supports and what Google can render.

For a full list of features that Google Chrome 41 supports, we recommend referring to chromestatus.com or using the compare function on caniuse.com.

Google recently announced at the Chrome Summit 2018 that they are working on having WRS run alongside Chrome’s release schedule, so their renderer will always use the most up-to-date version of Chrome.

How is the PRS different from how Bing renders JavaScript?

Bing has officially announced that Bingbot renders JavaScript when it is encountered. However, it is difficult for Bingbot to render JavaScript at scale across the web, and, just like Google’s web rendering service, its rendering engine does not support the latest JavaScript frameworks.

Bingbot uses a customisable headless engine to render pages, which is difficult to map to a specific browser version. As DeepCrawl’s PRS uses the latest version of Google Chrome, there will be discrepancies between what DeepCrawl and what Bing can render.

If you wish to better understand whether Bingbot can render your pages, we recommend using the Bing mobile-friendly test tool, as it uses the same customisable rendering engine as Bingbot.

How does PRS work in DeepCrawl?

DeepCrawl is a cloud-based website crawler that follows links on a website or web app and takes snapshots of page-level technical SEO data.

Traditionally, DeepCrawl works as follows:
  1. Start URL(s) and URL data sources are inputted into the web crawler.
  2. The web crawler begins with the start URL(s).
  3. URLs are added to a crawl queue (list of URLs to crawl), and the priority of what to fetch is based on where it was found in the site.
  4. A URL is fetched from the queue, the raw HTML of a web document is parsed, and key SEO metrics are stored.
  5. Any links discovered in the raw HTML of the document are added to the crawl scheduler.
  6. All SEO metrics fetched by the crawler are passed to our transformer, which processes the SEO data and calculates metrics (e.g. DeepRank).
  7. Once the transformer has finished analysing the data, it is passed to the reporting API, and the technical reports in the DeepCrawl app are populated.
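As a rough illustration of steps 2 to 5 above, below is a minimal sketch of a raw-HTML crawl loop. It is not DeepCrawl’s actual implementation; it assumes Node.js 18 or later (so fetch is available) and uses a simple regular expression to find href values in the raw HTML:
// Minimal raw-HTML crawl loop (illustrative sketch only, not DeepCrawl's implementation).
// Assumes Node.js 18+ so that fetch() is available globally.
const queue = ['https://example.com/']; // start URL(s)
const seen = new Set(queue);
async function crawl() {
  while (queue.length > 0) {
    const url = queue.shift(); // take the next URL from the crawl queue
    const rawHtml = await (await fetch(url)).text(); // fetch the raw HTML only, no rendering
    // ...parse the raw HTML and store key SEO metrics here...
    // Discover links in the raw HTML and add any new URLs to the crawl queue.
    for (const match of rawHtml.matchAll(/<a\s[^>]*href="([^"]+)"/gi)) {
      const link = new URL(match[1], url).href; // resolve relative links against the page URL
      if (!seen.has(link)) { seen.add(link); queue.push(link); }
    }
  }
}
crawl();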
For traditional web crawlers like DeepCrawl, a website which relies on JavaScript presents a problem: links, metadata, and content are not present in the raw HTML, but are loaded client-side into the Document Object Model (DOM). This means important SEO metrics and links can be missed.

To make sure that DeepCrawl can understand modern websites and web applications which use JavaScript, the crawler uses the page rendering service. When the DeepCrawl crawler fetches a web document, it requests the page using the PRS (Google Chrome). Once the page is fetched, it is rendered, and the PRS waits for up to 10 seconds for the DOM to load before grabbing the rendered HTML. The rendered HTML is then parsed, SEO metrics are stored, and any anchor links with an href attribute are added to the crawl queue.

The page rendering service works in DeepCrawl as follows:
  1. Start URL(s) and URL data sources are inputted into the web crawler.
  2. The web crawler begins with the start URL(s).
  3. URLs are added to a crawl queue (list of URLs to crawl), and the priority of what to fetch is based on levels.
  4. A URL at the start of the list of URLs to crawl is fetched using the PRS, which waits for up to 10 seconds for the page to load, and the crawler fetches both the raw HTML and the rendered HTML of the web document.
  5. The rendered HTML is parsed, and the SEO metrics are stored.
  6. Any links discovered in the rendered HTML of the document are added to the crawl scheduler.
  7. The crawl scheduler waits until all web documents on the same level have been found before the crawler begins crawling the next level (even if pages from deeper levels are already in the URL crawl queue).
  8. All SEO metrics fetched by the crawler are passed to our transformer, which processes the SEO data and calculates metrics (e.g. DeepRank).
  9. Once the transformer has finished analysing the data, it passes it to the reporting API, and the technical reports in the DeepCrawl app are populated.
This process allows us to render and crawl the DOM, which enables us to crawl websites that rely on JavaScript frameworks or libraries.
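As a sketch of how a rendered fetch differs from a raw one, the snippet below uses headless Chrome via the Puppeteer library to load a page, wait up to 10 seconds for the page to settle, and then collect the rendered HTML and any anchor links with an href attribute. This is an illustration of the technique only, not DeepCrawl’s internal code:
// Illustrative rendered fetch using headless Chrome (Puppeteer), not DeepCrawl's internal code.
const puppeteer = require('puppeteer');
async function renderAndExtract(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait up to 10 seconds for the page and its client-side rendering to settle.
  await page.goto(url, { waitUntil: 'networkidle0', timeout: 10000 });
  const renderedHtml = await page.content(); // rendered HTML after JavaScript has executed
  const links = await page.$$eval('a[href]', anchors => anchors.map(a => a.href)); // only <a> elements with an href
  await browser.close();
  return { renderedHtml, links };
}
renderAndExtract('https://example.com/').then(result => console.log(result.links));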

PRS and DeepCrawl crawl limits

Please be aware that when the PRS loads a page, it can request multiple resources per page (JavaScript, CSS, images) to render the DOM. If the crawl speed is set too high, this could overload a website’s server, as DeepCrawl will be making a high number of requests.
Our team recommends running a test crawl first to make sure that the website’s server is not overloaded.

If you are unsure of what speed to set the crawler, please contact our Customer Success team using the help portal in the DeepCrawl app.

PRS and links

It’s important to remember that JavaScript sites need to follow current link architecture best practices to make sure that the PRS can discover and crawl them.

DeepCrawl will only discover and follow links which are generated by JavaScript if they are in an <a> HTML element with an href attribute.

Examples of links that DeepCrawl will follow:
<a href="https://crawl.com"></a>
<a href="/relative/bot/file"></a>
<a href="relative/bot/file"></a>
Examples of links the PRS will not follow (by default):
<a routerLink="crawl/bots">
<span href="https://deepcrawl.com">
<a onclick="goto('https://deepcrawl.com')">
This is in line with current SEO best practice and what Google recommends in its Search Console help documentation.
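To illustrate why only the first set of examples is followed, a link-discovery step over the rendered DOM typically queries for anchor elements that carry an href attribute, along the lines of this simplified sketch (not DeepCrawl’s exact logic):
// Runs against the rendered page: collects only <a> elements that have an href attribute.
const followableLinks = Array.from(document.querySelectorAll('a[href]')).map(a => a.href);
// <a routerLink="...">, <span href="..."> and onclick handlers never match the a[href] selector,
// so links expressed in those ways are not discovered.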

PRS and dynamic content

It is essential to understand that rendered HTML elements which require user interaction will not be picked up by the PRS. So any critical navigational elements or content which do not appear in the DOM until a user clicks or gives consent will not be captured by DeepCrawl.

Examples of dynamic elements the PRS will not pick up:
  • onclick events
  • onmouseover and onmouseout events
  • Deferred loading of page elements (lazy loading)
This default behaviour is in line with how Google currently handles events after a page has loaded. For further information, refer to Google’s documentation on how it handles JavaScript.
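For example, content that is only injected into the DOM by an event handler, as in the hypothetical page script below, will never appear in the rendered HTML captured by the PRS, because the PRS does not click, hover, or scroll:
// Hypothetical page script: this content only exists in the DOM after a user clicks.
document.querySelector('#show-more').addEventListener('click', () => {
  const extra = document.createElement('div');
  extra.textContent = 'Extra product details loaded on demand';
  document.body.appendChild(extra); // never runs during a PRS render, so this content is not captured
});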

PRS is stateless when crawling pages

When the PRS renders a page, it is stateless by default, meaning that:
  • Local storage data is cleared when each page is rendered.
  • HTTP cookies are not accepted when the page is rendered.
This also means that, by default, any content which requires users to accept cookies will not be rendered by the PRS. This is in line with Google’s own web rendering service specifications.
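For example, a page that only renders its main content after checking for a consent cookie (a hypothetical snippet) will always show the fallback to the PRS, because no cookies are stored between or during page loads:
// Hypothetical page script: main content is gated on a cookie the PRS never stores.
if (document.cookie.includes('cookie_consent=true')) {
  document.body.insertAdjacentHTML('beforeend', '<main>Full article content</main>');
} else {
  document.body.insertAdjacentHTML('beforeend', '<div id="consent-banner">Please accept cookies to continue</div>');
}
// Because the PRS does not accept cookies, the check always fails and only the banner is rendered.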

PRS declines permission requests

Any content which requires users to consent is declined by the page rendering service by default, for example:
  • Camera API
  • Geolocation API
  • Notifications API
This is in line with how Google’s web rendering service handles permission requests.
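For example, a geolocation request like the hypothetical snippet below will always take its error path under the PRS, so any content that depends on the granted permission is never added to the DOM:
// Hypothetical page script: the PRS declines the geolocation permission prompt,
// so the error callback runs and the location-dependent content is never rendered.
navigator.geolocation.getCurrentPosition(
  position => { document.body.insertAdjacentHTML('beforeend', '<p>Stores near you: ...</p>'); },
  error => { console.log('Geolocation denied: ' + error.message); }
);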

PRS static geo IP address

The PRS is unable to run a rendered crawl with a specified static geo IP address. All requests from the rendered crawler will come from the IP address 52.5.118.182, which is based in the United States.
 
If you need to whitelist us to allow crawling, you should add this IP address to your whitelist.

PRS and custom DNS

Custom DNS settings do not currently work with rendering. Please contact your account manager for more information about the restrictions of DNS and rendering.

PRS blocks analytics and ad scripts by default

By default, the PRS blocks common analytics and advertisement scripts. Because the PRS uses an off-the-shelf version of Chrome, it would otherwise execute many analytics, advertisement, and other tracking scripts during a crawl, which would inflate your data as hits are registered on every rendered page.

Analytics scripts blocked

A list of analytics tracking codes DeepCrawl blocks by default:
  • *//*.google-analytics.com/analytics.js
  • *//*.google-analytics.com/ga.js
  • *//*.google-analytics.com/urchin.js
  • *//*stats.g.doubleclick.net/dc.js
  • *//*connect.facebook.net/*/sdk.js
  • *//platform.twitter.com/widgets.js
  • *//*.coremetrics.com/*.js
  • *//sb.scorecardresearch.com/beacon.js
  • quantserve.com
  • service.maxymiser.net
  • cdn.mxpnl.com
  • statse.webtrendslive.com
  • *//s.webtrends.com/js/*
Advertisement scripts blocked

A list of advertisement tracking codes DeepCrawl blocks by default:
  • *//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js
  • *//*.2mdn.net/ads/*
  • *//static.criteo.net/js/ld/*.js
  • *//widgets.outbrain.com/outbrain.js
  • *//*.g.doubleclick.net/*
  • *//c.amazon-adsystem.com/aax2/apstag.js
  • *//cdn.taboola.com/libtrc/dailymail-uk/loader.js
  • *//ib.adnxs.com/*
  • *://*.moatads.com/*/moatad.js
  • track.adform.net
Any custom scripts which are not critical to rendering content should also be blocked. To block scripts, use advanced settings > spider settings > JavaScript rendering > custom rejections.
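As an illustration of how this kind of blocking works in headless Chrome, the sketch below uses Puppeteer’s request interception to abort any request whose URL matches a rejection pattern (the patterns shown are examples only, and this is not DeepCrawl’s configuration interface):
// Illustrative request blocking with Puppeteer; reuses a page object like the one in the earlier rendered-fetch sketch.
const blockedPatterns = [/google-analytics\.com/, /doubleclick\.net/, /connect\.facebook\.net/];
async function blockTrackingScripts(page) {
  await page.setRequestInterception(true);
  page.on('request', request => {
    if (blockedPatterns.some(pattern => pattern.test(request.url()))) {
      request.abort(); // the tracking script is never downloaded or executed
    } else {
      request.continue(); // everything else loads normally
    }
  });
}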

PRS allows custom script injection

The PRS allows custom scripts to be injected into a page while it is being rendered. This unique feature allows for additional analysis and web page manipulation.
The page rendering service allows custom scripts to be added by:
  • Adding up to 10 JavaScript URLs to “External JavaScript resources”
  • Adding a script to the “Custom JavaScript” input.
To pull data injected onto the page using custom injection, the script’s output needs to be written into the page and then extracted using the Custom Extraction feature.

This page rendering functionality allows users to:
  • Manipulate elements in the Document Object Model (DOM) of a page
  • Analyse and extract Chrome page load timings for each page
  • Create virtual crawls and change the behaviour of DeepCrawl
Learn more about using DeepCrawl custom script injection to collect Chrome page speed metrics.
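For example, an injected custom script along the lines of the sketch below (an illustration only; the element id is arbitrary) could write Chrome’s navigation timing data into the rendered page, where a Custom Extraction rule can then pick it up:
// Illustrative custom script: writes page load timings into the DOM for later Custom Extraction.
window.addEventListener('load', function () {
  const t = performance.timing;
  const el = document.createElement('span');
  el.id = 'deepcrawl-timings'; // arbitrary id to target with a Custom Extraction rule
  el.textContent = JSON.stringify({
    domContentLoaded: t.domContentLoadedEventEnd - t.navigationStart,
    pageLoad: t.loadEventStart - t.navigationStart
  });
  document.body.appendChild(el);
});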

PRS follows JavaScript redirects

The PRS detects and follows JavaScript redirects. They are treated like normal redirects when being processed and are shown in the Config > Redirects > JavaScript redirects report in DeepCrawl.
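For example, a client-side redirect such as this hypothetical snippet is executed during rendering and reported in the same way as a server-side redirect:
// Hypothetical JavaScript redirect that the PRS detects and follows during rendering.
window.location.replace('https://example.com/new-location');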

PRS not able to detect state changes

The PRS is unable to detect state changes by default.

If your website uses state changes, the PRS can detect them by turning each state change into a proper location change; to do this, add the following script in the “Custom Script” field.
if (window.history.state && window.history.state.startingURL != window.location) { window.location = document.location; }

PRS disables certain interfaces and capabilities

The PRS disables the following interfaces and capabilities in Google Chrome:
  • IndexedDB and WebSQL interfaces
  • Service Workers
  • WebGL interface
This is in line with Google’s web rendering service specifications when handling certain interfaces and capabilities.
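If a site depends on one of these interfaces, it should degrade gracefully when they are unavailable. For example, a hypothetical WebGL feature check like the one below will take its fallback branch when rendered by the PRS:
// Hypothetical feature detection: WebGL is disabled in the PRS, so getContext() returns null.
const canvas = document.createElement('canvas');
const gl = canvas.getContext('webgl');
if (gl) {
  // draw the interactive WebGL visualisation for real users
} else {
  document.body.insertAdjacentHTML('beforeend', '<p>Static fallback content for crawlers</p>');
}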

AJAX crawling scheme and DeepCrawl

At the time of writing, DeepCrawl still supports the AJAX crawling scheme. For more information, please read our 60-second DeepCrawl AJAX crawling guide on how to set this up.

Please be aware that even though DeepCrawl supports the AJAX crawling scheme, Google officially announced that they deprecated support for this crawling scheme in October 2015.

Both Google and Bing have recently recommended dynamic rendering to help JavaScript-generated content be crawled and indexed. If a site still relies on the AJAX crawling scheme, then dynamic rendering can help your JavaScript-powered website to be crawled and indexed.

Page Rendering Service Feedback

Our team sees the page rendering service as a flagship feature in DeepCrawl which will give us the ability to add new features like Chrome page load timings. If you have any requests or ideas about what we should be doing, then please get in touch.