The State of the Web: Search-Friendly Pagination and Infinite Scroll

Pagination is one of the least understood topics in technical SEO, and further research is needed to better understand how to manage and optimize pagination to make it search friendly.

To get a better understanding of the common pitfalls in making pagination search friendly, we did a small crawl experiment and audited different web designs on 150 sites from the Alexa top 500 sites on the web.

For a deeper dive into how to manage pagination post rel=next and prev, read our pagination SEO best practice guide and watch the video of Adam’s Brighton SEO talk on the topic.

The top 500 sites on the web

Our team analysed 150 websites, taking 50 sites from each of the following categories within the Alexa top 500:

Shopping
News publishers
Forums

These categories were chosen because Google’s original pagination documents highlight them as the site types that most commonly use these designs. For each website, we chose a category or forum discussion at random. The only criterion was that the category on each site must have some form of pagination.

Crawling the Alexa top 500 websites

Once we identified the 150 websites and the categories we wanted to investigate, we then used Lumar to crawl each of the websites. For each one, we focused our crawling on specific categories with pagination, and only included pagination URLs. This allowed us to quickly identify the state of the pagination design.

When crawling each website, the following settings were used in the advanced settings in Lumar:

User-agent: Googlebot Smartphone
IP: United States
JavaScript rendering: Disabled

When crawling websites, we quickly realised that sites in the Alexa top 500 used separate mobile URLs. We had to alter the advanced settings to account for this and used the Mobile site setting in Lumar. This helped us identify mobile alternate URLs and see if mobile sites were properly configured (we’ll mention what we found in another post around mobile and pagination).

Once the websites were crawled, we analysed the data in Lumar and compared the state of each website’s pagination to the search friendly criteria.

Search friendly pagination benchmark

Our team created a benchmark for search friendly pagination using technical criteria based on Google’s official Infinite scroll search friendly recommendations and Introduction to Indexing documents. We had also used Google’s “indicate pagination” documentation, but this was removed during our testing.

The criteria for pagination to be categorized as search friendly were as follows:

Paginated components have unique URLs (e.g. /category-page?page=2).
Paginated components are discoverable through crawlable links on paginated pages.
Paginated pages within the series are indexable.
Indexable paginated pages do not contain duplicate content (based on Lumar).
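
The four criteria above work as an all-or-nothing checklist: a site only counts as search friendly if it passes every one. A minimal Python sketch of how such an audit result could be recorded is shown here. The class and field names are assumptions made for illustration and are not part of Lumar or any Google tool.

```python
from dataclasses import dataclass


@dataclass
class PaginationAudit:
    """Audit outcome for one site; all field names are hypothetical."""
    has_unique_urls: bool            # e.g. /category-page?page=2
    has_crawlable_links: bool        # <a href="..."> links between paginated pages
    is_indexable: bool               # no noindex, not canonicalized elsewhere
    has_no_duplicate_content: bool   # each indexable page has unique content

    def failed_criteria(self) -> list[str]:
        """Return the names of the criteria this site did not meet."""
        return [name for name, passed in vars(self).items() if not passed]

    def is_search_friendly(self) -> bool:
        """A site passes only if every criterion is met."""
        return not self.failed_criteria()
```

Recording which criteria failed, rather than a single pass or fail flag, makes it easier to group sites by issue type later on.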

When creating these criteria, we feared they were too basic and planned to make them more advanced. However, testing revealed that many websites did not pass a lot of the fundamentals. The positives of testing!

Also, the original search friendly criteria included rel=“next” and rel=“prev”. However, Google has announced that it no longer uses them as an indexing signal. This was later confirmed by Google Webmaster Trends Analyst John Mueller on Twitter.

We don’t use link-rel-next/prev at all.

— John (@JohnMu) March 21, 2019

For further information on what this means for SEO, read our technical SEO guide to pagination.

In terms of our benchmark criteria however, it did not really impact our data. We simply removed the rel=next and prev point from our pagination search-friendly criteria; everything else was kept the same.

The results of testing pagination

When analysing the data, we were surprised by some of the results and the state of certain pagination design types. We have summarised the results below, and have also broken down issues so that technical SEOs or site owners can get actionable insights from the results.

Overall results

Our overall results found that the pagination on 65% of the websites was not search-friendly, based on our criteria.

Pagination search friendly results

The number of websites that did not pass the search friendly criteria was a surprise, as the technical criteria were based on rudimentary SEO practices.

As well as checking if each website’s pagination was search friendly, each type of web design was also recorded. From our analysis, we found that there were three main types of web design:

Pagination
Infinite Scroll
Load more + lazy load

We felt that each of these falls under pagination because they all divide content across multiple pages. As we recorded each type of design, we noted that certain ones were more popular across all three categories.

The types of web designs found in crawl data

As you can see from the above graph, the results of this analysis found that pagination was the most popular type of technique to divide lists of articles or products across multiple pages. Load more + lazy load and infinite scroll (single page web designs) were also used to display content across multiple pages.

When segmenting search-friendliness by web design types, we found that pagination was the most search friendly of the three types.

However, as you can see from the graph below, both the single page web designs were the least search friendly based on the criteria.

Based on these results, it is important to drill into the crawl data for different types of web design to understand why certain types are not search friendly. This not only helps our team better understand common SEO errors* with pagination and infinite scroll, but it also helps anyone with these types of designs to understand how to avoid common SEO pitfalls.

*Note: These will not be all the SEO issues, but these are the most common issues we found in our data.

Pagination

Pagination is the process of dividing lists of articles or products across multiple page components. It is a common and widely used way to present long lists in a digestible format.

Example of pagination on Government website

We will now go through each of the search friendly criteria and highlight common SEO issues that caused many websites to fall below our standards.

Paginated components with unique URLs

Google requires URLs to be assigned to pages for content to be crawled and indexed. Without a URL, content or links associated with each paginated component cannot be discovered, crawled or indexed.

When analysing the websites with pagination, we found that only 5% did not match unique URLs to paginated components.

Results of pagination with unique URLs

For the handful of websites which did not pass this criterion, a common issue was that the paginated page content was mapped to fragment identifiers (#).

Website using fragment identifier in URL

Google has stated that any content that comes after a fragment identifier (#) cannot be crawled or indexed. When mapping paginated components to URLs, make sure that they use absolute URLs and not fragment identifiers to load paginated content.

Absolute URL – https://www.example.com/product-category/page=2
Fragment identifier (#) – https://www.example.com/product-category#page=2

Actions:

Make sure that paginated components have unique URLs (static or dynamic URLs).
Use a browser to inspect if the paginated URL has a fragment identifier (#).
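
As a rough illustration of the fragment identifier check, the following Python sketch uses the standard library's urllib.parse to detect a fragment and to show that two URLs differing only by fragment are the same document as far as a crawler is concerned. The function names are hypothetical and the URLs are the example.com placeholders used above.

```python
from urllib.parse import urlsplit


def uses_fragment_for_page(url: str) -> bool:
    """True if the URL carries a fragment identifier (#...)."""
    return urlsplit(url).fragment != ""


def same_document(url_a: str, url_b: str) -> bool:
    """True if two URLs differ at most by their fragment."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    # Compare scheme, netloc, path and query; ignore the fragment.
    return a[:4] == b[:4]


# Fragment-based "pages" collapse into one URL once the fragment is ignored.
print(same_document(
    "https://www.example.com/product-category#page=2",
    "https://www.example.com/product-category#page=3",
))
```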

Dynamic vs Static URLs

When analysing the 150 websites, our team noticed that many sites use different types of URLs for pagination:

Static URL – https://www.example.com/product-category/page2/
Dynamic (parameter) URL – https://www.example.com/product-category?page=2

Research found that there is no advantage to using one over the other for ranking or crawling purposes. However, our desktop research suggests that Googlebot does seem to guess and crawl URL patterns based on dynamic URLs (something to test in the future). So, if technical SEOs want Googlebot to discover pagination patterns through guessing, then it would be better to use dynamic URLs (parameters).

However, research found that using dynamic URLs can also cause crawling traps. When guessing the paginated URL patterns on a sample set of websites, we found that there were live duplicate dynamic pages which were not part of the current paginated series.

pagination with links

Here is an example of the unique URLs allocated to paginated components in the pagination above:

/clothing/dresses/ – first page in the series

/clothing/dresses/?page=2 – page 2

/clothing/dresses/?page=3 – page 3

/clothing/dresses/?page=4 – page 4

If you manually add parameters based on the URL pattern, then the CMS continues to load live empty pages:

/clothing/dresses/?page=5 – page 5

/clothing/dresses/?page=6 – page 6

/clothing/dresses/?page=7 – page 7

If Googlebot decided to crawl this URL pattern, then it may crawl an infinite number of paginated pages, which would waste crawl budget.

If you are using dynamic URLs, make sure that your website is not configured to produce a 200 HTTP status code for any dynamic paginated pages which are not part of the current series. If these pages are kept live, then Google might crawl and index duplicate or empty paginated pages.

Actions:

If you want Google to understand pagination URL patterns on the website, then use dynamic URLs, so it can pick up parameter URL patterns.
If you do use dynamic URLs, make sure that your website is configured to return a 4xx status code for any unwanted paginated URLs, so that they do not waste crawl budget.
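
One way to avoid serving live empty pages is to compute the last valid page from the number of list items and return a 4xx status for anything outside that range. The Python sketch below shows the idea only. The function name, the 404 choice and the default of 24 items per page are assumptions, and a real CMS or framework would have its own way of wiring this in.

```python
import math


def status_for_page(page: int, total_items: int, per_page: int = 24) -> int:
    """Return the HTTP status a listing could send for ?page=<page>."""
    # An empty category still has a (single) first page.
    last_page = max(1, math.ceil(total_items / per_page))
    if 1 <= page <= last_page:
        return 200
    # Outside the current series: do not serve an empty page with a 200.
    return 404
```

With 90 items and 24 per page the series has four pages, so ?page=4 would be served normally while ?page=5 and beyond would return a 404.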

Crawlable links

For Google to efficiently crawl paginated pages it needs to find anchor links with href attributes. It is critical in getting these pages crawled and indexed, especially with the recent announcement that the search engine no longer supports rel=“next” and rel=“prev” as an indexing signal.

With crawlable internal links being so important, it came as a surprise when we analysed the Alexa top sites on the web and found that 56% of sites with pagination did not properly use anchor links.

Results from websites with crawlable links

This result was concerning, especially as Google and other search engines rely heavily on internal linking within their crawling and indexing quality checks. To understand why a lot of websites did not meet this criterion, and other common issues, we dug into the pagination crawl data.

Missing anchor links

One of the most common issues we found was that links within the design did not use <a href=""> links in the source code or document object model (DOM) to link to paginated pages.

When checking the pagination links on the website, we found that they did not include anchor links. Instead, the website had been built so that content on paginated components was loaded via JavaScript.

Pagination with missing anchor links

When crawling the first page of the pagination with JavaScript rendering enabled in Lumar, none of the pages which were linked to on the website were found.

DeepCrawl graph when website is missing crawlable links

This is because even with JavaScript rendering enabled, both Lumar and Google require anchor links with href attributes to discover and crawl pages.

Actions:

When linking to paginated pages, make sure they are using anchor links with an href attribute within the link element in the source code.
Test if paginated pages include anchor links and href attribute using third-party crawlers like Lumar (which is designed to follow anchor links with an href attribute).
A quick way of identifying if Google can discover links without JavaScript is to right click on any page, view the page source (Ctrl + U) and try to find (Ctrl + F) the paginated URLs in the raw HTML.
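
The manual view-source check can also be scripted. The sketch below is a minimal example using Python's standard html.parser module to list every href found on an <a> element in raw, unrendered HTML. The class and function names are our own assumptions, and the sample markup is hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> element in raw HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:      # <a> without href is not a link
                self.hrefs.append(href)


def anchor_hrefs(raw_html: str) -> list[str]:
    """Return all anchor href values found in the given HTML source."""
    parser = AnchorCollector()
    parser.feed(raw_html)
    parser.close()
    return parser.hrefs


sample = (
    '<a href="/clothing/dresses/?page=2">2</a>'
    '<span onclick="loadPage(3)">3</span>'   # JavaScript-only, no anchor
)
print(anchor_hrefs(sample))
```

If the paginated URLs you expect are missing from the returned list, they are probably being injected by JavaScript or are not anchor links at all.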

href attributes using scripts not URLs

As well as finding missing anchor links, our team also discovered many websites had used event scripts (javascript:) instead of absolute or relative URLs in the href attribute.

Pagination with anchor links with event scripts

Pagination with anchor links using event scripts

This usually happens when developers want to have a link on the page, or make it look like an anchor link, but do not want to provide a URL. Instead, they use an event script which is triggered when the link is clicked, and paginated content is loaded to users.

The reason this is a crawlable link issue is that search engines require absolute or relative URLs in the href attribute to crawl pages. Without a URL in the <a href> link, search engines like Google will not discover or crawl internal links to paginated pages.

Again, just like the missing anchor links issue, it seems that a lot of websites with this problem are using JavaScript to load content to users.

Actions:

Review paginated links using the Inspect Element tool in Google Chrome to identify how developers have linked to paginated pages.
Use anchor links that use relative or absolute URLs in the href attribute to make sure that search engines can discover links to paginated pages.
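
To separate real URLs from event scripts in bulk, the href values can be classified by scheme. The following Python sketch is a rough heuristic under our own assumptions (the function name is hypothetical, and only http, https and relative URLs are treated as crawlable). It is not a description of how any search engine actually evaluates links.

```python
from typing import Optional
from urllib.parse import urlsplit


def is_crawlable_href(href: Optional[str]) -> bool:
    """Rough check: does this href hold a URL a crawler can request?"""
    if href is None:
        return False                      # <a> without an href attribute
    value = href.strip()
    if not value or value.startswith("#"):
        return False                      # empty, or fragment-only
    scheme = urlsplit(value).scheme.lower()
    # An empty scheme means a relative URL; javascript:, mailto: etc. fail.
    return scheme in ("", "http", "https")
```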

Paginated URLs blocked using /robots.txt

Finally, another common crawlability issue we discovered was that paginated URLs were blocked in the robots.txt file.

Pagination URLs blocked by robots.txt

If a URL is blocked in the /robots.txt file, then search engines cannot crawl the page and discover content or outbound links from that page. Many websites might want to block unimportant paginated pages, but if paginated pages are the only access points for traffic or revenue driving pages, then they are an important part of the website architecture.

Businesses and website owners should make sure that important paginated URLs are not accidentally blocked while also trying to block other dynamic or static URLs on the website.

Actions:

Use the Google Search Console robots.txt Tester to test if paginated URLs can be crawled by Googlebot.
Always test if paginated URLs have been accidentally blocked when disallowing a new parameter URL using the /robots.txt file.
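
A quick local sanity check before deploying a new rule can be sketched with Python's standard urllib.robotparser. Note the assumptions: this parser does plain prefix matching and does not implement Google's wildcard extensions (* and $), so it is only a rough stand-in for Google's own tester. The function name, the rule and the URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: list[str],
                 user_agent: str = "Googlebot") -> list[str]:
    """Return the URLs that the given robots.txt rules disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse rules without fetching
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# A hypothetical rule that accidentally catches static paginated URLs.
rules = "User-agent: *\nDisallow: /product-category/page\n"
print(blocked_urls(rules, [
    "https://www.example.com/product-category/",
    "https://www.example.com/product-category/page2/",
]))
```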

Indexable paginated pages

Paginated pages are important access points for search engines to discover and crawl deeper level pages. The reason that paginated pages should be indexed is due to how Google’s indexing system works, and what it (potentially) does when pages are excluded from the index.

Diagram of Google's crawling and indexing system

The diagram above is taken from a combination of slides from the Google I/O 2018 conference, as well as some good old-fashioned conversations with John at SMX Munich.

All URLs on the web which are crawled by Google go through the same selection process before a page is indexed (see diagram above). This selection process is called canonicalization, which is part of the indexing process, and it happens even if a site owner does not specify a canonical page. Any page that passes this selection process is called a Google-selected canonical URL. These canonical pages are used for:

The main source to evaluate page content.
The main source to evaluate page quality.
The main page to display in search results.

For pages that are excluded from the index, John Mueller, a Webmaster Trends Analyst at Google, has suggested that Google does not use them and does not follow the links on those pages.

“If we see the noindex there for longer than we think this this page really doesn’t want to be used in search so we will remove it completely. And then we won’t follow the links anyway. So, in noindex and follow is essentially kind of the same as a noindex, nofollow.” – John Mueller, Webmaster Trends Analyst at Google, Google Webmaster Hangout 15 Dec 2017

John also mentioned in discussions on Twitter after this announcement that Google will eventually drop all signals from an excluded page in its index.

“Nothing has changed there in a while (at least afaik); if we end up dropping a page from the index, we end up dropping everything from it. Noindex pages are sometimes soft-404s, which are like 404s.” – John Mueller, Webmaster Trends Analyst at Google, Twitter 28 December 2017

These comments suggest that if paginated pages are excluded from Google’s index (regardless of the method) for a long period of time, Google will eventually drop all outbound signals from those pages (their internal links and content).

As already mentioned, one of the most important functions of pagination on a website is to create access points for search engine crawlers to discover deeper level pages (products, articles, etc.).

Internal link structure for paginated pages

A quick test on Alexa websites, excluding paginated URLs from a web crawl, found that for certain websites, 30-50% of pages were reliant on pagination to be found by web crawlers through internal linking. An example of one of these websites is below.

Screenshot of include and exclude paginated URLs
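
The include/exclude comparison can be reproduced on any exported link graph. The Python sketch below is a simplified model under our own assumptions: the crawl is represented as an in-memory dictionary of URL to outbound links, the start URL is not itself a paginated URL, and the function names are hypothetical. It crawls the graph once normally and once while refusing to visit paginated URLs, then reports the share of non-paginated pages that were only reachable through pagination.

```python
from collections import deque
from typing import Callable


def reachable(start: str, links: dict[str, list[str]],
              skip: Callable[[str], bool] = lambda url: False) -> set[str]:
    """Breadth-first walk of a link graph that never visits skipped URLs."""
    seen, queue = {start}, deque([start])
    while queue:
        for target in links.get(queue.popleft(), []):
            if target not in seen and not skip(target):
                seen.add(target)
                queue.append(target)
    return seen


def share_reliant_on_pagination(start: str, links: dict[str, list[str]],
                                is_paginated: Callable[[str], bool]) -> float:
    """Fraction of non-paginated pages found only via paginated URLs."""
    with_pagination = {u for u in reachable(start, links) if not is_paginated(u)}
    without_pagination = reachable(start, links, skip=is_paginated)
    return len(with_pagination - without_pagination) / len(with_pagination)
```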

If businesses or site owners are excluding paginated URLs from Google’s index (using rel=canonical or noindex), then this could mean that Google will remove outbound signals (internal links) to deeper-level pages which are reliant on internal links from pagination.

In addition, if Google evaluates content based on the canonical pages indexed, then important signals like the link graph could also be based on Google-selected canonical URLs. If this theory is correct, then the deeper-level pages linked to from excluded paginated pages could effectively be orphaned and lose their ability to rank in search results.

Diagram of paginated pages removed from internal linking

To make sure internal links on paginated pages are followed, it is important to ensure that the paginated pages are indexed in Google.

When checking websites with pagination, we found that nearly 20% of paginated pages had been made non-indexable.

Results of indexable and non-indexable paginated pages

Again, we dug into the data to understand how site owners were causing their paginated pages to be non-indexable.

Canonicalized pages

The most common reason why paginated pages were non-indexable was that websites used a rel=canonical tag on paginated pages pointing back to the first page in the series.

Canonicalized paginated pages

As already mentioned, this will cause Google to exclude paginated pages from page 2 onwards from its index, and will mean that Google will not follow or count internal links to deeper level pages.

Actions:

Use third-party crawlers to detect noindex or canonicalized paginated pages on the website.
Inspect the index status of paginated page URLs using the URL Inspection tool.
Monitor the indexability of paginated pages using the Index Coverage Status report.
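
For a quick spot check outside of a crawler, the two signals discussed here (a noindex directive and a rel=canonical pointing at a different URL) can be read from a page's HTML. The Python sketch below uses only the standard library. The names are hypothetical, and it deliberately ignores other signals such as the X-Robots-Tag HTTP header, so treat it as a partial check rather than a full indexability test.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IndexabilityParser(HTMLParser):
    """Records the meta robots directives and rel=canonical target."""

    def __init__(self) -> None:
        super().__init__()
        self.robots = ""
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = {k: (v or "") for k, v in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            self.robots = attr.get("content", "").lower()
        elif tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href")


def non_indexable_reason(page_url: str, raw_html: str):
    """Return why a page looks non-indexable, or None if no signal is found."""
    parser = IndexabilityParser()
    parser.feed(raw_html)
    parser.close()
    if "noindex" in parser.robots:
        return "noindex"
    # Resolve relative canonicals against the page URL before comparing.
    if parser.canonical and urljoin(page_url, parser.canonical) != page_url:
        return "canonicalized"
    return None
```

A paginated page whose canonical tag points back to the first page in the series would be reported as "canonicalized", while a self-referencing canonical returns None.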

Duplicate content

The final factor is that paginated pages that a site owner wants indexed should contain unique content. The purpose of pagination is to be a list of articles or products which users can navigate. For paginated page content to be unique, each page needs to contain its own list items and not repeat list items that appear on other pages in the series.

Map URLs to paginated pages

When testing pagination, we found that 3% of paginated pages contained duplicate content (duplicate content detected by Lumar).

Results which found paginated pages with duplicate content

When digging into the data, duplicate content was one criterion that a lot of websites passed. This is great, as paginated pages will not be accidentally excluded from Google’s index because the content is duplicated.

For those paginated pages that did have duplicate content, it appears that the content management system (CMS) created duplicates of the primary paginated URL.

Duplicate content on dynamic URLs

The issue appeared to be that the website owner had not managed the parameters on their website and had not specified a canonical URL with the rel=canonical tag element. This is an example of where current SEO best practice of managing facets and filters on pagination still needs to be applied. Otherwise, Googlebot could waste crawl budget on duplicate content and could even choose to index duplicate content over the preferred URL.

Actions:

Use third-party tools like Lumar to discover duplicate paginated pages on a website.
Use duplicate or facet navigation SEO best practices to indicate the preferred paginated pages which should be used by Google as the canonical URL.
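
As a simplified illustration of what "unique list items" means in practice, the Python sketch below compares the list items found on each paginated URL. It is not how Lumar detects duplicate content. The function names, the item IDs and the sort=default parameter are hypothetical, and the input is assumed to be a mapping of URL to the item identifiers extracted from that page.

```python
from collections import defaultdict


def overlapping_items(pages: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each item that appears on more than one URL to those URLs."""
    seen_on = defaultdict(list)
    for url, items in pages.items():
        for item in set(items):
            seen_on[item].append(url)
    return {item: urls for item, urls in seen_on.items() if len(urls) > 1}


def duplicate_pages(pages: dict[str, list[str]]) -> list[list[str]]:
    """Group URLs whose item lists are identical, e.g. parameter variants."""
    groups = defaultdict(list)
    for url, items in pages.items():
        groups[tuple(items)].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]


pages = {
    "/clothing/dresses/?page=2": ["d25", "d26"],
    "/clothing/dresses/?page=2&sort=default": ["d25", "d26"],  # CMS duplicate
    "/clothing/dresses/?page=3": ["d27", "d28"],
}
print(duplicate_pages(pages))
```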

Load more + ‘lazy load’

Load more + ‘lazy load’ is a web design technique that divides lists of articles or products across multiple components, but defers loading each paginated component until a user requests it with a click event.

Lazy load web design

This web design technique to divide content across multiple pages was the second most popular in our crawl data. When comparing each website to the criteria, we found that 85% of the websites using load more + lazy load were not meeting the basic benchmarks.

This was an alarming figure when analysing the crawl data. As we went through each criterion, it was clear that a lot of websites with the load more design were missing the basic elements to make paginated pages discoverable.