Deepcrawl is now Lumar.

Search Engine Indexing

How search engine indexing works

What happens once a search engine has finished crawling a page? Let’s take a look at the indexing process that search engines use to store information about web pages, enabling them to quickly return relevant, high-quality results.

 

What’s the need for indexing by search engines?

Remember the days before the internet when you’d have to consult an encyclopedia to learn about the world and dig through the Yellow Pages to find a plumber? Even in the early days of the web, before search engines, we had to search through directories to retrieve information. What a time-consuming process. How did we ever have the patience?

Search engines have revolutionized information retrieval to the extent that users expect near-instantaneous responses to their search queries.

 

What is search engine indexing?

Indexing is the process by which search engines organize information before a search to enable super-fast responses to queries.

Searching through individual pages for keywords and topics at query time would be far too slow a way for search engines to identify relevant information. Instead, search engines (including Google) use an inverted index, also known as a reverse index.

 

What is an inverted index?

An inverted index is a system in which a database of text elements is compiled along with pointers to the documents that contain those elements. Search engines then use a process called tokenization to break text into individual terms, which are typically normalized to a base form (for example, by lowercasing or stemming), reducing the resources needed to store and retrieve data. This is much faster than scanning every known document for every relevant keyword and character.

An example of inverted indexing

Below is a very basic example that illustrates the concept of inverted indexing. In the example, you can see that each keyword (or token) is associated with a row of documents in which that element was identified.

Keyword | Document Path 1             | Document Path 2          | Document Path 3
SEO     | example.com/seo-tips        | moz.com                  |
HTTPS   | deepcrawl.co.uk/https-speed | example.com/https-future |

This example uses URLs but these might be document IDs instead depending on how the search engine is structured.
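The idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine is implemented: the page text is invented, the function names (tokenize, build_inverted_index, search) are hypothetical, and tokenization here is just lowercasing and splitting on non-alphanumeric characters.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


# Documents keyed by URL, as in the table above. The page text is invented.
documents = {
    "example.com/seo-tips": "Practical SEO tips",
    "moz.com": "SEO software and guides",
    "deepcrawl.co.uk/https-speed": "HTTPS and site speed",
    "example.com/https-future": "The future of HTTPS",
}

index = build_inverted_index(documents)
print(sorted(search(index, "HTTPS")))
print(sorted(search(index, "https speed")))
```

A query only touches the entries for its own tokens, so the lookup cost depends on the number of query terms and matching documents rather than on the total number of pages stored.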

 

The cached version of a page

In addition to indexing pages, search engines may also store a highly compressed text-only version of a document including all HTML and metadata.

The cached document is the latest snapshot of the page that the search engine has seen.

The cached version of a page can be accessed (in Google) by clicking the little green arrow next to each search result’s URL and selecting the cached option. Alternatively, you can use the ‘cache:’ Google search operator to view the cached version of the page.

Bing offers the same facility to view the cached version of a page via a green down arrow next to each search result but doesn’t currently support the ‘cache:’ search operator.

 

What is PageRank?

“PageRank” is a Google algorithm named after Google co-founder Larry Page (yes, really!). It assigns each page a value, calculated from the links pointing at that page, to determine the page’s value relative to every other page on the internet. The value passed by each individual link depends on the number and value of the links pointing to the page that contains it.
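Because each page's value depends on the value of the pages linking to it, the calculation is usually described as iterative. The sketch below is a simplified version of that idea, not Google's production algorithm: the three-page link graph, the damping factor of 0.85, the iteration count, and the handling of pages without outgoing links are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for a small link graph by repeated redistribution.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline value.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's value equally across its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a page with no links spreads its value over all pages.
                for other in pages:
                    new_ranks[other] += damping * ranks[page] / n
        ranks = new_ranks
    return ranks


# Hypothetical internal link graph.
graph = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
}

ranks = pagerank(graph)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {value:.3f}")
```

In this toy graph, "home" ends up with the highest value because both other pages link to it, and "blog" ends up lowest because only one page links to it.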

PageRank is just one of the many signals used within the large Google ranking algorithm.
An approximation of PageRank values was initially provided by Google, but these values are no longer publicly visible.

While PageRank is a Google term, all commercial search engines calculate and use an equivalent link equity metric. Some SEO tools try to estimate PageRank using their own logic and calculations, for example Page Authority in Moz tools, TrustFlow in Majestic, or URL Rating in Ahrefs. Lumar has a metric called DeepRank to measure the value of pages based on the internal links within a website.

 

How PageRank flows through pages

Pages pass PageRank, or link equity, through to other pages via links. When a page links to content elsewhere, the link is seen as a vote of confidence and trust: the linked content is being recommended as relevant and useful for users. The number of these links, together with how authoritative the linking website is, determines the relative PageRank of the linked-to page.

PageRank is equally divided across all discovered links on the page. For example, if your page has five links, each link would pass 20% of the page’s PageRank to its target page. Links that use the rel="nofollow" attribute do not pass PageRank.
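The arithmetic for a single page can be sketched as follows. The function name link_shares and the example URLs are hypothetical. The sketch also assumes that nofollow links still count toward the divisor while their share is simply not passed on, which is one reading of the description above rather than a confirmed detail.

```python
def link_shares(page_rank, links):
    """Split a page's PageRank equally across its discovered links.

    links is a list of (target_url, nofollow) pairs. Returns a dict of the
    value passed to each target that is linked without nofollow.
    """
    if not links:
        return {}
    # Assumption: every discovered link counts toward the divisor.
    share = page_rank / len(links)
    passed = {}
    for target, nofollow in links:
        if nofollow:
            # A nofollow link passes nothing to its target.
            continue
        passed[target] = passed.get(target, 0.0) + share
    return passed


# Hypothetical page with five links, one of them nofollow.
links = [
    ("example.com/a", False),
    ("example.com/b", False),
    ("example.com/c", False),
    ("example.com/d", False),
    ("example.com/e", True),
]

for target, value in link_shares(1.0, links).items():
    print(f"{target}: {value:.0%}")
```

Each of the four followed links passes 20% of the page's value, and the share attached to the nofollow link is not passed to any target.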

 

Backlinks are a cornerstone of how search engines understand the importance of a page. There have been many studies and tests performed to identify the correlation between backlinks and rankings.

Research into backlinks by Moz shows that, of the results for the top 50 Google search queries (~15,000 search results), 99.2% had at least one external backlink. On top of this, SEOs consistently rate backlinks as one of the most important ranking factors in surveys.



 

The Full Guide to How Search Engines Work:

How Do Search Engines Work?

How Search Engines Crawl Websites

How Does Search Engine Indexing Work?

What are the Differences Between Search Engines?

What is Crawl Budget?

What is Robots.txt? How is Robots.txt Used by Search Engines?

A Guide to Robots.txt Directives

 


Additional learning resources:

Indexability Best Practices (Lumar Website Intelligence Academy)


Sam Marsden

SEO & Content Manager

Sam Marsden is Lumar's former SEO & Content Manager and currently Head of SEO at Busuu. Sam speaks regularly at marketing conferences, like SMX and BrightonSEO, and is a contributor to industry publications such as Search Engine Journal and State of Digital.
