
Search Engine Crawling


Now that you’ve got a top-level understanding of how search engines work, let’s delve deeper into the processes that search engines and web crawlers use to understand the web, starting with crawling.

 

What is search engine crawling?

Crawling is the process used by search engine web crawlers (bots or spiders) to visit and download a page and extract its links in order to discover additional pages.

Pages already known to the search engine are crawled periodically to determine whether their content has changed since the last crawl. If a search engine detects changes to a page after crawling it, it will update its index to reflect those changes.
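
To illustrate one way a crawler can check whether a known page has changed without always downloading it in full, the sketch below uses an HTTP conditional request (the If-None-Match header with a previously saved ETag). It is a simplified example built on the third-party requests library, with a hypothetical previous_etag value, and is not a description of how any particular search engine implements change detection.

    # Minimal sketch: re-crawling a known URL and checking whether it has
    # changed since the last crawl, using an HTTP conditional request.
    # `previous_etag` is a hypothetical value saved from the last crawl.
    import requests

    def recrawl(url, previous_etag=None):
        headers = {"User-Agent": "example-crawler/1.0"}
        if previous_etag:
            # Ask the server to return the page only if it has changed.
            headers["If-None-Match"] = previous_etag

        response = requests.get(url, headers=headers, timeout=10)

        if response.status_code == 304:
            # 304 Not Modified: the stored copy is still current.
            return None, previous_etag

        # The page changed (or no ETag was stored): return the new content
        # and the new ETag so the stored copy can be updated.
        return response.text, response.headers.get("ETag")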

 

How does web crawling work?

Search engines use their own web crawlers to discover and access web pages.

All commercial search engine crawlers begin crawling a website by downloading its robots.txt file, which contains rules about which pages on the site search engines should or should not crawl. The robots.txt file may also reference sitemaps, which list the URLs that the site wants search engine crawlers to crawl.
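
As a rough illustration of how a crawler can apply these rules, the sketch below uses Python’s standard-library robots.txt parser to check whether a URL may be fetched and to read any sitemap references; the example.com URLs are placeholders.

    # Sketch: reading a site's robots.txt before crawling, using Python's
    # standard-library parser. The example.com URLs are placeholders.
    from urllib import robotparser

    parser = robotparser.RobotFileParser()
    parser.set_url("https://www.example.com/robots.txt")
    parser.read()  # download and parse the robots.txt file

    # Is this user agent allowed to crawl this URL?
    print(parser.can_fetch("Googlebot", "https://www.example.com/private/page"))

    # Any Sitemap: lines listed in the robots.txt file (Python 3.8+).
    print(parser.site_maps())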

Search engine crawlers use a number of algorithms and rules to determine how frequently a page should be re-crawled and how many pages on a site should be indexed. For example, a page that changes on a regular basis may be crawled more frequently than one that is rarely modified.
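
Search engines do not publish these scheduling algorithms, but a toy heuristic shows the general idea: shorten the re-crawl interval when a page keeps changing and lengthen it when it does not. The function below is purely illustrative and is not any search engine’s actual logic.

    # Toy re-crawl scheduler: an illustrative heuristic only, not how
    # Google or any other search engine actually schedules crawls.
    def next_crawl_interval(current_interval_hours, page_changed):
        if page_changed:
            # Page changed since the last crawl: check back sooner.
            new_interval = current_interval_hours / 2
        else:
            # Page was unchanged: wait longer before the next visit.
            new_interval = current_interval_hours * 2
        # Keep the interval within sensible bounds (1 hour to 30 days).
        return max(1, min(new_interval, 24 * 30))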

 

How can search engine crawlers be identified?

The search engine bots crawling a website can be identified from the user agent string that they pass to the web server when requesting web pages.

Here are a few examples of user agent strings used by search engines:

  • Googlebot User Agent
    Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)
  • Bingbot User Agent
    Mozilla/5.0 (compatible; bingbot/2.0; +https://www.bing.com/bingbot.htm)
  • Baidu User Agent
    Mozilla/5.0 (compatible; Baiduspider/2.0; +https://www.baidu.com/search/spider.html)
  • Yandex User Agent
    Mozilla/5.0 (compatible; YandexBot/3.0; +https://yandex.com/bots)

Anyone can send requests using the same user agent strings as the search engines. However, the IP address that made the request can also be used to confirm that it genuinely came from the search engine, using a process called a reverse DNS lookup.
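
For example, Google documents that Googlebot requests can be verified by reverse-resolving the requesting IP address to a hostname and then forward-resolving that hostname to confirm it points back to the same IP. The sketch below follows that general procedure; the IP address shown is only a placeholder taken from a hypothetical log entry.

    # Sketch: verifying that a request claiming to be Googlebot really came
    # from Google, using a reverse DNS lookup plus a forward confirmation.
    import socket

    def is_googlebot(ip_address):
        try:
            # Reverse lookup: IP address -> hostname.
            hostname, _, _ = socket.gethostbyaddr(ip_address)
            if not hostname.endswith((".googlebot.com", ".google.com")):
                return False
            # Forward lookup: hostname -> IPs, which must include the original.
            resolved_ips = socket.gethostbyname_ex(hostname)[2]
            return ip_address in resolved_ips
        except (socket.herror, socket.gaierror):
            return False

    # Example usage with a placeholder IP address from the request logs.
    print(is_googlebot("66.249.66.1"))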

 

Crawling images and other non-text files

Search engines will normally attempt to crawl and index every URL that they encounter.

However, if the URL is a non-text file type such as an image, video or audio file, search engines can typically read little of the file’s content beyond the associated filename and metadata.

Although search engines can only extract a limited amount of information from non-text file types, these files can still be indexed, rank in search results and receive traffic.

A full list of file types that can be indexed by Google is available in Google’s Search documentation.

 

Crawling links

Crawlers discover new pages by re-crawling the pages they already know about and extracting the links to other pages to find new URLs. These new URLs are added to the crawl queue so that they can be downloaded later.

Through this process of following links, search engines are able to discover every publicly available webpage on the internet that is linked from at least one other page.
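
To make this process concrete, the sketch below implements a minimal crawl loop: take a URL from the queue, download it, extract its links, and add any newly discovered URLs back to the queue. It uses only the Python standard library and deliberately omits much of what a real crawler needs, such as robots.txt handling, politeness delays and URL canonicalization.

    # Minimal crawl loop: download a page, extract its links, and queue any
    # newly discovered URLs. A real crawler also respects robots.txt,
    # rate limits, canonicalization and much more.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    def crawl(seed_url, max_pages=10):
        queue = deque([seed_url])  # URLs waiting to be downloaded
        seen = {seed_url}          # URLs discovered so far
        crawled = 0
        while queue and crawled < max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except OSError:
                continue  # skip pages that cannot be downloaded
            crawled += 1
            extractor = LinkExtractor()
            extractor.feed(html)
            for link in extractor.links:
                absolute = urljoin(url, link)  # resolve relative links
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return seen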

 

Sitemaps

Another way that search engines can discover new pages is by crawling sitemaps.

Sitemaps contain sets of URLs and can be created by a website to provide search engines with a list of pages to be crawled. They can help search engines find content buried deep within a website and give webmasters more control over, and insight into, which areas of their site are indexed and how frequently they are re-crawled.
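
As a concrete illustration, an XML sitemap is simply a list of <url> entries inside a <urlset> element. The sketch below builds a minimal sitemap with Python’s standard library; the example.com URLs are placeholders, and real sitemaps often also include optional tags such as <lastmod>.

    # Sketch: building a minimal XML sitemap with Python's standard library.
    # The example.com URLs are placeholders.
    import xml.etree.ElementTree as ET

    def build_sitemap(urls):
        urlset = ET.Element(
            "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        )
        for page_url in urls:
            url_element = ET.SubElement(urlset, "url")
            ET.SubElement(url_element, "loc").text = page_url
        # A sitemap file served to search engines should also begin with an
        # XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>.
        return ET.tostring(urlset, encoding="unicode")

    print(build_sitemap([
        "https://www.example.com/",
        "https://www.example.com/blog/first-post",
    ]))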

 

Page submissions

Alternatively, individual page submissions can often be made directly to search engines via their respective interfaces. This manual method of page discovery can be used when new content is published on the site, or when changes have been made and you want to minimize the time it takes for search engines to see the updated content.

Google states that for large volumes of URLs you should use XML sitemaps, but the manual submission method can be convenient when submitting a handful of pages. It is also important to note that Google limits webmasters to 10 URL submissions per day.

Additionally, Google says that the response time for indexing is the same for sitemaps as for individual submissions.

Next Chapter: Search Engine Indexing


 

The Full Guide to How Search Engines Work:

  • How Do Search Engines Work?
  • How Search Engines Crawl Websites
  • How Does Search Engine Indexing Work?
  • What are the Differences Between Search Engines?
  • What is Crawl Budget?
  • What is Robots.txt? How is Robots.txt Used by Search Engines?
  • A Guide to Robots.txt Directives


 

Sam Marsden

SEO & Content Manager

Sam Marsden is Lumar's former SEO & Content Manager and currently Head of SEO at Busuu. Sam speaks regularly at marketing conferences, like SMX and BrightonSEO, and is a contributor to industry publications such as Search Engine Journal and State of Digital.
