How to Monitor XML Sitemaps

Adam Gent

On 10th April 2019 • 11 min read

DeepCrawl can be used to monitor XML Sitemaps by using the crawl schedule and task management features.

This technique allows users to:

  1. Schedule crawl projects to monitor multiple XML Sitemaps.
  2. Create prioritised tasks in the task management system in DeepCrawl.
  3. Get prioritised tasks emailed straight to a user's inbox.
  4. Monitor Sitemaps without logging into the DeepCrawl app.

Please read how to set up Sitemap audits for further information on the advantages and disadvantages of different Sitemap crawl projects.
 

Create a Separate Project

Create a new project and use the primary domain whose XML Sitemaps you want to monitor. For example:

Primary Domain: https://www.deepcrawl.com/

XML Sitemaps to monitor:

If pages in the Sitemaps use JavaScript to load links, metadata and content then we recommend selecting Enable Javascript rendering.

In the Sources settings only include the Sitemaps data source.

Then add the XML Sitemap(s) using the three available options:

Which option you choose to include XML Sitemaps in your crawl depends entirely on your technical setup. Our guide on how to add XML Sitemaps to your project provides further information on how to use each option.

Once the XML Sitemaps are added using the chosen option, configure the Crawl Limits and Advanced settings in the next steps.

Once all the settings are configured, it is time to run a test crawl.
 

Running a Test Crawl

Running a test crawl is essential because it helps identify any configuration issues or problems in the crawl project.

At the moment DeepCrawl does not show the HTTP status codes or URLs it is crawling. This can be an issue if the crawler is finding 4xx or 5xx error codes rather than actual pages that need to be crawled.
Once the settings of the crawl project have been configured, it is time to hit Save & Crawl.

Wait for the test crawl to run and finalise. How long it takes to crawl the Sitemaps depends on how many were added.
 

Results of the Crawl

Once the crawl has finalised, it is time to check if there are any errors. Use the following reports to check for 4xx or 5xx status codes:

A large number of 5xx errors could indicate that the website's server cannot handle the number of requests from DeepCrawl. If this is the case, then you may want to consider updating the Crawl Limit settings in the crawl project and discussing with your WebOps team when the best time to crawl the website is.
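DeepCrawl surfaces these status codes in its reports, but you can also spot-check a Sitemap yourself outside the tool. The sketch below is a minimal, hypothetical illustration: it parses a standard sitemaps.org-format Sitemap and flags 4xx/5xx URLs, assuming you supply the status codes yourself (in practice you would collect them with HEAD requests, e.g. via the requests library). The example URLs are invented.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml: str) -> list[str]:
    """Pull every <loc> URL out of a standard XML Sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

def find_error_urls(sitemap_xml: str, status_for: dict[str, int]) -> dict[str, int]:
    """Return the Sitemap URLs whose HTTP status is 4xx or 5xx.

    status_for maps each URL to its observed status code - in practice
    you would populate it by requesting each URL yourself.
    """
    return {
        url: code
        for url in extract_urls(sitemap_xml)
        if (code := status_for.get(url, 0)) >= 400
    }

# Hypothetical example: a two-URL Sitemap with one broken page.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/old-page</loc></url>
</urlset>"""

statuses = {"https://www.example.com/": 200,
            "https://www.example.com/old-page": 404}
print(find_error_urls(sitemap, statuses))
# {'https://www.example.com/old-page': 404}
```

A check like this is no substitute for a full crawl, but it can quickly confirm whether a Sitemap is full of error pages before you spend a crawl budget on it.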

As well as checking for URL errors, you can use DeepCrawl to identify whether the XML Sitemaps added are valid. This will help you understand whether the Sitemap SEO metrics reported in DeepCrawl are accurate.

Read our guide on how to check if Sitemaps are valid for further information.
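That guide covers DeepCrawl's own validation reports. As a rough illustration of what "valid" means, a Sitemap should be well-formed XML whose root element is <urlset> (or <sitemapindex>) in the sitemaps.org namespace, with a <loc> in every entry. The sketch below is an assumption-laden simplification of those checks, not DeepCrawl's actual validation logic:

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def validate_sitemap(xml_text: str) -> list[str]:
    """Return a list of validation problems (an empty list means it looks valid)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    # The root must be <urlset> or <sitemapindex> in the sitemaps.org namespace.
    if root.tag not in (f"{{{NS}}}urlset", f"{{{NS}}}sitemapindex"):
        problems.append(f"unexpected root element: {root.tag}")
    # Every <url>/<sitemap> entry needs a non-empty <loc>.
    for i, entry in enumerate(root):
        loc = entry.find(f"{{{NS}}}loc")
        if loc is None or not (loc.text or "").strip():
            problems.append(f"entry {i} has no <loc>")
    return problems

good = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        '<url><loc>https://www.example.com/</loc></url></urlset>')
bad = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url></url></urlset>'

print(validate_sitemap(good))  # []
print(validate_sitemap(bad))   # ['entry 0 has no <loc>']
```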
 

Identify Issues With XML Sitemaps

If there are no issues with the XML Sitemap crawl project, then the next step is to audit your Sitemaps and create tasks for any problems.

Read our guide on how to audit XML Sitemaps using DeepCrawl to identify potential issues with any XML Sitemaps added to the crawl project.

The most common reports in DeepCrawl which can show issues in XML Sitemaps are:

  200 Pages: URLs which return a 2xx status code.
  XML Sitemaps: Sitemap URLs discovered and crawled by DeepCrawl.
  Broken Sitemap Links: URLs in a Sitemap which return a 4xx or 5xx HTTP status code.
  All Redirects: URLs which return a 3xx HTTP status code or have a meta refresh tag.
  Canonicalized and Noindexed Pages: URLs which have a noindex tag (meta robots or X-robots) or a canonical tag which references another page.
  Disallowed/Malformed URLs in Sitemaps: URLs in a Sitemap which are disallowed in the /robots.txt file or are malformed.
  Broken/Disallowed Sitemaps: XML Sitemap URLs which returned a 4xx HTTP status code or were blocked in the robots.txt file.
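The "Disallowed/Malformed URLs in Sitemaps" check above can also be reproduced in miniature with Python's standard library, which is a handy way to sanity-check a Sitemap against robots.txt rules before a scheduled crawl runs. This is a hedged sketch using hypothetical URLs, not DeepCrawl's implementation:

```python
from urllib.robotparser import RobotFileParser

def disallowed_urls(robots_txt: str, urls: list[str],
                    user_agent: str = "*") -> list[str]:
    """Return the URLs that the given robots.txt rules disallow for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]

# Hypothetical robots.txt and Sitemap URLs.
robots = """User-agent: *
Disallow: /private/
"""
urls = ["https://www.example.com/page",
        "https://www.example.com/private/report"]
print(disallowed_urls(robots, urls))
# ['https://www.example.com/private/report']
```

Any URL this flags would also be wasted space in the Sitemap, since disallowed pages cannot be crawled by search engines.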

 

The reports listed above are just the most common ones used, and we encourage users to explore any report that helps uncover XML Sitemap issues unique to their website.

 

Create Tasks to Monitor Sitemap Issues

As each issue is identified, you can begin creating tasks in the DeepCrawl task management system.

When a task is created in the task management system, it is attached to a report in DeepCrawl. Creating tasks helps users monitor specific issues within the XML Sitemap crawl project, and updates on those issues can then be emailed straight to your inbox.

When an issue is identified, create a task by clicking the Task Manager icon in the top right, selecting Create a Task from the drop-down, and then:

  1. Add a title and description to the task.
  2. Add an email address where you want the tasks emailed.
  3. Set a priority based on your business and digital strategy goals.
  4. Set a deadline if applicable.
  5. Hit create.

Once all the issues and tasks have been added, you need to schedule your crawl project.

 

Schedule Crawl Project

Now schedule the crawl project so it runs automatically. Go to the Advanced settings in the crawl project → Schedule crawl option.

Select the time when the crawl will start and on what schedule you want the crawl to run. The schedule options include:

  1. Hourly
  2. Daily
  3. Weekly
  4. Fortnightly
  5. Monthly
  6. Quarterly

Use the schedule option which will help monitor changes alongside website updates.

For example, if your web development team have fortnightly sprints to push changes to the live website, then select the fortnightly option.

Then select a date and time for the crawl to start (e.g. 30/01/2019 at 04:00 am).

We'd then recommend setting the crawl time to just after any technical sprint. If it is set before the sprint finishes, you will miss changes or issues introduced to the XML Sitemaps.


The crawl will then run on this schedule from that date. For example, the next crawl will run on 13/02/2019 at 04:00 am. The time zone is shown in the settings.
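DeepCrawl works these dates out for you, but the arithmetic is simple: a fortnightly schedule is just the start date plus multiples of 14 days. A quick sketch confirming the dates from the example above:

```python
from datetime import datetime, timedelta

def next_runs(start: datetime, interval_days: int, count: int) -> list[datetime]:
    """List the upcoming crawl start times for a fixed-interval schedule."""
    return [start + timedelta(days=interval_days * i) for i in range(1, count + 1)]

start = datetime(2019, 1, 30, 4, 0)          # 30/01/2019 at 04:00
for run in next_runs(start, 14, 3):          # fortnightly = every 14 days
    print(run.strftime("%d/%m/%Y at %H:%M"))
# 13/02/2019 at 04:00
# 27/02/2019 at 04:00
# 13/03/2019 at 04:00
```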

If you wish to change the time zone, this can be done in the Account settings → Timezone.

That’s it! The monitoring crawl project for Sitemaps is set up.
 

Monitor Sitemaps From Your Inbox

Once the crawl project is scheduled, wait until the next crawl finalises; DeepCrawl will then send you an email with all the tasks you created in the app.

This email details all the tasks which are in the tasks tab. It highlights:

  1. Identified: Pages first identified when the task was created
  2. Remaining: All the pages found in the last crawl in the report
  3. Resolved: Number of pages no longer in the report
  4. Date task created: When the task was first created
  5. Title of task: Title inputted by the user
  6. Description of task: Description inputted by the user
  7. Emails: Emails added to tasks (who will receive this email)
  8. Priority: The priority of the task submitted by the user

These reports will be emailed to you every time DeepCrawl runs a crawl for the project.

If you notice an increase in the pages Remaining, or want to visit the report in DeepCrawl, click on the task title.

Clicking on the task title will send you directly to the report in DeepCrawl.

 

Summary

This technique can be used to monitor and schedule Sitemap crawls depending on your or your team's requirements. Using it allows users to:

  1. Schedule crawl projects to monitor multiple XML Sitemaps.
  2. Create prioritised tasks in the task management system in DeepCrawl.
  3. Get prioritised tasks emailed straight to a user’s inbox.
  4. Monitor Sitemaps without logging into the DeepCrawl app.

Author

Adam Gent

Search Engine Optimisation (SEO) professional with over 8 years’ experience in the search marketing industry. I have worked with a range of client campaigns over the years, from small and medium-sized enterprises to FTSE 100 global high-street brands.
