DeepCrawl Log File Integration: Logz.io

Adam Gent

On 4th November 2019 • 15 min read

At DeepCrawl, we understand that log file analysis is one of the most critical data sources in search engine optimization. Log file data allows SEO teams to identify how search engine crawlers are accessing the website and can help troubleshoot crawling and indexing issues on a website.

To help our customers easily pull log file data into DeepCrawl, we have partnered with Logz.io.

Our team has written this guide to help customers answer the questions covered in the sections below.

 

What is Logz.io?

Logz.io is a company that provides log management and log analysis services. Their platform combines the ELK stack delivered as a cloud service with machine learning to derive new insights from machine data.

What is ELK?

“ELK stack” is an acronym describing a popular open-source log management platform consisting of three projects: Elasticsearch, Logstash, and Kibana.

ELK gives customers the ability to:

  1. Aggregate logs from all systems and applications
  2. Analyze these logs, and
  3. Create visualizations for application and infrastructure monitoring.

 

How does Logz.io work with DeepCrawl?

Logz.io and DeepCrawl work together as follows:

  1. An account is created in Logz.io.
  2. Log files are shipped from the web server to Logz.io.
  3. Log files are aggregated and stored in Logz.io.
  4. An API token is generated in the Logz.io account.
  5. This API token is then saved in DeepCrawl.
  6. The API token is then used in the set up of the Logz.io connection in DeepCrawl.
  7. A query is then created in DeepCrawl to fetch log file data through the API.
  8. DeepCrawl sends an authentication request to the API using the API token.
  9. Logz.io API accepts the token and allows DeepCrawl to start requesting log file data based on the query.
  10. The log file data is crawled, processed, and visualized in DeepCrawl, along with other data sources.
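The token exchange in steps 4–9 can be sketched as follows. The `/v1/search` endpoint and `X-API-TOKEN` header come from Logz.io's public API; the query field names (`agent`, `@timestamp`) and limits below are illustrative placeholders, not DeepCrawl's actual implementation:

```python
# Sketch of an authenticated request to the Logz.io search API.
# Field names in the query body are illustrative assumptions.
import json

LOGZIO_SEARCH_URL = "https://api.logz.io/v1/search"

def build_request(api_token, user_agent_regex, days=30, max_urls=10000):
    """Return the headers and Elasticsearch-style body a client
    would send to fetch matching log lines from Logz.io."""
    headers = {
        "X-API-TOKEN": api_token,   # token generated in the Logz.io account
        "Content-Type": "application/json",
    }
    body = {
        "size": max_urls,
        "query": {
            "bool": {
                "must": [
                    {"regexp": {"agent": user_agent_regex}},            # bot filter
                    {"range": {"@timestamp": {"gte": f"now-{days}d"}}}, # date range
                ]
            }
        },
    }
    return headers, json.dumps(body)

headers, body = build_request("my-token", ".*Googlebot.*")
# The request itself would then be sent with e.g.
# requests.post(LOGZIO_SEARCH_URL, headers=headers, data=body)
```

If Logz.io rejects the token, the request fails with an authentication error; otherwise the response contains the matching log documents.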

 

How to set up and configure Logz.io and DeepCrawl

When integrating with Logz.io, it is crucial to make sure:

  1. Log files are being shipped to the Logz.io ELK stack, and
  2. DeepCrawl is set up to access log files from Logz.io.

Let's go through these two steps in detail.

Shipping log files to Logz.io

Before DeepCrawl can crawl log files, they need to be shipped to the Logz.io ELK stack for storage and processing.
Logz.io provides a list of all the shipping solutions in their documentation and the dashboard.

Logz.io log shipping

Our team strongly recommends going through the shipping options with your engineering or IT teams to better understand how to ship logs into Logz.io for your web server.

The most common shipping methods for Logz.io are Filebeat and Amazon S3 fetcher.

Filebeat

Filebeat is a lightweight open-source log shipping agent installed on a customer's HTTP server. Logz.io recommends it because it is the easiest way to get logs into their system.

Logz.io also provides documentation on how to set up Filebeat for the most common web servers.
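As an illustration, a Filebeat configuration shipping nginx access logs to Logz.io typically looks something like the sketch below. The shipping token, listener address, and certificate path must come from the Filebeat wizard in your own Logz.io dashboard, so treat every value here as a placeholder:

```yaml
# Sketch of a filebeat.yml for shipping nginx access logs to Logz.io.
# All values are placeholders — copy the exact listener host/port,
# token, and CA certificate path from your Logz.io dashboard.
filebeat.inputs:
- type: log
  paths:
    - /var/log/nginx/access.log
  fields:
    logzio_codec: plain
    token: <YOUR-LOGZIO-SHIPPING-TOKEN>
    type: nginx_access
  fields_under_root: true

output.logstash:
  hosts: ["listener.logz.io:5015"]
  ssl:
    certificate_authorities: ["<PATH-TO-LOGZIO-CA-CERT>"]
```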

Amazon S3 fetcher

Certain Amazon Web Services (for example, CloudFront) can write log files to an S3 bucket, where the Logz.io Amazon S3 fetcher can be used to fetch the logs.

Our team recommends going through the documentation and technical set-up around this option with your engineering and/or IT teams.

Allowing DeepCrawl access to Logz.io

Once the log files have been shipped to the ELK stack in Logz.io, the DeepCrawl connection needs to be set up.

1. Go to the Connected Apps page in DeepCrawl.

Connect Logz.io with DeepCrawl

2. Click “Add Logz.io account” and navigate to Logz.io.

Enter Logz.io token

3. Go to the Logz.io dashboard and click on the cog icon in the top right. Go to Tools > API tokens.

Generate Logz.io token

4. Click on “Add API token” and create an API token for DeepCrawl.

Add Logz.io API token

5. Copy and paste the API token into the Logz.io account info on the Connected Apps page in DeepCrawl.

Add Logz.io API token to DeepCrawl

6. The API token will then be saved in the Connected Apps page in your account.

Logz.io Added to DeepCrawl

 

Adding Log File data to a crawl

1. To select log files as a data source from Logz.io, navigate to Step 2 in the advanced settings.

Logz.io Data Source

2. Scroll down to the Log Summary source, select Logz.io Log Manager, and click the “Add Logz.io Query” button.

Add Logz.io Query

3. The “Add Logz.io Query” button will open a query builder which, by default, contains pre-filled values for the most common server log file setup (more information about the values below).

Logz.io and DeepCrawl Query Builder

4. Once the query builder is configured, hit the save button to allow DeepCrawl to crawl URLs from the Logz.io API.

 

How to configure the query builder

The query builder can be used to customize the default values in DeepCrawl.

Query Builder Data Fields

The query builder requires review and editing to make sure that DeepCrawl can pull log file data from Logz.io.

Our team has described each field below, along with how to check that each one is correctly set up.

Base URL

The base URL value is the HTTP scheme (e.g. https://) and domain (e.g. www.deepcrawl.com) which will be prepended to relative URLs (e.g. /example.html) in the log file data.

If it is left blank, the primary domain in the project settings will be used by default.

Please make sure that the URLs in the log file data are from the primary domain you want to crawl; otherwise DeepCrawl will flag a large number of crawl errors.
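The way the base URL combines with relative URLs from the log data can be sketched with Python's standard library; the example URLs are illustrative:

```python
# How a base URL is combined with a relative URL from a log line.
from urllib.parse import urljoin

base_url = "https://www.deepcrawl.com"  # scheme + primary domain from the project
relative = "/example.html"              # path as it appears in the log file data

full_url = urljoin(base_url, relative)
print(full_url)  # https://www.deepcrawl.com/example.html
```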

Token

The Logz.io token value is the API token used to connect DeepCrawl to your Logz.io account.

Please make sure that the API token used is still active, and is created in the correct account.

Date Range

The date range value is measured in days; DeepCrawl will collect logs within the number of days entered in this field.

By default, the date range is set to 30 days. Check that your Logz.io account contains log file data within the date range used.

Desktop user-agent regex

This field tells DeepCrawl to fetch only data from Logz.io which matches a specific desktop user-agent string.

Our team has provided the regex for Googlebot desktop only hits below:

Mozilla/5.0 \(compatible; Googlebot/2.1; \+http://www.google.com/bot.html\)|Mozilla/5.0 AppleWebKit/537.36 \(KHTML, like Gecko; compatible; Googlebot/2.1; \+http://www.google.com/bot.html\) Safari/537.36

If you require customized user-agent strings, please get in contact with your customer success manager.

Mobile user-agent regex

This field tells DeepCrawl to fetch only data from Logz.io which matches a specific mobile user-agent string.

Our team has provided the regex for Googlebot mobile only hits below:

.*Android.*Googlebot.*

If you require customized user-agent strings, please get in contact with your customer success manager.
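Both regexes above can be sanity-checked against sample user-agent strings before a crawl; the UA strings below are illustrative examples of Googlebot desktop, Googlebot smartphone, and an ordinary browser:

```python
# Quick check that the user-agent regexes above match Googlebot hits only.
import re

# The desktop and mobile regexes exactly as given above.
desktop_pattern = (
    r"Mozilla/5.0 \(compatible; Googlebot/2.1; \+http://www.google.com/bot.html\)"
    r"|Mozilla/5.0 AppleWebKit/537.36 \(KHTML, like Gecko; compatible; "
    r"Googlebot/2.1; \+http://www.google.com/bot.html\) Safari/537.36"
)
mobile_pattern = r".*Android.*Googlebot.*"

googlebot_desktop = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
googlebot_mobile = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 "
                    "Mobile Safari/537.36 (compatible; Googlebot/2.1; "
                    "+http://www.google.com/bot.html)")
chrome_user = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/78.0 Safari/537.36")

assert re.search(desktop_pattern, googlebot_desktop)   # desktop bot matches
assert re.search(mobile_pattern, googlebot_mobile)     # mobile bot matches
assert not re.search(desktop_pattern, chrome_user)     # ordinary browser filtered out
```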

Max number of URLs

This is the maximum number of URLs you want to fetch from the Logz.io API.

Please be aware that this field will not override the total URL limit in the project settings.

URL field name

The URL field name is the name of the URL column in the Logz.io database. This field helps DeepCrawl look up the column that lists the requested relative URLs in Logz.io, and fetch the pages which appear in the log files.

By default, the query builder will look for a column called “request”. For most websites, this will allow DeepCrawl to look up the right column and fetch the relevant URL rows.

Query Builder URL Field Name

However, each website is unique, and different tech stacks can cause these columns to be named differently. This means that sometimes, the URL field name will need to be updated.

To do this, navigate to the Logz.io dashboard > Kibana > Discover.

Logz.io Kibana Discover

Click on the arrow icon next to the top row of the log file data.

Logz.io Kibana Discover Dropdown

The arrow icon will open up all the columns and data for that specific hit.

Logz.io Fields

In this drop-down, look for the column containing the URL that was requested in the log file data. Be careful not to mix it up with the source URL column.

Logz.io Request Fields

Once you have identified the URL, make a note of the name of the column. In the example screenshot below, this is “request”.

Logz.io Request Field

Go back to the query builder in DeepCrawl and make sure the URL field name matches the name of the column.

Query Build URL Field Name Field
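What the field-name lookup does can be sketched with a single log document of the kind you see when expanding a row in Kibana > Discover. The sample document and field names below are illustrative, not a guaranteed Logz.io schema:

```python
# One log document from Logz.io, as seen when expanding a row in Kibana.
# The configured field name selects which value becomes the crawled URL.
log_document = {
    "@timestamp": "2019-11-04T10:15:00.000Z",
    "request": "/blog/log-file-analysis",   # requested URL — what DeepCrawl needs
    "referrer": "https://www.google.com/",  # source URL — easy to confuse with it
    "agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "response": 200,
}

url_field_name = "request"           # value set in the query builder
print(log_document[url_field_name])  # /blog/log-file-analysis
```

If the query builder's field name does not match the column name in your documents, the lookup returns nothing, which is why the field sometimes needs updating.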

User agent field name

The user agent field name is the name of the user agent column in the Logz.io database. This field helps DeepCrawl look up the column that lists all the user agent strings in Logz.io and apply the user agent regex to filter out particular bot hits.

By default, the query builder will look for a column called “agent”. For most websites, this will allow DeepCrawl to look up the right column and fetch the relevant URLs with particular user agents.

Query Build Agent Name Field

However, each website is unique, and different tech stacks can cause these columns to be named differently. This means that sometimes the user agent field name will need to be updated.

To do this, navigate to the Logz.io dashboard > Kibana > Discover.

Logz.io Kibana Discover

Click on the arrow icon next to the top row of log file data.

Logz.io Kibana Discover Dropdown

The arrow will open up all the columns and data for that specific hit.

Logz.io Fields

In this drop-down, look for the column containing the user agent string that made the request in the log file data.

Logz.io Agent Field

Once you have identified the user agent column, please make a note of the name it is using. Go back to the query builder in DeepCrawl and make sure the user agent field name matches the name of the column.

Query Builder User Agent Field Name

 

Filtering log file data in the query builder

The DeepCrawl query builder can also filter data which is fetched from Logz.io using JSON.

DeepCrawl Query Filter

For example, you could filter on a specific domain or subdomain.
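As a sketch only, a filter restricting results to one subdomain might look like the Elasticsearch-style term filter below. The field name `vhost` is hypothetical, and the exact filter syntax the query builder accepts may differ, so confirm the format with the customer success team before using it:

```json
{
  "bool": {
    "must": [
      { "term": { "vhost": "blog.example.com" } }
    ]
  }
}
```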

Our team recommends getting in touch with the customer success team if you want to filter using JSON.

 

Frequently Asked Questions

Should I run a sample crawl?

Yes, our team always recommends running a sample crawl when a new Logz.io query has been set up as a crawl source.

Running a sample crawl will prevent you from having to wait for a massive crawl to finish only to discover there was an issue with the settings. It also helps to reduce the number of wasted credits.

Why is DeepCrawl not pulling in log file data?

The most common reasons why DeepCrawl may not be pulling in log file data are covered in the sections above: an inactive or incorrect API token, a date range that contains no log file data, or URL and user agent field names that do not match the columns in Logz.io.

If log files are still not being pulled in after these issues have been resolved, then we recommend getting in touch with your customer success manager.

Why is DeepCrawl not able to crawl log file data?

Sometimes log file data is being pulled in correctly, but due to other issues, the crawl still fails.

Our team also recommends reading the how to debug blocked crawls and how to fix failed website crawls documentation.

 

Further questions on Logz.io?

If you’re still having trouble setting up Logz.io, then please don’t hesitate to get in touch with our support team.

Author

Adam Gent

Search Engine Optimisation (SEO) professional with over 8 years’ experience in the search marketing industry. I have worked with a range of client campaigns over the years, from small and medium-sized enterprises to FTSE 100 global high-street brands.
