Scraping images with DeepCrawl

Mathieu Burgerhout
Mathieu Burgerhout

On 18th August 2015 • 6 min read

Last week I got an offer to dropship approximately 7,300 products from a supplier. The supplier said, he had a data feed I could use to build my store from. Sounds good right? Well, the naked truth was that the data feed was nothing more than a Magento product export. Sometimes that’s just fine, but this export was missing one quite important detail: all product images!

 

To import or not to import

I began to doubt whether to set up a store based on only these details. I mean, what kind of self-respecting webshop doesn’t have any product images? When I contacted the supplier, he promised he would send the missing images. Problem solved right? Not exactly. I received a WeTransfer link of 136MB. Unfortunately, the file that was draining my bandwidth, was in fact one big PDF. It was their offline catalogue, scanned and saved as PDF, sigh.

 

Ready, set, scrape!

So, what was I supposed to do with a potentially nice opportunity to set up a webshop in a great niche? Well, scrape it, of course! If they won’t give me the images, I’ll take them myself since I already had permission to use them anyway.
Before I began, I considered what I already knew:

I decided I had enough anchors to get going. Let the scraping commence!

 

Back to DeepCrawl

Last time I described (in Dutch) how I identified the taxonomy on certain pages, using DeepCrawl. Since I was in need (again) for some serious crawling, what other choice than to use DeepCrawl again.

My SEO buddy, Monchito, mentioned why I preferred DeepCrawl over other SEO tools. Simple, DeepCrawl can actually extract data, whereas most others cannot.

I used the following settings in DeepCrawl:

  1. Only crawl product pages. Using positive restriction to only crawl the product pages and not all categories, filters, etc. Fortunately all product pages end with .html, whereas all categories end with a trailing slash. The positive restriction saved me a lot of time and valuable crawl credits. Only crawl product pages like this:

Included-only

  • Don’t crawl any filters. If I allowed all filters available to be crawled on the pages, I would run out of DeepCrawl credits. So, all Magento filters would have to go. Webshop owner, please be aware of these filters. Apply robot restrictions.
  • Mathieu's negative restriction

  • Get the product image URL’s. First I fiddled with Rubular to get my Regex right. It’s important to collect the URL of the image. On every page there is a link to the main product image, which is triggers a fancy lightbox. The class and ID are the same on every product page, which makes it easy to extract the image URL. We just have to match the syntax and extract the direct link to the image.
  • Mathieus-custom-extraction
     

    Wrapping it up

    Overnight DeepCrawl did what it does best – an excellent crawling job. The next morning I was greeted with an email telling me the “Crawl has finished”. Opening the extract report, I saw a number I was quite pleased with: 7,762. A tad more than I expected, but a quick look showed me that all static pages have the .html extension too. No problem, since I can look up only URL’s containing an SKU. That left me with 7,321 URL’s. Spot on with the amount of products in my Excel sheet.

    Mathieu-extract-report

    With the URL’s of the product pages along with the extracted image URL’s, I could match them to complete my own Excel sheet. Using some vertical lookups, I was able to pair the SKU’s supplied with the extracted product URL’s that conveniently contained the SKU (e.g. /product-name-129482.html). This enabled me to look up and match the right URL’s containing the images. Splendid!

    I then had a complete data sheet to import in a fresh new environment. From there I was able to experiment with some SEO and SEA stuff, try out some wicked CRO ideas, work on my link acquisition techniques and hopefully make some sales along the way. That makes me one happy robot.

     

    Closing up while my store is open for business

    So, I now have a fully operational shop (by now you must all be aware that I can’t disclose the website) following some quick and easy crawling. All thanks to DeepCrawl. Sure, I missed extracting the breadcrumbs, which resulted in me categorizing all the products by hand after all. I later found out that it was next to impossible since Magento don’t have actual breadcrumbs on product pages, but more something like click paths. But hey, maybe next time 🙂

    Author

    Mathieu Burgerhout
    Mathieu Burgerhout

    With over 10 years of SEO and Online Marketing experience my guarantee is to improve your business. I am a true specialist driven by extensive knowledge in various fields regarding content strategies, link building tactics, web analytics and web technology.

     

    Tags

    Get the knowledge and inspiration you need to build a profitable business - straight to your inbox.

    Subscribe today