Scrape Detailed Website Information - Stumped - Help!

Discussion in 'Black Hat SEO' started by Bojyy, Jul 30, 2015.

  1. Bojyy

    Bojyy Newbie

    Joined:
    Jul 24, 2015
    Messages:
    17
    Likes Received:
    3
    Hey guys.

    So I'm stumped and need your help...

    For example: iherb.com

    I'd like to crawl and capture the following data:

    1) UPC
    2) Product Codes
    3) Brand
    4) Description
    5) SRP
    6) Discounted Price
    7) Ratings
    8) Reviews
    9) Shipping Weight
    10) Package Quantity
    11) Dimensions
    12) Price Bundles
    13) Instock / Out of Stock

    UPC codes can be problematic in that many begin with a "0", and I need to capture any "0"s that may precede a value.
    I'd like to be able to run this either on demand or as a scheduled routine. If I could export or gather this information in an Excel sheet, that'd be amazing.
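    (To illustrate the leading-zero problem: in Excel a 12-digit UPC stored as a number silently drops the leading zero, so I'd need either the column formatted as Text or a padding formula. A rough sketch, assuming the raw value sits in A2:

        =TEXT(A2, "000000000000")

    That pads back out to 12 digits, so 12345678905 comes back as "012345678905".)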

    Does anyone have any insight on how I can achieve this or know any existing software? Thanks so much for taking the time!
     
  2. Atomic76

    Atomic76 Registered Member

    Joined:
    May 24, 2014
    Messages:
    67
    Likes Received:
    37
    You could possibly do it with the SEOTools plugin for Excel. It adds a function to Excel called XPathOnUrl which can scrape data from a URL. The formula takes two parts: the URL you want to scrape (which would be the product page) and the XPath of the page element you want to capture. There are tutorials out there that go into more detail on how to use this, including how to get the XPath with Chrome's Inspect Element feature and the minor tweaks you need to make when pasting it into the Excel formula. Incidentally, Google Sheets has a similar function baked in called ImportXML.
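
    A minimal sketch of both, assuming SEOTools is installed; the product URL and the XPath here are made-up placeholders, not iherb.com's real markup, so you'd pull the actual XPath from Inspect Element:

        =XPathOnUrl("http://www.iherb.com/pr/example-product", "//h1[@id='name']")

    and the Google Sheets equivalent:

        =IMPORTXML("http://www.iherb.com/pr/example-product", "//h1[@id='name']")

    Each returns the text of the element matching the XPath, so you'd set up one such formula per field (price, rating, etc.) with the right XPath for each.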

    As for how to get all the product page URLs, you could use Xenu Link Sleuth (free) to spider all the pages on the site, or better yet Screaming Frog SEO Spider (you'll need the paid version, since the free version only spiders 500 pages).

    As long as the product pages follow a similar URL structure, you should be able to filter the crawled URL list down to just the product pages pretty easily in Excel by matching a URL pattern. The product pages themselves will also need a consistent layout for this to work - if they don't all follow the same page structure and the data points you're trying to scrape aren't in the same spot within the HTML, this method won't work. XPath works by finding things relative to some parent HTML element, such as the 2nd paragraph within a DIV with a specific ID.
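
    For instance, sticking with the spreadsheet approach above, and assuming the product URLs sit in column A and each page has a (hypothetical) spec block with id "product-specs":

        =XPathOnUrl(A2, "//div[@id='product-specs']/p[2]")

    That grabs the 2nd paragraph inside that div on whatever page A2 points to. If the pages aren't laid out consistently, the same XPath hits different things on different pages - which is exactly the failure mode I mean.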

    Another option would be to use a program such as WebHarvy, which takes a different, point-and-click approach. There are tutorials on YouTube showing how the program works, but it's a much slower process.
     
  3. Bojyy

    Bojyy Newbie

    Joined:
    Jul 24, 2015
    Messages:
    17
    Likes Received:
    3
    Awesome, thank you for the information! I have a paid subscription to Screaming Frog anyway, so at least I have that covered. I'll be trying this out and I'll keep you posted on what works. I really appreciate the help.