Scraping tool suggestion with custom HTML footprint.

Discussion in 'Black Hat SEO Tools' started by judson, Nov 16, 2012.

  1. judson

    judson Power Member

    Joined:
    Nov 29, 2009
    Messages:
    530
    Likes Received:
    319
    Occupation:
    Fulltime Newbie IM
    Location:
    Sub Ubi
    I have a list of URLs. Almost 1,000 in a text file.

    What I need to do is check the page that loads for each of those URLs.

    I do have a footprint I want to test for.

    The fly in the buttermilk is that the footprint is going to be something like <div class="blahblah" .... so it is in the page's HTML source, not something I can search for in Google.

    Can anyone suggest a tool that could help?
     
  2. crazyshark

    crazyshark Newbie

    Joined:
    Aug 5, 2012
    Messages:
    6
    Likes Received:
    0
    Hello, I am a bit new to BHW, but if you explain in some more detail I may be able to help you. I am a software guy and I build software easily. I think I have something up my sleeve that can help, but I am not sure I have understood your requirements correctly.
    Regards,
    - Will
     
  3. ikstob

    ikstob Junior Member

    Joined:
    Nov 12, 2012
    Messages:
    147
    Likes Received:
    129
    Location:
    ikstob.com
    I'm just about to release a free, open-source set of tools that can do exactly this. It's all written in Java, so it will work on any platform (Windows, Mac, Linux). It lets you pull pages directly or through proxies, runs multi-threaded, and supports sophisticated filters and expressions using jQuery-like selectors. For example, "div.blahblahblah" will match all DIV elements with the CSS class "blahblahblah".

    It should be ready/available this weekend if you are interested!
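
    To illustrate what a selector such as "div.blahblah" checks, here is a minimal sketch using only Python's standard library. It is not the Java tool described above, it handles only the simple "tag.class" form, and the names ClassFootprintParser and count_footprint are hypothetical.

```python
from html.parser import HTMLParser


class ClassFootprintParser(HTMLParser):
    """Counts start tags that match a simple 'tag.class' footprint."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag.lower()
        self.css_class = css_class
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != self.tag:
            return
        # The class attribute is a space-separated list; it may be missing
        # or have no value, in which case we treat it as empty.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.matches += 1


def count_footprint(html, selector):
    """Return how many elements in `html` match a 'tag.class' selector."""
    tag, _, css_class = selector.partition(".")
    parser = ClassFootprintParser(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.matches
```

    Calling count_footprint(page_html, "div.blahblah") returns the number of matching DIV elements, so any value above zero means the footprint is present. Unlike a plain text search, this still matches when the element carries additional classes or attributes in a different order.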
     
  4. judson

    judson Power Member

    Hey man.

    Thank you for this.

    I am definitely interested.

    The only other alternative I have is to wget the pages and then grep through them.
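
    The wget-then-grep approach could be sketched in Python roughly as follows. This is only an illustration under assumptions: the file name urls.txt, the function names, and the 10-second timeout are all made up for the example, and the check is a plain substring match, just as grep would do on the raw HTML.

```python
import urllib.request


def fetch(url, timeout=10):
    """Download a page and return its body as text (the 'wget' step)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def load_urls(path):
    """Read one URL per line from a text file, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def check_urls(urls, footprint, fetcher=fetch):
    """Map each URL to True/False (footprint found or not) or None on error.

    `fetcher` is injectable so the logic can be tried without network access.
    """
    results = {}
    for url in urls:
        try:
            # The 'grep' step: a literal substring search in the page source.
            results[url] = footprint in fetcher(url)
        except (OSError, ValueError):
            # Network failures and malformed URLs are recorded, not fatal.
            results[url] = None
    return results
```

    With a list of about 1,000 URLs this would be run as check_urls(load_urls("urls.txt"), '<div class="blahblah"'). It fetches pages one at a time, so it will be slow compared with a multi-threaded tool, but it needs nothing beyond a Python install.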