
What are the best tools for scraping?

Discussion in 'Black Hat SEO' started by cunning, Dec 29, 2011.

  1. cunning

    cunning Newbie

    Joined:
    Dec 29, 2011
    Messages:
    8
    Likes Received:
    0
    Occupation:
    Occupation
    Location:
    Auckland, New Zealand
    I want to build a site that checks a series of sites for news updates and structures any updates it finds.

    What are some good tools for web scraping / spidering?

    I have seen sphider and scrapy but was curious about others. I want to structure the data so that it can be used for analysis rather than just straight scraping.

    Also, what are some good RSS to SQL importers?
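    (Rough sketch of the kind of RSS-to-SQL import I mean, using SimpleXML and PDO; the feed URL, connection details and table name below are just placeholders:)

    PHP:
    // placeholder DSN, credentials and table; swap in your own
    $db = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');
    $stmt = $db->prepare('INSERT INTO news_items (title, link, pub_date) VALUES (?, ?, ?)');

    // placeholder feed URL
    $feed = simplexml_load_file('http://domain.com/feed.xml');
    foreach ($feed->channel->item as $item) {
        $stmt->execute(array(
            (string) $item->title,
            (string) $item->link,
            date('Y-m-d H:i:s', strtotime((string) $item->pubDate)),
        ));
    }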

    cheers!
     
  2. lolkittens

    lolkittens Newbie

    Joined:
    Dec 28, 2011
    Messages:
    19
    Likes Received:
    4
    If you know PHP/HTML then you can use "simple html dom", it's a nice data scraping library.

    If you just want software that will save to an SQL db file, then I would recommend "helium scraper".

    (Google what's in the quotes to find them.)
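    A minimal sketch of what simple html dom usage looks like (assumes you've grabbed simple_html_dom.php from their site; the URL is a placeholder):

    PHP:
    include 'simple_html_dom.php'; // the library file you download

    // load a page and pull out every link (placeholder URL)
    $html = file_get_html('http://domain.com');
    foreach ($html->find('a') as $link) {
        echo $link->href . ' -> ' . $link->plaintext . "\n";
    }
    $html->clear(); // free the parser's memory when you're done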
     
    • Thanks x 1
  3. phpbuilt

    phpbuilt Jr. VIP

    Joined:
    May 16, 2011
    Messages:
    1,650
    Likes Received:
    5,208
    Occupation:
    $ from websites I own.
    Location:
    putting monkeys in paypal
    If it's a web application, you definitely want to use PHP. Here is a little snippet of code ...

    PHP:
    function scanpage($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
        curl_setopt($ch, CURLOPT_HEADER, true);         // include the response headers in the output
        $output = curl_exec($ch);
        curl_close($ch);
        return $output;
    }
    That function will take any web page and turn it into a variable you can extract information from. You call the function like this ...

    PHP:
    $mypage = scanpage('http://domain.com');
    Now the $mypage variable has the entire contents of that web page inside it. Next use the following method to extract the exact info you want.

    Let's say the data you want always looks like this ...

    Code:
    <tr class="header"><td class="firstrow">Juicy information here</td><td class="secondrow">
    Next we'll cut out all the info in front of the data, then everything after it, leaving only the info we want. We'll use something called a "token": a single character that doesn't already exist on the page. In this case I'll use a caret (^).

    PHP:
    $newpage = str_replace('header"><td class="firstrow">', '^', $mypage); // replace the text before our data with a token
    $newpage = str_replace('</td><td class="secondrow', '^', $newpage); // stick another token after the data we want so we can segregate it
    $newpageE = explode("^", $newpage); // turn the page into an array of 3 values: everything before the data (don't want), the data itself (do want), and everything after (don't want)
    $mydata = $newpageE[1]; // the variable is now only the middle info, everything between the first and second token (^). Exactly what we wanted to extract
    // specifically, [1] is the 2nd value of the array (the middle). [0] would return the data before the first ^, and [2] everything after the last ^
    At this point you can take the $mydata variable and do anything you want with it. You can store it in a database and have it automatically show up on whatever page of your website you want.
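    A rough sketch of that last step with PDO (the connection details, table and column names are placeholders):

    PHP:
    $db = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');

    // save the extracted value
    $db->prepare('INSERT INTO snippets (content) VALUES (?)')->execute(array($mydata));

    // later, on whatever page you want it to show up:
    $row = $db->query('SELECT content FROM snippets ORDER BY id DESC LIMIT 1')->fetch();
    echo $row['content'];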
     
    • Thanks x 1
  4. bananaman5000

    bananaman5000 Regular Member

    Joined:
    Mar 5, 2010
    Messages:
    200
    Likes Received:
    130
    PHP with cURL is a great option. You can also scrape specific HTML tags using regular expressions; it's fairly simple to set up.

    PHP:
    $html_to_scrape = '/<div id="example">(.*?)<\/div>/';

    // remember to escape the forward slashes

    // the (.*?) is a wildcard that means 'scrape everything that was here'

    preg_match($html_to_scrape, $output, $result);

    echo $result[1]; // [0] is the whole match, [1] is what the wildcard captured

    The $output in this case is from phpbuilt's cURL example. One thing to note: the HTML you put in the pattern has to match the source exactly, including whitespace and line breaks, for it to work. It's best to just copy and paste it from the source, in fact.
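    And if the tag appears more than once on the page, preg_match_all grabs every occurrence instead of just the first; the /s modifier lets the wildcard span line breaks (same placeholder markup as above):

    PHP:
    // $output is the raw page from the cURL example
    preg_match_all('/<div id="example">(.*?)<\/div>/s', $output, $matches);

    // $matches[1] holds only the captured wildcard parts
    foreach ($matches[1] as $match) {
        echo $match . "\n";
    }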
     
    Last edited: Dec 29, 2011
  5. cunning

    cunning Newbie

    Joined:
    Dec 29, 2011
    Messages:
    8
    Likes Received:
    0
    Occupation:
    Occupation
    Location:
    Auckland, New Zealand
    I second the thanks - that is awesome! :)

    I have been playing with GAWK as well, as it seems to be fuzzier in that it searches for patterns rather than exact strings.

    And for scraping in general, combining these with AutoHotkey seems pretty good for indexing a site.
     
  6. sirgold

    sirgold Supreme Member

    Joined:
    Jun 25, 2010
    Messages:
    1,260
    Likes Received:
    645
    Occupation:
    Busy proving the Pareto principle right
    Location:
    A hot one
    PHP is a great suggestion. It runs on basically any hosting plan, locally with Apache/WAMP/XAMPP, or even as a command-line scripting language. Its XPath capabilities make it an excellent choice for traversing, parsing, and extracting from the DOM.
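    A quick sketch of the XPath route (the URL and query are placeholders; the libxml call stops messy real-world HTML from throwing warnings):

    PHP:
    libxml_use_internal_errors(true); // don't choke on sloppy markup

    $doc = new DOMDocument();
    $doc->loadHTML(file_get_contents('http://domain.com')); // placeholder URL

    // placeholder query: every <td> inside rows with class "header"
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//tr[@class="header"]/td') as $cell) {
        echo trim($cell->textContent) . "\n";
    }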

    Regular expressions, as suggested, are another skill you need to master if you wanna create your own scraper and customize it to perfection. Plus, most real-world HTML is patchy, messy, not really W3C-compliant code, so you can't skip regex.

    If you wanna put something quick together and are comfortable with a command prompt, you can use wget and grep, which are easily available for Windows too.

    something like: wget -r -qO- site.com | grep -oE 'regex'

    will probably be the fastest way to quickly test your data with a single command that recursively fetches site.com and extracts your 'regex' matches (the -O- sends everything to stdout so grep actually sees it).
     
    • Thanks x 2