
Looking for webscraping resources....

Discussion in 'General Programming Chat' started by phatzilla, Jun 22, 2014.

  1. phatzilla

    phatzilla Supreme Member

    Joined:
    Apr 9, 2009
    Messages:
    1,366
    Likes Received:
    1,017
    I am interested in learning a more modern, up-to-date programming/scripting language to "scrape/automate" the web.

    I already know how the basic concepts work (GET/POST requests, querystrings, postdata, user-agents, headers, cookies, proxies). However, it's probably safe to say that the language I currently use (AutoIt scripting) is a bit long in the tooth, and it might be time for a change since it doesn't even support proper multithreading...

    I am not looking to build advanced, optimized desktop applications that take a year to finalize; I'd rather learn a language that is MODERN and simple/powerful, with rapid deployment capabilities (which is why I like AutoIt), for my web automation needs. I've read about Python/Ruby on Rails/Node.js, but I'd like to hear it from fellow blackhatters who actually build their own interesting custom web bots. I know I am not going to become some super badass programmer all of a sudden; I'd just like to slap some cool web automation tools together, because it's the most fun to design them yourself, and it's about time to graduate from AutoIt. Some resources would be nice to read over...

    Cheers
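    (For reference, those request-level concepts map almost one-to-one onto Python's third-party requests library. A minimal, untested sketch with placeholder URLs and a placeholder proxy:)
    Code:
    import requests

    session = requests.Session()                  # persists cookies across requests
    headers = {'User-Agent': 'Mozilla/5.0'}       # custom user-agent header
    proxies = {'http': 'http://127.0.0.1:8080'}   # placeholder proxy

    # GET with querystring parameters
    r = session.get('http://example.com/search',
                    params={'q': 'test'}, headers=headers, proxies=proxies)

    # POST with form data (postdata)
    r = session.post('http://example.com/login',
                     data={'user': 'me', 'pass': 'secret'}, headers=headers)

    print(r.status_code, session.cookies.get_dict())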
     
  2. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    950
    Likes Received:
    662
    Occupation:
    Web/Bot Developer
    I've pretty much replaced all my old-school Python-based scraping and automation tools with PhantomJS + CasperJS.

    PhantomJS is a headless WebKit scriptable with a JavaScript API.
    Code:
    http://phantomjs.org/
    CasperJS is an open source navigation scripting & testing utility written in JavaScript for PhantomJS.
    Code:
    http://casperjs.org/
     
    • Thanks x 1
    Last edited: Jun 23, 2014
  3. bighomie

    bighomie Registered Member

    Joined:
    Oct 6, 2013
    Messages:
    98
    Likes Received:
    43
    Occupation:
    Online hustlin
    Location:
    ******
    I've been using Python to scrape the web for a couple of months now. I'm using Python 2.7.3 and a couple of libraries (BeautifulSoup + mechanize) for my web scraping needs. There are a couple of other libraries out there, like Scrapy, but I have no experience with them.

    I have not used any other languages to scrape the web, as Python has everything I need. I luv you Python <3
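    (For reference, a minimal, untested sketch of that stack; the URL is a placeholder:)
    Code:
    import mechanize
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; bs4 works similarly

    br = mechanize.Browser()
    br.set_handle_robots(False)                    # don't fetch/obey robots.txt
    br.addheaders = [('User-Agent', 'Mozilla/5.0')]

    html = br.open('http://example.com').read()    # fetch the page
    soup = BeautifulSoup(html)

    for a in soup.findAll('a', href=True):         # every link on the page
        print(a['href'])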
     
  4. todordonev

    todordonev Regular Member

    Joined:
    Nov 23, 2012
    Messages:
    379
    Likes Received:
    228
    Gender:
    Male
    Location:
    Bulgaria
    Home Page:
    Good old uBot works perfectly for scraping, although it isn't even close to fast.
     
  5. phatzilla

    phatzilla Supreme Member

    Joined:
    Apr 9, 2009
    Messages:
    1,366
    Likes Received:
    1,017
    uBot isn't actual programming, though, and I can already build plain HTTP web-request programs with AutoIt as it is. It's just limited.
     
  6. TheeAriGrande

    TheeAriGrande Regular Member

    Joined:
    Jul 14, 2013
    Messages:
    270
    Likes Received:
    151
    Location:
    Candlestick Park
    Last edited: Jun 23, 2014
  7. dgruergerugerhiye

    dgruergerugerhiye BANNED Jr. VIP Premium Member

    Joined:
    Nov 4, 2010
    Messages:
    305
    Likes Received:
    450
    Ruby + Mechanize, or Ruby + Watir for browser driven stuff.
     
  8. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    C# + HTML Agility Pack or SGMLReader.
     
  9. zohar

    zohar Newbie

    Joined:
    Jun 24, 2014
    Messages:
    44
    Likes Received:
    5
    I am in the process of writing one myself. One tip: use a .NET language with a patched (i.e., fully enabled) IE/WebBrowser component.

    It's extremely hard to find the source code of a working control, but once you find it, you've basically won the jackpot. It's out there somewhere.

    .NET might not be the fastest thing around, but IMHO your server is as fast as the amount of money you have in your pocket.

    Good luck.
     
    Last edited: Jun 24, 2014
  10. k0d3r

    k0d3r Newbie

    Joined:
    Feb 17, 2013
    Messages:
    36
    Likes Received:
    28
    Location:
    Keyboard
    C#?? JavaScript? C++? :nono:

    Life is short, use Python! :)
     
  11. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    I've been using C# for the past few years solely for scraping, because it's easy and quick. Yes, there are bugs, and yes, there are things I don't like.
    Did you know that the majority of the marketing software flying around out there is coded in .NET: ZennoPoster, Mass Video Blaster, Proxy Multiply and many more?

    If Python allows you to use XPath over HTML at no additional cost, then I will vote for it, regardless of the awkward syntax.

    EDIT: I always wanted to test the performance of Python against .NET, but never had the time to learn the basics.
    I'm certain that there will be almost no difference, but I am curious about parsing the HTML.
    In .NET, there is no built-in mechanism for parsing HTML; your options are regex, or converting the HTML to XML (which is very costly, performance-wise) and using XPath.
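    (For what it's worth, Python's lxml library does offer exactly that: it parses HTML directly and runs XPath over the resulting tree, with no HTML-to-XML conversion step. A minimal, untested sketch; the file name is a placeholder:)
    Code:
    # pip install lxml
    from lxml import html

    tree = html.fromstring(open('page.html').read())  # parse HTML as-is

    # XPath runs straight over the HTML tree
    for href in tree.xpath('//h3[@class="r"]//a/@href'):
        print(href)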
     
    Last edited: Jun 25, 2014
  12. Gary Becks

    Gary Becks Power Member

    Joined:
    Apr 11, 2010
    Messages:
    675
    Likes Received:
    282
    Location:
    Atl
    Home Page:
    Python + BeautifulSoup, or Scrapy.
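    (A minimal, untested Scrapy spider sketch with a placeholder URL; run it with "scrapy runspider spider.py":)
    Code:
    import scrapy

    class LinksSpider(scrapy.Spider):
        name = 'links'
        start_urls = ['http://example.com']  # placeholder start page

        def parse(self, response):
            # response.xpath() returns selectors; extract() gives plain strings
            for href in response.xpath('//a/@href').extract():
                yield {'url': href}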
     
  13. Gogol

    Gogol Elite Member

    Joined:
    Sep 10, 2010
    Messages:
    3,063
    Likes Received:
    2,872
    Gender:
    Male
    Well, I am not sure if you would use PHP, but here's some code to help you get started with PHP web scrapers.

    The cURL function, which fetches HTML from any given URL (you could use file_get_contents, but cURL is much more advanced):
    Code:
    function curly($url) {
        // NOTE: get_option() is a WordPress helper; outside WordPress, load
        // your proxy list (one proxy per line) from a file or database instead.
        $proxy_list = get_option('proxy_list');
        $proxies = @explode("\n", $proxy_list);
        $proxy_support = false;
        if (!empty($proxies)) {
            // pick a random proxy from the list
            $random_proxy = $proxies[rand(0, (count($proxies) - 1))];
            if (!empty($random_proxy)) {
                $proxy_support = true;
            }
        }
        $agent = 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20090910 Ubuntu/9.04 (jaunty) Shiretoko/3.5.3';

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_VERBOSE, true); // debug output to STDERR; disable in production
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_USERAGENT, $agent);
        curl_setopt($ch, CURLOPT_REFERER, "http://tech5.net");
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
        curl_setopt($ch, CURLOPT_TIMEOUT, 400);
        curl_setopt($ch, CURLOPT_POST, false);
        if ($proxy_support)
            curl_setopt($ch, CURLOPT_PROXY, $random_proxy);
        curl_setopt($ch, CURLOPT_URL, $url);
        $html = curl_exec($ch);
        curl_close($ch); // free the handle instead of leaking it
        return $html;
    }
    
    I am using a Google search results page for the fetch example. The code is commented for easy understanding:

    Code:
    $url = 'https://www.google.co.in/search?num=50&safe=off&client=firefox-a&hs=Mn8&rls=org.mozilla:en-US:official&channel=rcs&q=bottlenecks&spell=1&sa=X&ei=q_erU6aTKo-9uAT074GgDQ&ved=0CBoQvwUoAA&biw=1366&bih=587';
    
    $html = curly($url); // defined in the previous example; include that function in your script
    $dom = new DOMDocument();
    
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom); 
    /* Google puts each result link in an <a> tag inside an <h3> tag with class "r", hence //h3[@class="r"]//a */
    $links = $xpath->query('//h3[@class="r"]//a');
    $length = $links->length; // total number of links fetched
    $all_links = array(); // this will store your links
    
    for ($i = 0; $i < $length; $i++) {
      $element = $links->item($i);
      /* read the href attribute of each element and store its value in $all_links */
      $all_links[] = $xpath->evaluate('@href', $element)->item(0)->value;
    }
    
    /* now your variable $all_links has all the links. do something with it*/
    
    Hope this helps get you started :)
     
  14. Schvamp

    Schvamp Power Member

    Joined:
    Feb 13, 2012
    Messages:
    684
    Likes Received:
    549
    Location:
    Hogwarts
  15. Gogol

    Gogol Elite Member

    Joined:
    Sep 10, 2010
    Messages:
    3,063
    Likes Received:
    2,872
    Gender:
    Male
    Why use Simple HTML DOM when PHP has built-in DOM classes for the same purpose? Check my previous post for an example. It is a lot faster than Simple HTML DOM because the built-in DOM extension is pre-compiled code rather than userland PHP.
     
  16. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    What about PHP and multithreading?
     
  17. Schvamp

    Schvamp Power Member

    Joined:
    Feb 13, 2012
    Messages:
    684
    Likes Received:
    549
    Location:
    Hogwarts
    The OP doesn't seem to have any knowledge of PHP. I think jumping straight into cURL might be a huge step.