
Looking for webscraping resources....

Discussion in 'General Programming Chat' started by phatzilla, Jun 22, 2014.

  1. phatzilla

    phatzilla Supreme Member

    Joined:
    Apr 9, 2009
    Messages:
    1,366
    Likes Received:
    1,017
    I am interested in learning a more modern, up-to-date programming/scripting language to "scrape/automate" the web.

    I already know how the basic concepts work (GET/POST requests, querystrings, postdata, user-agents, headers, cookies, proxies). However, it's probably safe to say that the language I currently use (AutoIt scripting) is a bit long in the tooth, and it might be time for a change since it doesn't even support proper multithreading...

    I am not looking to build advanced, optimized desktop applications that take a year to finalize; I'd rather learn a language that is MODERN and simple/powerful, with rapid deployment capabilities (which is why I like AutoIt), for my web automation needs. I've read about Python/Ruby on Rails/Node.js, but I'd like to hear it from fellow blackhatters who actually build their own interesting custom web bots. I know I am not going to become some super badass programmer all of a sudden; I'd just like to slap some cool web automation tools together, because it's the most fun to design them yourself, and it's about time to graduate from AutoIt. Some resources would be nice to read over...

    Cheers
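    (For reference, those request-level concepts map almost one-to-one onto Python's third-party requests library. A minimal, untested sketch with placeholder URLs and a placeholder proxy:)
    Code:
    import requests

    session = requests.Session()                  # persists cookies across requests
    headers = {'User-Agent': 'Mozilla/5.0'}       # custom user-agent header
    proxies = {'http': 'http://127.0.0.1:8080'}   # placeholder proxy

    # GET with querystring parameters
    r = session.get('http://example.com/search',
                    params={'q': 'test'}, headers=headers, proxies=proxies)

    # POST with form data (postdata)
    r = session.post('http://example.com/login',
                     data={'user': 'me', 'pass': 'secret'}, headers=headers)

    print(r.status_code, session.cookies.get_dict())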
     
  2. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    950
    Likes Received:
    662
    Occupation:
    Web/Bot Developer
    I've pretty much replaced all my old-school Python-based scraping and automation tools with PhantomJS + CasperJS.

    PhantomJS is a headless WebKit scriptable with a JavaScript API.
    Code:
    http://phantomjs.org/
    CasperJS is an open source navigation scripting & testing utility written in JavaScript for PhantomJS.
    Code:
    http://casperjs.org/
     
    • Thanks x 1
    Last edited: Jun 23, 2014
  3. bighomie

    bighomie Registered Member

    Joined:
    Oct 6, 2013
    Messages:
    98
    Likes Received:
    43
    Occupation:
    Online hustlin
    Location:
    ******
    I've been using Python to scrape the web for a couple of months now. I'm using Python 2.7.3 and a couple of libraries (BeautifulSoup + mechanize) for my web scraping needs. There are a couple of other libraries out there, like Scrapy, but I have no experience with them.

    I have not used any other languages to scrape the web, as Python has everything I need. I luv you Python <3
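    (For reference, a minimal, untested sketch of that stack; the URL is a placeholder:)
    Code:
    import mechanize
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; bs4 works similarly

    br = mechanize.Browser()
    br.set_handle_robots(False)                    # don't fetch/obey robots.txt
    br.addheaders = [('User-Agent', 'Mozilla/5.0')]

    html = br.open('http://example.com').read()    # fetch the page
    soup = BeautifulSoup(html)

    for a in soup.findAll('a', href=True):         # every link on the page
        print(a['href'])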
     
  4. todordonev

    todordonev Regular Member

    Joined:
    Nov 23, 2012
    Messages:
    379
    Likes Received:
    228
    Gender:
    Male
    Location:
    Bulgaria
    Home Page:
    Good old uBot works perfectly for scraping, although it isn't even close to fast.
     
  5. phatzilla

    phatzilla Supreme Member

    Joined:
    Apr 9, 2009
    Messages:
    1,366
    Likes Received:
    1,017
    uBot isn't actual programming, though, and I can already build plain HTTP web-request programs with AutoIt as it is. It's just limited.
     
  6. TheeAriGrande

    TheeAriGrande Regular Member

    Joined:
    Jul 14, 2013
    Messages:
    270
    Likes Received:
    151
    Location:
    Candlestick Park
    Last edited: Jun 23, 2014
  7. dgruergerugerhiye

    dgruergerugerhiye BANNED Jr. VIP Premium Member

    Joined:
    Nov 4, 2010
    Messages:
    305
    Likes Received:
    450
    Ruby + Mechanize, or Ruby + Watir for browser driven stuff.
     
  8. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    C# + HTML Agility Pack or SGMLReader.
     
  9. zohar

    zohar Newbie

    Joined:
    Jun 24, 2014
    Messages:
    44
    Likes Received:
    5
    I am in the process of writing one myself. One tip: use a .NET language with a patched (i.e., fully enabled) IE/WebBrowser component.

    It's extremely hard to find the source code of a working control, but once you find it, you've basically won the jackpot. It's out there somewhere.

    .NET might not be the fastest thing around, but IMHO your server is as fast as the amount of money you have in your pocket.

    Good luck.
     
    Last edited: Jun 24, 2014
  10. k0d3r

    k0d3r Newbie

    Joined:
    Feb 17, 2013
    Messages:
    36
    Likes Received:
    28
    Location:
    Keyboard
    C#?? JavaScript? C++? :nono:

    Life is short, use Python! :)
     
  11. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    I've been using C# for the past few years solely for scraping, because it's easy and quick. Yes, there are bugs, and yes, there are things I don't like.
    Did you know that the majority of the marketing software flying around out there is coded in .NET: ZennoPoster, Mass Video Blaster, Proxy Multiply and many more?

    If Python allows you to use XPath over HTML at no additional cost, then I will vote for it, regardless of the awkward syntax.

    EDIT: I always wanted to test the performance of Python against .NET, but never had the time to learn the basics.
    I'm certain that there will be almost no difference, but I am curious about parsing the HTML.
    In .NET, there is no built-in mechanism for parsing HTML; your options are regex, or converting the HTML to XML (which is very costly, performance-wise) and using XPath.
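    (For what it's worth, Python's lxml library does offer exactly that: it parses HTML directly and runs XPath over the resulting tree, with no HTML-to-XML conversion step. A minimal, untested sketch; the file name is a placeholder:)
    Code:
    # pip install lxml
    from lxml import html

    tree = html.fromstring(open('page.html').read())  # parse HTML as-is

    # XPath runs straight over the HTML tree
    for href in tree.xpath('//h3[@class="r"]//a/@href'):
        print(href)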
     
    Last edited: Jun 25, 2014
  12. Gary Becks

    Gary Becks Power Member

    Joined:
    Apr 11, 2010
    Messages:
    675
    Likes Received:
    282
    Location:
    Atl
    Home Page:
    Python + BeautifulSoup, or Scrapy.
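    (A minimal, untested Scrapy spider sketch with a placeholder URL; run it with "scrapy runspider spider.py":)
    Code:
    import scrapy

    class LinksSpider(scrapy.Spider):
        name = 'links'
        start_urls = ['http://example.com']  # placeholder start page

        def parse(self, response):
            # response.xpath() returns selectors; extract() gives plain strings
            for href in response.xpath('//a/@href').extract():
                yield {'url': href}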
     
  13. Gogol

    Gogol Elite Member

    Joined:
    Sep 10, 2010
    Messages:
    3,063
    Likes Received:
    2,872
    Gender:
    Male
    Well, I am not sure if you would use PHP, but here's some code to help you get started with PHP web scrapers.

    The cURL function, which fetches HTML from any given URL (you could use file_get_contents, but cURL is much more advanced):
    Code:
    function curly($url) {
        // NOTE: get_option() is a WordPress helper; outside WordPress, load
        // your proxy list (one proxy per line) from a file or database instead.
        $proxy_list = get_option('proxy_list');
        $proxies = @explode("\n", $proxy_list);
        $proxy_support = false;
        if (!empty($proxies)) {
            // pick a random proxy from the list
            $random_proxy = $proxies[rand(0, (count($proxies) - 1))];
            if (!empty($random_proxy)) {
                $proxy_support = true;
            }
        }
        $agent = 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20090910 Ubuntu/9.04 (jaunty) Shiretoko/3.5.3';

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_VERBOSE, true); // debug output to STDERR; disable in production
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_USERAGENT, $agent);
        curl_setopt($ch, CURLOPT_REFERER, "http://tech5.net");
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
        curl_setopt($ch, CURLOPT_TIMEOUT, 400);
        curl_setopt($ch, CURLOPT_POST, false);
        if ($proxy_support)
            curl_setopt($ch, CURLOPT_PROXY, $random_proxy);
        curl_setopt($ch, CURLOPT_URL, $url);
        $html = curl_exec($ch);
        curl_close($ch); // free the handle instead of leaking it
        return $html;
    }
    
    I am using a Google search results page for the fetch example. The code is commented for easy understanding:

    Code:
    $url = 'https://www.google.co.in/search?num=50&safe=off&client=firefox-a&hs=Mn8&rls=org.mozilla:en-US:official&channel=rcs&q=bottlenecks&spell=1&sa=X&ei=q_erU6aTKo-9uAT074GgDQ&ved=0CBoQvwUoAA&biw=1366&bih=587';
    
    $html = curly($url); // defined in the previous example; include that function in your script
    $dom = new DOMDocument();
    
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom); 
    /* Google puts each result link in an <a> tag inside an <h3> tag with class "r", hence //h3[@class="r"]//a */
    $links = $xpath->query('//h3[@class="r"]//a');
    $length = $links->length; // total number of links fetched
    $all_links = array(); // this will store your links
    
    for ($i = 0; $i < $length; $i++) {
      $element = $links->item($i);
      /* read the href attribute of each element and store its value in $all_links */
      $all_links[] = $xpath->evaluate('@href', $element)->item(0)->value;
    }
    
    /* now your variable $all_links has all the links. do something with it*/
    
    Hope this helps get you started :)
     
  14. Schvamp

    Schvamp Power Member

    Joined:
    Feb 13, 2012
    Messages:
    684
    Likes Received:
    549
    Location:
    Hogwarts
  15. Gogol

    Gogol Elite Member

    Joined:
    Sep 10, 2010
    Messages:
    3,063
    Likes Received:
    2,872
    Gender:
    Male
    Why use Simple HTML DOM when PHP has built-in DOM classes for the same purpose? Check my previous post for an example. It is a lot faster than Simple HTML DOM because the built-in DOM extension is pre-compiled code rather than userland PHP.
     
  16. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    What about PHP and multithreading?
     
  17. Schvamp

    Schvamp Power Member

    Joined:
    Feb 13, 2012
    Messages:
    684
    Likes Received:
    549
    Location:
    Hogwarts
    The OP doesn't seem to have any knowledge of PHP. I think jumping straight into cURL might be a huge step.