Best Languages for Web Scraping

abd0gheist

Hello all,

As my title suggests, I'm wondering which programming languages are best for developing web scrapers. I don't know much about developing or using them, so please pardon the vagueness of my question. Are there particular advantages to using a general-purpose language such as Python for building a scraper, versus a more specialized language?
 
Whatever language allows you to send HTTP requests and parse (X)HTML or JSON. Whatever you use to parse (X)HTML should be forgiving, because there is a lot of badly formed HTML out there.

I've used Ruby, Python, Clojure, and even shell scripts to scrape. They all work pretty well.
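
To illustrate what "forgiving" means here, below is a minimal sketch using only Python's standard-library `html.parser`, which keeps going through unclosed tags and unquoted attributes instead of raising an error. The `LinkCollector` class and the `BAD_HTML` sample are made up for this example; a real scraper would more likely use a dedicated library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Deliberately broken HTML (hypothetical sample): unclosed <p>, <b> and <a>,
# an unquoted attribute value, and no closing </body>.
BAD_HTML = """
<html><body>
<p>First <b>bold <a href=page1.html>one</a>
<p>Second <a href="/page2">two
</html>
"""

collector = LinkCollector()
collector.feed(BAD_HTML)
collector.close()
print(collector.links)
```

A strict XML parser would reject this input outright, which is why a tolerant HTML parser is usually the safer default for pages you don't control.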
 
There are many good scrapers available. You can try Python's Scrapy.
 
Pretty much any language will have a web scraping package. I've personally used Ruby + Nokogiri, but Python has a few good ones too.
 
I use VB.Net without a problem.

It all depends on how much time you have to invest in a given technology.
 
Python: use python-requests to fetch pages and python-lxml to parse the HTML (you can use XPath or cssselect to select the elements you'd like to extract).
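
As a rough idea of what selecting elements by path looks like, here is a small sketch. It deliberately swaps in the standard library's `xml.etree.ElementTree` so it runs without installing anything; that module only understands a subset of XPath and needs well-formed markup, so treat it as a stand-in for lxml rather than the setup described above. The `PAGE` snippet and its class names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML snippet standing in for a downloaded page
PAGE = """
<html>
  <body>
    <div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
    <div class="product"><h2>Gadget</h2><span class="price">19.50</span></div>
    <div class="ad"><h2>Buy now</h2></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

products = []
# XPath-style query: every <div> whose class attribute is exactly "product"
for div in root.findall(".//div[@class='product']"):
    name = div.findtext("h2")
    price = div.findtext("span[@class='price']")
    products.append((name, float(price)))

print(products)
```

With real-world pages you would fetch the HTML first and hand it to a tolerant parser, since ElementTree will raise an error on markup that is not well formed.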
 
Python is a good programming language to get started with, and it's very good at scraping.
 
I've been scraping using Python's requests and BeautifulSoup modules. Is there any benefit to using Scrapy instead? Is it standalone software or more like a module?
 
There are plenty of choices. I often write scrapers in bash (shell scripts). I just use either cURL or wget to hit the URL and download the page, then extract the content I want using regex with grep and sed as needed. It's quick and dirty, but it's magic.

I also use iMacros in combination with JavaScript. I find that once you learn the iMacros syntax, macros can be very fast to whip up.

Another potential tech stack for scraping is Java + Selenium + PhantomJS.

The sky is the limit. A protip when writing a scraper for a given site: hit F12 in your browser to bring up the dev tools, then use the selection mode and hover over the text or image you are interested in scraping. The dev tools should give you an indication of which CSS selector you need to target to extract that bit of data.

And if you're writing scrapers using regex, one gotcha to watch out for is greedy pattern matching. Newbies might find their neatly crafted regex matches nearly the entire page because they ended it with a " or a >, and a greedy quantifier such as .* runs on to the last " or > on the page rather than stopping at the first one.
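
The greedy-matching gotcha is easy to reproduce. The sketch below uses Python's `re` module on a made-up `HTML` string: the first pattern uses a greedy `.*`, the second a lazy `.*?`, and the third a negated character class, which is another common way to stop at the first closing quote.

```python
import re

# Hypothetical one-line page with two links
HTML = '<a href="/first">First</a> and <a href="/second">Second</a>'

# Greedy: .* grabs as much as it can, so it runs to the LAST double quote
greedy = re.findall(r'href="(.*)"', HTML)

# Lazy: .*? grabs as little as it can, so it stops at the FIRST closing quote
lazy = re.findall(r'href="(.*?)"', HTML)

# Negated class: match anything that is not a quote, so it cannot overrun
negated = re.findall(r'href="([^"]*)"', HTML)

print(greedy)   # ['/first">First</a> and <a href="/second']
print(lazy)     # ['/first', '/second']
print(negated)  # ['/first', '/second']
```

The same idea carries over to grep and sed, although basic and extended POSIX regex have no lazy quantifier, so the negated-class form is the one that works there.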
 
Here is a great list of tools that may interest you
Code:
https://www.quora.com/Which-are-some-of-the-best-web-data-scraping-tools
https://www.google.com/#q=top+web+scraper
 
I have used Perl over the years to scrape many sites, log into sites, and so on. With the prevalence of JavaScript on sites, I have found a headless browser like PhantomJS to be the best way of getting at the final rendered page.
 
This is good advice. For programming-related questions, though, it's best to look at places like Stack Overflow before asking on here.
 
I tend to find that questions about scraping get heavily downvoted on SO. As a developer, by the time you have to ask something on Stack Overflow it's usually because it's a really challenging problem and you haven't been able to solve it yourself, so you ask for some help. Then along comes some clueless script kiddie who's like 'need to scrape 1000 porn urls pls help', and that kind of post among a group of professionals just comes off the wrong way.

Though, if you are going to ask scraping-related questions on Stack Overflow, make them look more professional. Ask specifically about the library you are using for scraping (and make sure to tag the question with that library's tag too). If possible, try not to use the word scraping. You could phrase question titles like 'having trouble targeting css selector' or 'regex accidentally matches the whole page, why?'. You will likely get help without getting downvoted this way.
 