Discussion in 'General Scripting Chat' started by greenculture, Jun 17, 2014.
What's the fastest way to scrape a lot of pages?
Google Scrapy - it's simple, open source, and well documented. The fastest solution will always be a dedicated scraper, though. I'm a Python fan - it's easy, well documented, and has good performance.
rumor has it that scrapebox can do just this.
I heard this also. There was another one too, I think GScraper.
The question seems to be what's the fastest language for web scraping.
The answer is: they're all about as fast, because crawlers are heavily I/O bound. That means a lot of the time the program is just waiting for a webpage to download, so computing speed is not that important.
It is, however, a lot faster to develop crawlers in Python compared to most other languages, and Python has some great libraries for that, so I think it would be your best choice.
I've made hundreds of crawlers so you could say I know what I'm talking about .
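To illustrate the I/O-bound point, here is a minimal thread-pool downloader sketch in Python. The worker count is arbitrary, and the fetch function is injectable so the pooling logic stands on its own - none of this is any poster's actual code:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    # One blocking download; the thread just sleeps while waiting on the socket.
    with urlopen(url, timeout=timeout) as resp:
        return url, resp.read()

def fetch_all(urls, workers=20, fetch_fn=fetch):
    # fetch_fn is injectable so the pooling logic can be run without a network.
    # pool.map preserves the input order of urls.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_fn, urls))
```

Because each thread spends nearly all its time blocked on the socket, even Python's GIL is not a bottleneck here - which is why language choice barely matters for raw crawl speed.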
scrapebox or scrapejet can do the job. no need for programming language.
As others have pointed out, if you are willing to pay and don't want to create anything yourself services like scrapebox will do just fine, provided you have a sufficient number of working proxies to bypass any connection-limiters the site you are trying to scrape may implement.
If, like most code-proficient people, you would rather create your own software for free, the language is largely unimportant, as the primary time limiter for scraping will be the time it takes to download the site source over a connection. As someone pointed out, however, python is probably the easiest to learn and probably has great data transfer libraries (I personally don't use Python much, for no good reason ).
As stated, the main issue with scraping lots of data fast is not the speed of the language, but the speed at which you can make new and simultaneous connections to a site without getting blocked. Simply connecting to a site 10 times a second with your personal VPS or whatever's IP will get you blocked in a heartbeat, so the primary art of scraping is finding new methods to connect to a site with lots of IPs, and make those connections seem organic (look like a normal human web user made them).
Making scraping requests look organic is a whole sub-field in and of itself, but some basic tips include making sure your requests are REALLY only as fast as you NEED them to be; though it's nice to get 100,000 keywords in a day, do you really absolutely need them that fast? Most methods sites use to stop scraping are far more forgiving of, say, 1 connection a second vs 5/second. Throwing some randomness into the intervals between connections is also a simple thing you can do that can help a lot.
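That randomized-interval tip fits in a few lines of Python; the base delay and jitter values below are arbitrary examples, not recommendations:

```python
import random
import time

def polite_delay(base=1.0, jitter=0.5):
    # Wait base +/- jitter seconds between requests so the timing
    # looks less robotic than a fixed interval; never sleep a negative time.
    delay = max(base + random.uniform(-jitter, jitter), 0)
    time.sleep(delay)
    return delay
```

Calling `polite_delay()` between requests gives intervals spread over 0.5-1.5 seconds instead of a machine-regular 1.0.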
Now as far as finding ways to get lots of adequately fast IPs to connect from (and ideally at as low a cost as possible), proxies are of course generally the number one method. A gazillion people offer private lists of proxy IPs for scraping here on BHW and across the web, but at a price. For less dependable connections there are a ton of free proxy lists you can find out on the web with a little work; just search "proxy list" on BHW to find a few.
A few more common methods, besides proxies, include spinning up multiple Amazon EC2 instances (or instances from any other micro-instance provider - these can run for cents apiece, and you get a new IP per instance), using the Tor network (which gives you ~1000 IPs to work with, but the IPs are publicly known and often blocked by sites), sending requests through Yahoo Pipes (never used it so I can't report on it; supposedly it has a large number of IPs), and buying large blocks of IPv6 addresses relatively cheaply.
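The proxy-rotation idea boils down to round-robin assignment. A small sketch, assuming a hypothetical proxy list (the addresses below are placeholders; in practice they would come from a bought or scraped list):

```python
from itertools import cycle

# Hypothetical proxy list - placeholder addresses only.
PROXIES = ["http://1.2.3.4:8080", "http://5.6.7.8:3128"]

def with_rotating_proxies(urls, proxies=PROXIES):
    # Pair each request with the next proxy, round-robin, so no single IP
    # carries all the traffic. The actual request would then be routed
    # through that proxy (e.g. via urllib.request.ProxyHandler).
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]
```

With two proxies and three URLs, the first proxy simply comes around again for the third request.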
As said above, the programming language the scraper is written in really does nothing (significant) to the speed. A scraping bot is going to spend 90% of its time waiting for a network response, so the network is really what dictates the speed of a scraper. There is one exception: if you're saving a lot of the data you scrape, not just analyzing it, then you'll also have to think about disk I/O speed. For this reason I suggest scraping from a VPS with good SLA bandwidth that resides in a data center with consistent speeds.
Hmmm....are you sure?
You probably want to distinguish development time from run time.
Run time will be slower with Python or any interpreted language, while development will be faster because it's a higher-level language.
Development time will be slower with C, but run time will be quicker.
Today, development time costs much more than run time, so I would suggest using Python.
Also use multiprocessing (not multithreading!) and you'll get optimal speed with good development time.
Python can run on windows/mac/linux so you can run it on whatever server you have.
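A minimal sketch of that multiprocessing suggestion, assuming the CPU-heavy part is parsing already-downloaded pages (the `parse` function here is just a stand-in for real extraction work):

```python
from multiprocessing import Pool

def parse(html):
    # Stand-in for CPU-heavy parsing. Separate processes sidestep the GIL,
    # which is why multiprocessing beats multithreading for CPU-bound work.
    return len(html.split())

def parse_all(pages, processes=2):
    # Fan the pages out across worker processes; results come back in order.
    with Pool(processes) as pool:
        return pool.map(parse, pages)
```

The usual split is: threads (or async) for the network-bound download stage, processes for the CPU-bound processing stage.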
bots/software like scrape box
To crawl a lot of pages, you need an efficient multithreaded crawler.
Java and Python seem to be the best languages for building multithreaded crawlers, for the following reasons:
1. Multi threading support.
2. A good set of libraries: Python has Scrapy, Java has crawler4j.
3. There are a lot of tutorials for writing crawlers in python and java.
4. A lot of high quality open source libraries are available so that you can easily analyse/process the extracted data.
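Underneath libraries like Scrapy and crawler4j sits the same basic loop: a frontier queue plus a seen-set. A single-threaded Python sketch, with the page-download-and-link-extraction step stubbed out as an injected function (not taken from any of those libraries):

```python
from collections import deque

def crawl(seed, get_links, max_pages=100):
    # Breadth-first frontier with a seen-set - the core of any crawler.
    # get_links(url) -> iterable of outgoing links; real code would download
    # and parse the page here (and is what the worker threads parallelize).
    seen, frontier, order = {seed}, deque([seed]), []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

Distributed crawlers like Nutch shard this same frontier and seen-set across machines, which is where the real complexity comes from.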
To crawl a "lot of pages", you will need a distributed multithreaded crawler. Such systems can get really complex. This is when you will start looking at open source crawlers like:
1. Apache Nutch
2. Heritrix (the crawler the Wayback Machine uses)
Search for how-to-crawl-a-quarter-billion-webpages-in-40-hours and you may find something interesting.
The question OP asked is "What's the fastest way to scrape a lot of pages?"
+1 Great article:
If you want to download the pages locally to view later, you should try wget (http://ftp.gnu.org/gnu/wget/)
Then use command:
wget -r URL.GOES.HERE
And it will recursively download all the pages.
Wget has tons of customizations so you should check the documentation as well as sites like stackoverflow
There is one more tool, aria2c. Sometimes aria2c/wget + grep proves to be more useful.
It really depends on what you are doing with the data. If you are simply downloading a lot, calling curl/wget from any of the languages would work; if you are manipulating the data or extracting only certain pieces, it would depend on exactly what you're doing. Each language has benefits and downsides.
I need a bot developed. If you are experienced in writing bots that can work in the background, please send me your Skype or email.