
What's the fastest way to scrape a lot of pages?

Discussion in 'General Scripting Chat' started by greenculture, Jun 17, 2014.

  1. greenculture

    greenculture Newbie

    Joined:
    Dec 21, 2013
    Messages:
    29
    Likes Received:
    1
    What's the fastest way to scrape a lot of pages?

    1. PHP?

    2. Python?

    3. Perl?

    Etc.
     
  2. Mark Developer

    Mark Developer Newbie

    Joined:
    Jun 14, 2014
    Messages:
    32
    Likes Received:
    30
    Google "Scrapy" - it's simple, open source, and well documented. The fastest solution will always be a dedicated scraper, though. I'm a Python fan - it's easy, well documented, and has good performance.
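    For reference, a minimal Scrapy spider looks something like this (the class name and start URL are just placeholders, not from anyone's project):

    Code:
    # Minimal Scrapy spider sketch -- names and URLs are placeholders.
    import scrapy

    class PageSpider(scrapy.Spider):
        name = "pages"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # Grab the page title, then follow every link on the page.
            yield {"url": response.url, "title": response.css("title::text").get()}
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)
    Run it with "scrapy runspider spider.py -o pages.json" and Scrapy handles the request scheduling and concurrency for you.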
     
  3. SeanAustin

    SeanAustin Power Member

    Joined:
    Mar 4, 2013
    Messages:
    740
    Likes Received:
    711
    Location:
    Rocky Mountains
    Rumor has it that ScrapeBox can do just this.
     
  4. TrevorB

    TrevorB Senior Member

    Joined:
    Dec 21, 2011
    Messages:
    1,185
    Likes Received:
    361
    Location:
    Canada
    I heard this also. There was another one too, I think GScraper.
     
  5. ionut.hulub

    ionut.hulub Newbie

    Joined:
    Feb 24, 2013
    Messages:
    23
    Likes Received:
    1
    The question seems to be what's the fastest language for web scraping.

    The answer is: they're all just as fast, because crawlers are very much I/O-dependent. That means the program spends most of its time just waiting for a webpage to download, so computing speed is not that important.

    It is, however, a lot faster to develop crawlers in Python compared to most other languages, and Python has some great libs for it, so I think it would be your best choice.
    PS.
    I've made hundreds of crawlers, so you could say I know what I'm talking about.
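    To illustrate the point: since each request mostly waits on the network, the win comes from overlapping those waits, not from a faster language. A rough sketch using just Python's standard library (URLs are made up):

    Code:
    # Overlapping network waits with threads -- URLs are placeholders.
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    urls = ["http://example.com/page%d" % i for i in range(50)]

    def fetch(url):
        with urlopen(url, timeout=10) as resp:
            return url, len(resp.read())

    # 20 downloads in flight at once; total time is roughly that of the
    # slowest responses, not the sum of all of them.
    with ThreadPoolExecutor(max_workers=20) as pool:
        for url, size in pool.map(fetch, urls):
            print(url, size)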
     
  6. member8200

    member8200 Regular Member

    Joined:
    Aug 9, 2014
    Messages:
    469
    Likes Received:
    33
    ScrapeBox or ScrapeJet can do the job. No need for a programming language. :)
     
  7. Conadovan

    Conadovan Newbie

    Joined:
    Oct 24, 2014
    Messages:
    7
    Likes Received:
    5
    As others have pointed out, if you are willing to pay and don't want to create anything yourself, tools like ScrapeBox will do just fine, provided you have a sufficient number of working proxies to bypass any connection limiters the site you are trying to scrape may implement.

    If, like most code-proficient people, you would rather create your own software for free, the language is largely unimportant, as the primary time limiter for scraping will be the time it takes to download the page source over a connection. As someone pointed out, however, Python is probably the easiest to learn and has some great data transfer libraries (I personally don't use Python much, for no good reason :p).

    As stated, the main issue with scraping lots of data fast is not the speed of the language but the speed at which you can make new and simultaneous connections to a site without getting blocked. Simply connecting to a site 10 times a second from the IP of your personal VPS (or whatever you use) will get you blocked in a heartbeat, so the primary art of scraping is finding new methods to connect to a site from lots of IPs, and making those connections seem organic (i.e., look like a normal human web user made them).

    Making scraping requests look organic is a whole sub-field in and of itself, but one basic tip is making sure your requests are REALLY only as fast as you NEED them to be; though it's nice to get 100,000 keywords in a day, do you really, absolutely need them that fast? Most methods sites use to stop scraping are far more forgiving of, say, 1 connection per second vs. 5 per second. Throwing some randomness into the intervals between connections is also a simple thing you can do that helps a lot.
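    Even something this simple (Python, with an arbitrary 1-4 second range) already looks a lot less mechanical than a fixed-rate loop:

    Code:
    # Jittered pauses between requests -- the 1-4 second range is arbitrary.
    import random
    import time
    from urllib.request import urlopen

    urls_to_scrape = ["http://example.com/kw/%d" % i for i in range(100)]  # placeholder list

    for url in urls_to_scrape:
        html = urlopen(url, timeout=10).read()
        print(url, len(html))                 # stand-in for your real parsing
        time.sleep(random.uniform(1.0, 4.0))  # random interval, not a fixed one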

    Now, as far as finding ways to get lots of adequately fast IPs to connect from (ideally at as low a cost as possible), proxies are of course the number one method. A gazillion people offer private lists of proxy IPs for scraping here on BHW and across the web, but at a price. For less dependable connections there are a ton of free proxy lists you can find out on the web with a little work; just search "proxy list" on BHW to find a few.

    A few more common methods, besides proxies, include spinning up multiple Amazon EC2 instances or instances from any other micro-instance provider (which can be run for cents apiece; you get a new IP per instance), using the Tor network (which gives you ~1,000 IPs to work with, but the IPs are publicly known and often blocked by sites), sending requests through Yahoo Pipes (never used it, so I can't report on it; supposedly it has a large pool of IPs), and buying large blocks of IPv6 addresses relatively cheaply.
     
  8. xNotch

    xNotch Registered Member

    Joined:
    Sep 16, 2014
    Messages:
    81
    Likes Received:
    19
    As said above, the programming language the scraper is written in really does nothing (significant) for speed. A scraping bot is going to spend 90% of its time waiting for a network response, so the network is really what dictates the speed of a scraper. There is an exception: if you're saving a lot of the data you scrape, and not just analyzing it, then you'll also have to think about your disk I/O speed. For this reason I suggest scraping from a VPS, which has nice SLA bandwidth and resides in a data center with consistent speeds.
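    If you do end up persisting everything, one common trick is to buffer records and write them to disk in batches instead of once per page. A rough sketch (batch size and file name are arbitrary):

    Code:
    # Buffer scraped records and flush in batches -- sizes are arbitrary.
    import json

    BATCH_SIZE = 500  # tune to your disk and record size
    buffer = []

    with open("scraped.jsonl", "a") as out:
        for i in range(2000):  # stand-in for the real scrape loop
            record = {"url": "http://example.com/%d" % i, "n": i}
            buffer.append(json.dumps(record))
            if len(buffer) >= BATCH_SIZE:
                out.write("\n".join(buffer) + "\n")
                buffer = []
        if buffer:  # flush whatever is left at the end
            out.write("\n".join(buffer) + "\n")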
     
    Last edited: Nov 20, 2014
  9. ItsBlinkHere

    ItsBlinkHere Regular Member

    Joined:
    Apr 27, 2014
    Messages:
    409
    Likes Received:
    150
    Location:
    At Large
    Hmmm....are you sure? :)
     
  10. boberbrian

    boberbrian Registered Member

    Joined:
    Jul 6, 2013
    Messages:
    71
    Likes Received:
    22
    You probably want to distinguish development time from run-time.

    Run-time will be slower with Python or any other interpreted language, while development will be faster because it's a higher-level language.
    Development time will be slower with C, but run-time will be quicker.

    Today, development time costs much more than run-time, so I would suggest using Python.

    Also, use multi-processing (not multithreading!) and you get optimal speed with good development time.
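    A minimal version of that suggestion with the standard library (worker count and URLs are placeholders):

    Code:
    # Fetching with a process pool, per the suggestion above -- placeholders throughout.
    from multiprocessing import Pool
    from urllib.request import urlopen

    def fetch(url):
        with urlopen(url, timeout=10) as resp:
            return url, len(resp.read())

    if __name__ == "__main__":  # required for multiprocessing on Windows
        urls = ["http://example.com/page%d" % i for i in range(100)]
        with Pool(processes=8) as pool:
            for url, size in pool.imap_unordered(fetch, urls):
                print(url, size)
    One design note: for pure downloading, threads do about as well, since Python releases the interpreter lock while waiting on the network; processes mainly pay off once the parsing itself gets heavy.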

    Python runs on Windows/Mac/Linux, so you can run it on whatever server you have.

    Good luck!
     
  11. Numbuh362

    Numbuh362 Elite Member

    Joined:
    Aug 22, 2012
    Messages:
    1,568
    Likes Received:
    461
    Bots/software like ScrapeBox.
     
  12. PHPInjected

    PHPInjected Elite Member

    Joined:
    Apr 25, 2014
    Messages:
    2,144
    Likes Received:
    1,839
    Occupation:
    100% Unique Content Writer
    Location:
    Overriding Methods
    Home Page:
    You can run an iMacros + JavaScript script that writes the results out to a .csv file.
     
  13. botcode

    botcode Newbie

    Joined:
    Oct 30, 2014
    Messages:
    16
    Likes Received:
    1
    To crawl a lot of pages, you need an efficient multithreaded crawler.
    Java and Python seem to be the best languages for building a multithreaded crawler, for the following reasons:
    1. Multithreading support (a bare-bones threaded fetcher is sketched below).
    2. A good set of libraries: Python has Scrapy, Java has crawler4j.
    3. There are a lot of tutorials for writing crawlers in Python and Java.
    4. A lot of high-quality open-source libraries are available, so you can easily analyse/process the extracted data.
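    To show the shape of it, a bare-bones threaded fetcher in Python (seed URL and thread count are placeholders):

    Code:
    # Bare-bones multithreaded fetcher -- seed URL and counts are placeholders.
    import queue
    import threading
    from urllib.request import urlopen

    todo = queue.Queue()
    todo.put("http://example.com/")

    def worker():
        while True:
            url = todo.get()
            try:
                html = urlopen(url, timeout=10).read()
                # A real crawler would parse html here and todo.put() new links
                # (guarded by a "seen" set so pages aren't fetched twice).
                print(url, len(html))
            except Exception as exc:
                print("failed:", url, exc)
            finally:
                todo.task_done()

    for _ in range(8):
        threading.Thread(target=worker, daemon=True).start()

    todo.join()  # returns once every queued URL has been processed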

    To crawl a "lot of pages", you will require to use a distributed multithreaded crawler. Such systems can get really complex. This is when you will start looking for open sources crawlers like,
    1. Apache Nutch
    2. Heritrix (The crawler that wayback uses :))

    Search for how-to-crawl-a-quarter-billion-webpages-in-40-hours and you may find something interesting.
     
  14. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    956
    Likes Received:
    667
    Occupation:
    Web/Bot Developer
    The question OP asked is "What's the fastest way to scrape a lot of pages?"
    iMacros + JavaScript can definitely scrape web pages, but it is clearly not the fastest way to scrape a lot of pages.

     
  15. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    956
    Likes Received:
    667
    Occupation:
    Web/Bot Developer
    +1 Great article:
    Code:
    http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/
     
  16. royserpa

    royserpa Jr. VIP Premium Member

    Joined:
    Sep 28, 2011
    Messages:
    4,901
    Likes Received:
    3,670
    Gender:
    Male
    Occupation:
    Negative Options aka Rebills!
    Location:
    Roy's VCCs e-Shop
    Home Page:
    If you want to download pages locally to view them later, you should try wget (http://ftp.gnu.org/gnu/wget/)

    Then use command:

    Code:
    wget -r URL.GOES.HERE
    And it will scrape all pages ;)

    Wget has tons of options (rate limiting, recursion depth, accept filters, etc.), so you should check the documentation as well as sites like Stack Overflow ;)
     
  17. botcode

    botcode Newbie

    Joined:
    Oct 30, 2014
    Messages:
    16
    Likes Received:
    1
    There is one more tool: aria2c. Sometimes aria2c/wget + grep proves more useful.
    :)
     
  18. pr250

    pr250 Junior Member

    Joined:
    Apr 7, 2010
    Messages:
    108
    Likes Received:
    23
    It really depends on what you are doing with the data. If you are simply downloading a lot, calling curl/wget from any of the languages would work. If you are manipulating the data or extracting only certain pieces, it would depend on exactly what you're doing; each language has benefits and downsides.
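    E.g., if you just want the raw pages, shelling out to wget from Python is often enough (URL list and output folder are placeholders):

    Code:
    # Driving wget from Python for bulk downloads -- URLs are placeholders.
    import subprocess

    urls = ["http://example.com/page%d" % i for i in range(10)]
    for url in urls:
        # -q: quiet, -P: save into the "downloads" directory
        subprocess.run(["wget", "-q", "-P", "downloads", url], check=False)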
     
  19. tolos

    tolos Newbie

    Joined:
    Aug 12, 2015
    Messages:
    16
    Likes Received:
    0
    I need a bot developed. If you are experienced in writing bots that can work in the background, please send me your Skype or mail.

    Thanks!