Quick Questions about Scraping Google

Discussion in 'BlackHat Lounge' started by agag2, Dec 10, 2012.

  1. agag2

    agag2, Supreme Member (joined Feb 17, 2009; 1,309 messages; 254 likes received)
    Hello


    I'm working with a programmer to create a tool that will scrape Google search results, but we have run into several problems. (I cannot use ScrapeBox for this, so I'm building a custom solution.)


    1. If we use the Google API, we're limited to 100 searches per day, which is far too few for what we need.


    If we scrape the raw HTML (no API), we're limited to 1,000 searches per day. That is what my programmer claims. Is it true? I thought it was 1,000 per hour.


    For this, I believe the only solution is scraping with proxies. Correct?


    2. If we query Google many times at the same time, we get a CAPTCHA.


    Is there a way around this? What is the maximum number of queries we can run without getting a CAPTCHA?


    More specifically, I've used ScrapeBox, and I can query Google tens of thousands of times per hour with a dozen proxies and never get a CAPTCHA once. So why would we get a CAPTCHA after sending only several queries simultaneously?


    How do they do it?


    Lastly, does ScrapeBox use HTML scraping or the Google API? (I doubt it uses the API; I'm just confirming.) And if it does use HTML, how does it scrape so fast? My programmer claims that scraping via HTML will be very slow.


    BTW, I plan on hitting the server several hundred thousand times per day, if not per hour.
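
    To put rough numbers on that, here is a back-of-the-envelope sketch of the arithmetic. The target of 300,000 queries per day is only an assumed stand-in for "several hundred thousand", the 1,000-per-day cap is my programmer's unverified claim, and the function names and the idea of a per-source cap (one source being one API key or one IP address) are made up for illustration.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_rate_per_second(queries_per_day: int) -> float:
    """Average request rate needed to spread a daily volume evenly over 24 hours."""
    return queries_per_day / SECONDS_PER_DAY


def sources_needed(queries_per_day: int, cap_per_source_per_day: int) -> int:
    """How many capped sources (hypothetical: one API key or one IP each)
    it takes to cover a daily volume, rounding up."""
    return math.ceil(queries_per_day / cap_per_source_per_day)


# Assumed target: "several hundred thousand" is taken as 300,000 for illustration.
target = 300_000

print(f"average rate: {required_rate_per_second(target):.2f} queries/second")
print(f"sources at 100/day (API limit quoted above): {sources_needed(target, 100)}")
print(f"sources at 1,000/day (claimed HTML limit): {sources_needed(target, 1_000)}")
```

    If those assumptions are anywhere near right, the daily caps matter far more than raw scraping speed: the average rate is only a few requests per second, but a 100-per-day cap would take thousands of separate sources to cover it.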


    Any help or insight would be greatly appreciated.


    Thanks
     
  2. agag2

    Anyone ...?