
Problem with scraping google results

Discussion in 'General Scripting Chat' started by SeoMS, Aug 21, 2013.

  1. SeoMS

    SeoMS Newbie

    Joined:
    Feb 17, 2013
    Messages:
    4
    Likes Received:
    1
    Hey everyone!
    I've recently been working on a new PHP project that includes a Google results scraper. The scraper
    works really well and can pull up to 2,400 results a minute. The only problem is that after a few tests my IP was probably blocked
    by Google, and as a result my scraper stopped working.

    I already tried connecting through proxies, but it's really slow and doesn't even work.
    Do you guys have any idea how I can avoid being blocked by Google, or somehow bypass their block?

    Thanks in advance :)
     
  2. innozemec

    innozemec Jr. VIP Jr. VIP

    Joined:
    Aug 19, 2011
    Messages:
    5,316
    Likes Received:
    1,800
    Location:
    www.Indexification.com
    Home Page:
    The only way to safely scrape Google is to use random timeouts with lots of proxies and do things the slow way. If you go too fast, your IPs will get blocked.
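    The "random timeouts plus a pool of proxies" idea can be sketched roughly as below. This is Python rather than PHP for brevity, and the proxy addresses are placeholders -- the same approach carries over to PHP/cURL:

```python
import itertools
import random
import time

# Placeholder proxy addresses -- substitute your own list.
PROXIES = ["http://1.2.3.4:8080", "http://5.6.7.8:8080"]
proxy_pool = itertools.cycle(PROXIES)

def next_proxy():
    """Round-robin through the proxy list so no single IP takes every request."""
    return next(proxy_pool)

def polite_delay(min_s=10.0, max_s=30.0):
    """Sleep a random interval between requests instead of hitting a fixed rate."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

    Each request would then call `next_proxy()` for its connection and `polite_delay()` before the next one.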
     
  3. SeoMS

    SeoMS Newbie

    Joined:
    Feb 17, 2013
    Messages:
    4
    Likes Received:
    1
    Can shared proxies work in this case? Also, how many is "lots of proxies"?

    Edit: I tried using some working proxies with cURL after my IP got blocked, but it seems like Google still recognizes me, or recognizes
    that I'm using a proxy, so I can't scrape properly.
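    For reference, routing a request through a proxy (the equivalent of `curl -x`) looks roughly like this in Python's urllib; the proxy address is a placeholder:

```python
import urllib.request

def build_proxy_opener(proxy):
    """Return an opener that routes both HTTP and HTTPS through one proxy.

    `proxy` is a placeholder address like "http://user:pass@1.2.3.4:8080".
    """
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)

# Usage (needs a live, working proxy):
# opener = build_proxy_opener("http://1.2.3.4:8080")
# html = opener.open("https://www.google.com/search?q=test", timeout=15).read()
```

    Note that a transparent or low-quality shared proxy can still leak your real IP in headers, which may be why Google keeps recognizing you.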
     
    Last edited: Aug 22, 2013
  4. HFlame7

    HFlame7 Regular Member

    Joined:
    Jun 20, 2011
    Messages:
    277
    Likes Received:
    156
    Private proxies are the best. Yes, they are the most expensive, but they save hassle.
    Also, if you can change/fake your user agent, that helps delay or even avoid the ban.

    As for how many you need: it depends on how much scraping you're doing. I'd say get at least 10 private proxies, change the user agent, use random timeouts, and test how long all 10 of them last before getting banned.
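    Faking the user agent per request can be sketched like this in Python (the agent strings below are only illustrative examples and should be kept current):

```python
import random

# A few illustrative desktop user-agent strings -- not exhaustive.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0",
]

def request_headers():
    """Pick a fresh random user agent for each request, mimicking a mix of real browsers."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
```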
     
  5. malcsimm

    malcsimm Newbie

    Joined:
    Mar 11, 2010
    Messages:
    17
    Likes Received:
    4
    Location:
    Brighton, UK
    That's interesting, Caravel. I wonder if you - or anyone - can answer this for me: I have had a desktop application written which throws two queries at Google, then scrapes the URLs from the top 10 results of each, which I can then paste into a document.

    Now I want to enhance the application so I can paste in 40 or 50 URLs and it will do the queries one by one and then automatically paste the resulting URLs from each results page into Excel.

    Say it's 40 URLs; then this will be making 80 results-page requests to Google, one after the other.

    Would this risk getting my IP banned?

    - Should I wait, say, 5-10-15 seconds (or a random number of seconds) between each request?
    - Is it desirable to use private proxies? I've got 20 to use with SB, so I can do that - I just have to get it programmed in, and I'm not sure whether it's worth it.
    - If I need to use proxies, what's the best way for my coder to use them? I see you say to change them after 29 requests - is it wise to use the 20 proxies, say, alternately and maybe limit the requests per minute?

    Thanks for any help. Plonking your proxies into Scrapebox is nice and easy, but now that I may have to use them with my own software I realise I'm a bit clueless.

    Cheers

    Malc
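    The "use the proxies alternately and limit the requests per minute" idea from the questions above can be sketched as a small scheduler. This is a rough Python sketch (the class name and cap value are made up for illustration):

```python
import collections
import itertools
import time

class ProxyScheduler:
    """Alternate through a fixed proxy list and cap total requests per minute."""

    def __init__(self, proxies, max_per_minute=10):
        self._pool = itertools.cycle(proxies)
        self._max = max_per_minute
        self._times = collections.deque()  # timestamps of recent requests

    def acquire(self):
        """Return the next proxy, sleeping first if the per-minute cap is hit."""
        now = time.monotonic()
        # Forget requests older than the 60-second window.
        while self._times and now - self._times[0] > 60:
            self._times.popleft()
        if len(self._times) >= self._max:
            # Wait until the oldest request falls out of the window.
            time.sleep(60 - (now - self._times[0]))
        self._times.append(time.monotonic())
        return next(self._pool)
```

    Your coder would call `acquire()` before every Google request and use the returned proxy for that one request.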
     
  6. malcsimm

    malcsimm Newbie

    Joined:
    Mar 11, 2010
    Messages:
    17
    Likes Received:
    4
    Location:
    Brighton, UK
    OK - I think I answered my own question lol!

    I found what looks to me like a decent article:

    searchnewscentral.com/20110928186/General-SEO/how-to-scrape-search-engines-without-pissing-them-off.html

    He says:
    "Based on all of this, here are my guidelines for scraping results:

    1. Scrape slowly. Don't pound the crap out of Google or Bing. Make your script pause for at least 20 seconds between queries.
    2. Scrape randomly. Randomize the amount of time between queries.
    3. Be a browser. Have a list of typical user agents (browsers). Choose one of those randomly for each query.
    Follow all three of these and you're a well-behaved scraper."

    He did a test with scraping three ways - one got banned instantly, one after 3 SE results, and the last not at all.

    If anyone has more to add, that would be nice :)
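    Taken together, the three quoted guidelines might look something like this in Python (the user-agent strings are illustrative, and actually running `scrape_queries` would hit Google, so use it at your own risk):

```python
import random
import time
import urllib.parse
import urllib.request

# Illustrative user-agent strings -- keep your own list current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0",
]

def build_query_request(query):
    """Guideline 3: send each query with a randomly chosen browser user agent."""
    url = "https://www.google.com/search?q=" + urllib.parse.quote(query)
    return urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})

def scrape_queries(queries):
    """Fetch one results page per query, pausing between requests."""
    pages = []
    for query in queries:
        with urllib.request.urlopen(build_query_request(query)) as resp:
            pages.append(resp.read())
        # Guidelines 1 and 2: pause at least 20 seconds, randomized.
        time.sleep(random.uniform(20, 40))
    return pages
```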