
Problem with scraping google results

Discussion in 'General Scripting Chat' started by SeoMS, Aug 21, 2013.

  1. SeoMS

    SeoMS Newbie

    Joined:
    Feb 17, 2013
    Messages:
    4
    Likes Received:
    1
    Hey everyone!
    I've recently been working on a new PHP project that includes a Google results scraper. The scraper
    works really well and can pull up to 2,400 results a minute. The only problem is that after a few tests my IP was probably blocked
    by Google, and as a result my scraper stopped working.

    I already tried connecting through proxies, but it's really slow and doesn't even work.
    Do you guys have any idea how I can avoid being blocked by Google, or somehow bypass their block?

    Thanks in advance :)
     
  2. innozemec

    innozemec Jr. VIP Jr. VIP

    Joined:
    Aug 19, 2011
    Messages:
    5,316
    Likes Received:
    1,800
    Location:
    www.Indexification.com
    Home Page:
    The only way to safely scrape Google is to use random timeouts with lots of proxies and do things the slow way. If you go too fast, your IPs will get blocked.
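    The "random timeouts plus a pool of proxies" idea can be sketched roughly as below. This is Python rather than PHP for brevity, and the proxy addresses are placeholders -- the same approach carries over to PHP/cURL:

```python
import itertools
import random
import time

# Placeholder proxy addresses -- substitute your own list.
PROXIES = ["http://1.2.3.4:8080", "http://5.6.7.8:8080"]
proxy_pool = itertools.cycle(PROXIES)

def next_proxy():
    """Round-robin through the proxy list so no single IP takes every request."""
    return next(proxy_pool)

def polite_delay(min_s=10.0, max_s=30.0):
    """Sleep a random interval between requests instead of hitting a fixed rate."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

    Each request would then call `next_proxy()` for its connection and `polite_delay()` before the next one.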
     
  3. SeoMS

    SeoMS Newbie

    Joined:
    Feb 17, 2013
    Messages:
    4
    Likes Received:
    1
    Can shared proxies work in this case? Also, how many is "lots of proxies"?

    Edit: I tried using some working proxies with cURL after my IP got blocked, but it seems like Google still recognizes me, or recognizes
    that I'm using a proxy, so I can't scrape properly.
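    For reference, routing a request through a proxy (the equivalent of `curl -x`) looks roughly like this in Python's urllib; the proxy address is a placeholder:

```python
import urllib.request

def build_proxy_opener(proxy):
    """Return an opener that routes both HTTP and HTTPS through one proxy.

    `proxy` is a placeholder address like "http://user:pass@1.2.3.4:8080".
    """
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)

# Usage (needs a live, working proxy):
# opener = build_proxy_opener("http://1.2.3.4:8080")
# html = opener.open("https://www.google.com/search?q=test", timeout=15).read()
```

    Note that a transparent or low-quality shared proxy can still leak your real IP in headers, which may be why Google keeps recognizing you.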
     
    Last edited: Aug 22, 2013
  4. HFlame7

    HFlame7 Regular Member

    Joined:
    Jun 20, 2011
    Messages:
    277
    Likes Received:
    156
    Private proxies are the best. Yes, they are the most expensive, but they save hassle.
    Also, if you can change/fake your user agent, that helps delay or even avoid the ban.

    As for how many you need: it depends on how much scraping you're doing. I'd say get at least 10 private proxies, change the user agent, use random timeouts, and test how long all 10 of them last before getting banned.
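    Faking the user agent per request can be sketched like this in Python (the agent strings below are only illustrative examples and should be kept current):

```python
import random

# A few illustrative desktop user-agent strings -- not exhaustive.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0",
]

def request_headers():
    """Pick a fresh random user agent for each request, mimicking a mix of real browsers."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
```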
     
  5. malcsimm

    malcsimm Newbie

    Joined:
    Mar 11, 2010
    Messages:
    17
    Likes Received:
    4
    Location:
    Brighton, UK
    That's interesting, Caravel. I wonder if you - or anyone - can answer this for me: I have had a desktop application written which throws two queries at Google, then scrapes the URLs from the top 10 results of each, which I can then paste into a document.

    Now I want to enhance the application so I can paste in 40 or 50 URLs and it will do the queries one by one and then automatically paste the resulting URLs from each results page into Excel.

    Say it's 40 URLs; then this will be making 80 results-page requests to Google, one after the other.

    Would this risk getting my IP banned?

    - Should I wait, say, 5-10-15 seconds (or a random number of seconds) between each request?
    - Is it desirable to use private proxies? I've got 20 to use with SB, so I can do that - I just have to get it programmed in, and I'm not sure whether it's worth it.
    - If I need to use proxies, what's the best way for my coder to use them? I see you say to change them after 29 requests - is it wise to use the 20 proxies, say, alternately and maybe limit the requests per minute?

    Thanks for any help. Plonking your proxies into Scrapebox is nice and easy, but now that I may have to use them with my own software I realise I'm a bit clueless.

    Cheers

    Malc
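    The "use the proxies alternately and limit the requests per minute" idea from the questions above can be sketched as a small scheduler. This is a rough Python sketch (the class name and cap value are made up for illustration):

```python
import collections
import itertools
import time

class ProxyScheduler:
    """Alternate through a fixed proxy list and cap total requests per minute."""

    def __init__(self, proxies, max_per_minute=10):
        self._pool = itertools.cycle(proxies)
        self._max = max_per_minute
        self._times = collections.deque()  # timestamps of recent requests

    def acquire(self):
        """Return the next proxy, sleeping first if the per-minute cap is hit."""
        now = time.monotonic()
        # Forget requests older than the 60-second window.
        while self._times and now - self._times[0] > 60:
            self._times.popleft()
        if len(self._times) >= self._max:
            # Wait until the oldest request falls out of the window.
            time.sleep(60 - (now - self._times[0]))
        self._times.append(time.monotonic())
        return next(self._pool)
```

    Your coder would call `acquire()` before every Google request and use the returned proxy for that one request.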
     
  6. malcsimm

    malcsimm Newbie

    Joined:
    Mar 11, 2010
    Messages:
    17
    Likes Received:
    4
    Location:
    Brighton, UK
    OK - I think I answered my own question lol!

    I found what looks to me like a decent article:

    searchnewscentral.com/20110928186/General-SEO/how-to-scrape-search-engines-without-pissing-them-off.html

    He says:
    "Based on all of this, here are my guidelines for scraping results:

    1. Scrape slowly. Don't pound the crap out of Google or Bing. Make your script pause for at least 20 seconds between queries.
    2. Scrape randomly. Randomize the amount of time between queries.
    3. Be a browser. Have a list of typical user agents (browsers). Choose one of those randomly for each query.
    Follow all three of these and you're a well-behaved scraper."

    He did a test with scraping three ways - one got banned instantly, one after 3 SE results, and the last not at all.

    If anyone has more to add, that would be nice :)
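    Taken together, the three quoted guidelines might look something like this in Python (the user-agent strings are illustrative, and actually running `scrape_queries` would hit Google, so use it at your own risk):

```python
import random
import time
import urllib.parse
import urllib.request

# Illustrative user-agent strings -- keep your own list current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0",
]

def build_query_request(query):
    """Guideline 3: send each query with a randomly chosen browser user agent."""
    url = "https://www.google.com/search?q=" + urllib.parse.quote(query)
    return urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})

def scrape_queries(queries):
    """Fetch one results page per query, pausing between requests."""
    pages = []
    for query in queries:
        with urllib.request.urlopen(build_query_request(query)) as resp:
            pages.append(resp.read())
        # Guidelines 1 and 2: pause at least 20 seconds, randomized.
        time.sleep(random.uniform(20, 40))
    return pages
```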