Discussion in 'Black Hat SEO Tools' started by YSL, Nov 6, 2009.
Has anyone had issues harvesting URLs from Google recently?
How could you have an issue with this? They can't do anything to prevent it once the page has loaded. You should never use a bot to click on search results, if that's what you're doing, since they catch on to that pretty fast. Plain harvesting is fine as long as you set appropriate delays between pressing "next" and let it wait a while in between. I'm not sure of the exact amount of time, but think of how fast a real person would go through a page of results, so it has to be reasonable.
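The pacing idea above can be sketched roughly like this. The thread's code was in PHP; this is the same idea in Python, and the base/jitter values are just guesses at a human-like pace, not numbers from the post:

```python
import random
import time

def human_delay(base=8.0, jitter=4.0):
    """Return a randomized pause in seconds, roughly how long a person
    might spend reading one page of results before clicking "next"."""
    return base + random.uniform(0.0, jitter)

def paced_fetch(page_urls, fetch, base=8.0, jitter=4.0):
    """Fetch each results page in turn, sleeping a human-like, slightly
    randomized interval between requests. `fetch` stands in for whatever
    HTTP routine you already use."""
    pages = []
    for url in page_urls:
        pages.append(fetch(url))
        time.sleep(human_delay(base, jitter))
    return pages
```

The jitter matters: a fixed delay between every "next" click is itself a bot signature, while a randomized one at least resembles a person's uneven reading speed.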
Unless you're doing massive scraping runs, you ought to go through their search API. Much more convenient than conventional scraping, IMHO.
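For what the API route looks like: at the time of this thread, Google offered the AJAX Search API. The endpoint and parameter names below are from memory and may have changed since, so treat them as assumptions and check the current docs; the sketch only builds the request URL rather than making the call:

```python
from urllib.parse import urlencode

# Assumed endpoint of the old Google AJAX Search API (circa 2009);
# verify against current documentation before relying on it.
ENDPOINT = "http://ajax.googleapis.com/ajax/services/search/web"

def build_search_url(query, start=0):
    """Build a request URL for one page of API results, instead of
    scraping the HTML results page."""
    params = {"v": "1.0", "q": query, "start": start}
    return ENDPOINT + "?" + urlencode(params)
```

The API returns structured JSON, which is why it beats parsing markup that (as noted below) can change between page loads.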
I've scraped a few million hits from Google.
I'm writing an article about it at the moment, and free source code (PHP) will be released soon.
I would suggest using the search API as well.
When I first started scraping Google links, I remember having to make a slight workaround: the pattern of tags that set the actual page URLs apart from everything else changed each time my program opened the page.
My first program matched the related searches, which returned URLs like
So I just did a preg_replace of the beginning part with http:// and got the final URL.
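The exact prefix being stripped isn't shown in the post, so the "/url?q=" indirection below is a hypothetical stand-in; the original used PHP's preg_replace, and this is the same substitution in Python:

```python
import re

def clean_result_url(raw):
    """Strip a redirect-style prefix so only the target URL remains,
    mirroring the preg_replace step described in the post. The
    '/url?q=' pattern is an assumed example, not the actual prefix."""
    return re.sub(r"^/url\?q=", "", raw)
```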
That would be cool.