
[Proxies required for testing] Google Search Scraper

Discussion in 'Black Hat SEO Tools' started by matessim, Mar 24, 2014.

  1. matessim

    matessim Junior Member

    Joined:
    Nov 22, 2008
    Messages:
    164
    Likes Received:
    72
    Occupation:
    Being funny and kind to puppies
    Location:
    UT 2003
    Hi Guys,

    I've written a scraper that can scrape Google fairly well and fairly fast, without getting blocked until it has made thousands or tens of thousands of requests in a rather short amount of time.

    I'll be glad to release this to you guys and build it to do whatever anyone here is interested in (it can scrape the URLs, descriptions, and titles at the moment, with possibly more features such as the cached text of the page and preview icons for videos if anyone is interested).

    I've hit the problem that I simply can't test it with multiple threads and multiple proxies, since I can't seem to find any that work / haven't already been blocked by Google; they're being hammered by Scrapebox/XRumer while I'm hoping to use them. If anyone can help with highly anonymous SOCKS5 (HTTPS should be fine too) proxies, I'll be glad to continue developing this.

    Here's a small taster:
    [attached screenshot]

    Note: this is not via the Google API or anything; it's actually a headless browser (with JS evaluation, since Google won't load properly otherwise) that's visiting and parsing the page.
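    Since the whole blocker right now is finding proxies that aren't already burned, here's the kind of pre-check I'd run on candidates before feeding them to the scraper. This is only a rough sketch and not part of google_search.py: it assumes the requests library with SOCKS support (requests[socks]/PySocks) is installed, and the "unusual traffic" string match is just a naive heuristic for Google's block page.

    Code:
    import concurrent.futures
    import requests

    TEST_URL = "https://www.google.com/search?q=test"
    HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko Firefox/11.0"}

    def proxy_alive(proxy):
        """Return True if a Google SERP loads through this proxy without an obvious block page."""
        proxies = {"http": proxy, "https": proxy}  # e.g. "socks5://1.2.3.4:1080"
        try:
            r = requests.get(TEST_URL, headers=HEADERS, proxies=proxies, timeout=10)
        except requests.RequestException:
            return False
        # Naive heuristic: HTTP 200 without the "unusual traffic" interstitial counts as usable
        return r.status_code == 200 and "unusual traffic" not in r.text.lower()

    def filter_proxies(candidates, workers=20):
        """Check candidate proxies in parallel and keep only the ones that still respond."""
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            ok_flags = list(pool.map(proxy_alive, candidates))
        return [p for p, ok in zip(candidates, ok_flags) if ok]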
     

    Attached Files:

    Last edited: Mar 24, 2014
  2. matessim

    matessim Junior Member

    Joined:
    Nov 22, 2008
    Messages:
    164
    Likes Received:
    72
    Occupation:
    Being funny and kind to puppies
    Location:
    UT 2003
    Anyone want a Google scraper? No demand for this, I take it?
     
  3. 808080Hat

    808080Hat Registered Member

    Joined:
    Feb 1, 2014
    Messages:
    65
    Likes Received:
    24
    Occupation:
    Freelance Software Developer/Architect
    Location:
    Berlin, Germany
    I've got no proxies for you, sorry, but I have some questions:

    1.) What headless browser are you using? PhantomJS by any chance?
    2.) I've built a Google scraper that simply sends out a fake, random user agent, and it works (tried it on the first 100 results only) for around 50 keywords at a time. Where do you think the limitations are?
    3.) What did you do besides using a headless, JS-enabled browser?
    4.) Is your scraper written in python?

    Thanks
     
  4. matessim

    matessim Junior Member

    Joined:
    Nov 22, 2008
    Messages:
    164
    Likes Received:
    72
    Occupation:
    Being funny and kind to puppies
    Location:
    UT 2003
    1. No, Ghost.py, though it's a fork of the trunk I've been working on for a while that has quite a few serious speed-ups.
    2. Really? I tried that and Google wouldn't render for me without JS support. Kudos!
    3. I've implemented a few extra features such as retrieving cached pages, and made it multi-threaded and work-queue based (doing it with stock Ghost.py would chew through a huge amount of memory, since it allocates a lot on every instance creation); I'll implement any other features anyone needs. There's a rough sketch of the work-queue layout below.
    4. google_search.py speaks for itself, I believe.
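    Here's roughly what I mean by work-queue based, as a stripped-down sketch (Python stdlib only; fetch_serp is just a placeholder for the real Ghost.py session call, which I'm leaving out of this snippet):

    Code:
    import queue
    import threading

    def fetch_serp(keyword, proxy):
        """Placeholder: the real version drives a headless Ghost.py session through the proxy."""
        raise NotImplementedError

    def worker(jobs, results, proxy):
        # Each worker keeps one browser session/proxy for its whole lifetime instead of
        # creating a new one per keyword; that's what keeps Ghost.py's memory usage sane.
        while True:
            keyword = jobs.get()
            if keyword is None:          # sentinel: no more work for this worker
                jobs.task_done()
                break
            try:
                results.append((keyword, fetch_serp(keyword, proxy)))
            except Exception:
                results.append((keyword, None))  # record the failure and keep going
            finally:
                jobs.task_done()

    def scrape(keywords, proxies):
        jobs, results = queue.Queue(), []
        threads = [threading.Thread(target=worker, args=(jobs, results, p)) for p in proxies]
        for t in threads:
            t.start()
        for kw in keywords:
            jobs.put(kw)
        for _ in threads:                # one sentinel per worker so they all shut down
            jobs.put(None)
        jobs.join()
        return results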
     
    • Thanks Thanks x 1
    Last edited: Mar 28, 2014
  5. 808080Hat

    808080Hat Registered Member

    Joined:
    Feb 1, 2014
    Messages:
    65
    Likes Received:
    24
    Occupation:
    Freelance Software Developer/Architect
    Location:
    Berlin, Germany
    Thought so too, but then I took a look at the source of the SEO Rank Reporter plugin for WordPress and they do it the same way. I've implemented the same thing in Ruby on a really bare-bones HTTP connector and it works. I've never really tried many keywords, but I was somewhat astonished that I didn't get banned after sending 50+ searches per second. Plus this works for many WordPress installations. Maybe Google is turning a blind eye because of the user agents?

    I think I was a little thrown off by the ActiveMQ, JavaEE stuff.

    Here is the code to my version of the rank checker in Ruby:

    Code:
    require 'nokogiri'
    require 'open-uri'
    
    module EightyHatSeo
      class RankChecker
        def self.checkRank(country, keyword, url)
          # Encode the keyword so spaces and special characters don't break the query string
          search_url = "https://www.google.#{country}/search?q=#{URI.encode_www_form_component(keyword)}&num=100&pws=0"
          user_agent = ['Mozilla/5.0 (Windows; U; Windows NT 6.1; pl; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3',
                        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)',
                        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)',
                        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
                        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; en-GB)',
                        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; MS-RTC LM 8)',
                        '(Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0))',
                        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.0',
                        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.50',
                        'Opera/9.20 (Windows NT 6.0; U; en)',
                        'Opera/9.30 (Nintendo Wii; U; ; 2047-7;en)',
                        'Opera 9.4 (Windows NT 6.1; U; en)',
                        'Opera/9.99 (Windows NT 5.1; U; pl) Presto/9.9.9',
                        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/6.0',
                        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; de-de) AppleWebKit/522.11.1 (KHTML, like Gecko) Version/3.0.3 Safari/522.12.1',
                        'Mozilla/5.0 (Windows; U; Windows NT 5.1; fr-FR) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15',
                        'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15',
                        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_5_2; en-gb) AppleWebKit/526+ (KHTML, like Gecko) Version/3.1 iPhone',
                        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_5; en-us) AppleWebKit/525.25 (KHTML, like Gecko) Version/3.2 Safari/525.25',
                        'Mozilla/5.0 (Windows; U; Windows NT 6.0; ru-RU) AppleWebKit/528.16 (KHTML, like Gecko) Version/4.0 Safari/528.16',
                        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_7; en-us) AppleWebKit/533.4 (KHTML, like Gecko) Version/4.1 Safari/533.4',
                        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko Firefox/11.0',
                        'Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))',
                        'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)',
                        'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; InfoPath.1; SV1; .NET CLR 3.8.36217; WOW64; en-US)',
                        'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11',
                        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11',
                        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24']
    
          # Fetch the SERP with a random user agent; return nil on any error so the check below works
          doc = Nokogiri::HTML(open(search_url, "User-Agent" => user_agent.sample)) rescue nil

          if doc
            # Organic result links in the 2014-era SERP markup live under '#search h3.r a'
            doc.css('#search h3.r a').each_with_index do |link, position|
              # Return the zero-based position of the first result pointing at the target URL
              return position if link['href'].include?(url)
            end
          end
          nil
        end
      end
    end
    
    The interesting part for me was the URL; I didn't know that Google had a &num=100 parameter.
     
    Last edited: Mar 28, 2014
  6. infoking1

    infoking1 Junior Member

    Joined:
    Sep 16, 2010
    Messages:
    182
    Likes Received:
    24
    Home Page:
    Yep, with the 100-results-per-page setting you can gather around 50 keywords without an IP ban; that's 100 * 50 = 5,000 results, though not every time. It also depends on the time interval between each query, so just adjust the time delay and you can extend how much you get. A simple throttled loop is sketched below.
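    For example, a dead-simple throttled loop looks something like this (a sketch in Python to match the scraper in this thread; fetch_serp just stands in for whatever request-and-parse code you use):

    Code:
    import time

    DELAY_SECONDS = 60  # tune this: a longer delay means fewer results per day but less ban risk

    def fetch_serp(keyword):
        """Stand-in for the actual request/parse code (100 results per page via &num=100)."""
        raise NotImplementedError

    def scrape_throttled(keywords):
        results = {}
        for keyword in keywords:
            results[keyword] = fetch_serp(keyword)
            time.sleep(DELAY_SECONDS)  # fixed pause between queries to stay under the radar
        return results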
     
  7. 808080Hat

    808080Hat Registered Member

    Joined:
    Feb 1, 2014
    Messages:
    65
    Likes Received:
    24
    Occupation:
    Freelance Software Developer/Architect
    Location:
    Berlin, Germany
    Do you think the randomized user agent makes any difference? What's the optimal delay? I've been experimenting with 5 minutes on a server. That worked OK; at that rate I can scan (1440 minutes per day / 5 minutes) * 50 = 14,400 keywords per day per IP.
     
  8. matessim

    matessim Junior Member

    Joined:
    Nov 22, 2008
    Messages:
    164
    Likes Received:
    72
    Occupation:
    Being funny and kind to puppies
    Location:
    UT 2003
    I would doubt the helpfulness of randomizing the user agent within the same IP address, though it's worth experimenting with.
     
    • Thanks Thanks x 1
  9. akacash

    akacash Jr. VIP Jr. VIP

    Joined:
    Jan 16, 2010
    Messages:
    805
    Likes Received:
    575
    Location:
    The Beach, USA
    Hey, I have a general question for you guys. Are you searching/scraping over http or https? Or actually, what host are you using to scrape from? I can see the Ruby rank checker there is using https, but I'm curious whether that's the same thing you're using for scraping.
     
  10. 808080Hat

    808080Hat Registered Member

    Joined:
    Feb 1, 2014
    Messages:
    65
    Likes Received:
    24
    Occupation:
    Freelance Software Developer/Architect
    Location:
    Berlin, Germany
    I've been experimenting a little bit with http and got similar results; https was just the last one I used. Since my script isn't optimized for performance, I didn't give the http-vs-https question that much thought.
     
  11. akacash

    akacash Jr. VIP Jr. VIP

    Joined:
    Jan 16, 2010
    Messages:
    805
    Likes Received:
    575
    Location:
    The Beach, USA
    The reason I ask is that I found about 25,000 proxies that'll work against the regular http protocol for link scraping, and I believe for page rank checking as well, although I'm not as sure about that. Shoot me a PM, either one of you, and I'll send a few over for you to test.
     
  12. 808080Hat

    808080Hat Registered Member

    Joined:
    Feb 1, 2014
    Messages:
    65
    Likes Received:
    24
    Occupation:
    Freelance Software Developer/Architect
    Location:
    Berlin, Germany
    Thanks Aka, but if I need proxies I'll use the web service URL of the Firefox Stealthy extension. They're usually of very bad quality, but they're so easy to grab and you can choose the country.

    Code:
    http://rcp.stealthy.co/GetProxy?countryCode=DE
    
    Parameter is optional.

    Btw, does anybody else know a similar web service for getting proxies without much complication?
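    For the Python side of this thread, pulling a proxy from that endpoint and routing a request through it would look roughly like the sketch below. Note I'm assuming the service returns a bare ip:port string in the response body, which you'd want to verify first.

    Code:
    import urllib.request

    PROXY_SERVICE = "http://rcp.stealthy.co/GetProxy?countryCode=DE"  # countryCode is optional

    def get_proxy():
        """Fetch one proxy from the Stealthy endpoint (assumes a plain ip:port response)."""
        with urllib.request.urlopen(PROXY_SERVICE, timeout=10) as resp:
            return resp.read().decode().strip()

    def fetch_via_proxy(url):
        proxy = "http://" + get_proxy()
        # Route both http and https traffic through the fetched proxy.
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        return opener.open(req, timeout=20).read()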