
[Proxies required for testing] Google Search Scraper

Discussion in 'Black Hat SEO Tools' started by matessim, Mar 24, 2014.

  1. matessim

    matessim Junior Member

    Joined:
    Nov 22, 2008
    Messages:
    164
    Likes Received:
    72
    Occupation:
    Being funny and kind to puppies
    Location:
    UT 2003
    Hi Guys,

    I've written a scraper that can scrape Google fairly well and fairly fast, without getting blocked until it has made thousands or tens of thousands of requests in a rather short amount of time.

    I'll be glad to release this to you guys and build it to do whatever anyone here is interested in (it can scrape the URLs, descriptions, and titles at the moment, with possibly more features such as the cached text of the page and preview icons for videos if anyone is interested).

    I've hit the problem that I simply can't test it with multiple threads and multiple proxies, since I can't seem to find any that work / haven't already been blocked by Google; they're being hammered by Scrapebox/XRumer while I'm hoping to use them. If anyone can help with highly anonymous SOCKS5 (HTTPS should be fine too) proxies, I'll be glad to continue developing this.

    Here's a small taster:
    [attached screenshot]

    Note: this is not via the Google API or anything; it's actually a headless browser (with JS evaluation, since Google won't load properly otherwise) that's visiting and parsing the page.
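    Since the whole blocker right now is finding proxies that aren't already burned, here's the kind of pre-check I'd run on candidates before feeding them to the scraper. This is only a rough sketch and not part of google_search.py: it assumes the requests library with SOCKS support (requests[socks]/PySocks) is installed, and the "unusual traffic" string match is just a naive heuristic for Google's block page.

    Code:
    import concurrent.futures
    import requests

    TEST_URL = "https://www.google.com/search?q=test"
    HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko Firefox/11.0"}

    def proxy_alive(proxy):
        """Return True if a Google SERP loads through this proxy without an obvious block page."""
        proxies = {"http": proxy, "https": proxy}  # e.g. "socks5://1.2.3.4:1080"
        try:
            r = requests.get(TEST_URL, headers=HEADERS, proxies=proxies, timeout=10)
        except requests.RequestException:
            return False
        # Naive heuristic: HTTP 200 without the "unusual traffic" interstitial counts as usable
        return r.status_code == 200 and "unusual traffic" not in r.text.lower()

    def filter_proxies(candidates, workers=20):
        """Check candidate proxies in parallel and keep only the ones that still respond."""
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            ok_flags = list(pool.map(proxy_alive, candidates))
        return [p for p, ok in zip(candidates, ok_flags) if ok]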
     

    Attached Files:

    Last edited: Mar 24, 2014
  2. matessim

    matessim Junior Member

    Joined:
    Nov 22, 2008
    Messages:
    164
    Likes Received:
    72
    Occupation:
    Being funny and kind to puppies
    Location:
    UT 2003
    Anyone want a Google scraper? No demand for this, I take it?
     
  3. 808080Hat

    808080Hat Registered Member

    Joined:
    Feb 1, 2014
    Messages:
    65
    Likes Received:
    24
    Occupation:
    Freelance Software Developer/Architect
    Location:
    Berlin, Germany
    I've got no proxies for you, sorry, but I have some questions:

    1.) What headless browser are you using? PhantomJS by any chance?
    2.) I've built a Google scraper that simply sends out a fake, random user agent, and it works (tried it on the first 100 results only) for around 50 keywords at a time. Where do you think the limitations are?
    3.) What did you do besides using a headless, JS-enabled browser?
    4.) Is your scraper written in python?

    Thanks
     
  4. matessim

    matessim Junior Member

    Joined:
    Nov 22, 2008
    Messages:
    164
    Likes Received:
    72
    Occupation:
    Being funny and kind to puppies
    Location:
    UT 2003
    1. No, Ghost.py, though it's a fork of the trunk I've been working on for a while that has quite a few serious speed-ups.
    2. Really? I tried that and Google wouldn't render for me without JS support. Kudos!
    3. I've implemented a few extra features such as retrieving cached pages, and made it multi-threaded and work-queue based (doing it with stock Ghost.py would chew through a huge amount of memory, since it allocates a lot on every instance creation); I'll implement any other features anyone needs. There's a rough sketch of the work-queue layout below.
    4. google_search.py speaks for itself, I believe.
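    Here's roughly what I mean by work-queue based, as a stripped-down sketch (Python stdlib only; fetch_serp is just a placeholder for the real Ghost.py session call, which I'm leaving out of this snippet):

    Code:
    import queue
    import threading

    def fetch_serp(keyword, proxy):
        """Placeholder: the real version drives a headless Ghost.py session through the proxy."""
        raise NotImplementedError

    def worker(jobs, results, proxy):
        # Each worker keeps one browser session/proxy for its whole lifetime instead of
        # creating a new one per keyword; that's what keeps Ghost.py's memory usage sane.
        while True:
            keyword = jobs.get()
            if keyword is None:          # sentinel: no more work for this worker
                jobs.task_done()
                break
            try:
                results.append((keyword, fetch_serp(keyword, proxy)))
            except Exception:
                results.append((keyword, None))  # record the failure and keep going
            finally:
                jobs.task_done()

    def scrape(keywords, proxies):
        jobs, results = queue.Queue(), []
        threads = [threading.Thread(target=worker, args=(jobs, results, p)) for p in proxies]
        for t in threads:
            t.start()
        for kw in keywords:
            jobs.put(kw)
        for _ in threads:                # one sentinel per worker so they all shut down
            jobs.put(None)
        jobs.join()
        return results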
     
    • Thanks Thanks x 1
    Last edited: Mar 28, 2014
  5. 808080Hat

    808080Hat Registered Member

    Joined:
    Feb 1, 2014
    Messages:
    65
    Likes Received:
    24
    Occupation:
    Freelance Software Developer/Architect
    Location:
    Berlin, Germany
    Thought so too, but then I took a look at the source of the SEO Rank Reporter plugin for WordPress and they do it the same way. I've implemented the same thing in Ruby on a really bare-bones HTTP connector and it works. I've never really tried many keywords, but I was somewhat astonished that I didn't get banned after sending 50+ searches per second. Plus this works for many WordPress installations. Maybe Google is turning a blind eye because of the user agents?

    I think I was a little thrown off by the ActiveMQ, JavaEE stuff.

    Here is the code to my version of the rank checker in Ruby:

    Code:
    require 'nokogiri'
    require 'open-uri'
    
    module EightyHatSeo
      class RankChecker
        def self.checkRank(country, keyword, url)
          # Encode the keyword so spaces and special characters don't break the query string
          search_url = "https://www.google.#{country}/search?q=#{URI.encode_www_form_component(keyword)}&num=100&pws=0"
          user_agent = ['Mozilla/5.0 (Windows; U; Windows NT 6.1; pl; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3',
                        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)',
                        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)',
                        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
                        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; en-GB)',
                        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; MS-RTC LM 8)',
                        '(Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0))',
                        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.0',
                        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.50',
                        'Opera/9.20 (Windows NT 6.0; U; en)',
                        'Opera/9.30 (Nintendo Wii; U; ; 2047-7;en)',
                        'Opera 9.4 (Windows NT 6.1; U; en)',
                        'Opera/9.99 (Windows NT 5.1; U; pl) Presto/9.9.9',
                        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/6.0',
                        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; de-de) AppleWebKit/522.11.1 (KHTML, like Gecko) Version/3.0.3 Safari/522.12.1',
                        'Mozilla/5.0 (Windows; U; Windows NT 5.1; fr-FR) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15',
                        'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15',
                        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_5_2; en-gb) AppleWebKit/526+ (KHTML, like Gecko) Version/3.1 iPhone',
                        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_5; en-us) AppleWebKit/525.25 (KHTML, like Gecko) Version/3.2 Safari/525.25',
                        'Mozilla/5.0 (Windows; U; Windows NT 6.0; ru-RU) AppleWebKit/528.16 (KHTML, like Gecko) Version/4.0 Safari/528.16',
                        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_7; en-us) AppleWebKit/533.4 (KHTML, like Gecko) Version/4.1 Safari/533.4',
                        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko Firefox/11.0',
                        'Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))',
                        'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)',
                        'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; InfoPath.1; SV1; .NET CLR 3.8.36217; WOW64; en-US)',
                        'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11',
                        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11',
                        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24']
    
          # Fetch the SERP with a random user agent; return nil on any error so the check below works
          doc = Nokogiri::HTML(open(search_url, "User-Agent" => user_agent.sample)) rescue nil

          if doc
            # Organic result links in the 2014-era SERP markup live under '#search h3.r a'
            doc.css('#search h3.r a').each_with_index do |link, position|
              # Return the zero-based position of the first result pointing at the target URL
              return position if link['href'].include?(url)
            end
          end
          nil
        end
      end
    end
    
    The interesting part for me was the URL; I didn't know that Google had a &num=100 parameter.
     
    Last edited: Mar 28, 2014
  6. infoking1

    infoking1 Junior Member

    Joined:
    Sep 16, 2010
    Messages:
    182
    Likes Received:
    24
    Home Page:
    Yep, with the 100-results-per-page setting you can gather around 50 keywords without an IP ban; that's 100 * 50 = 5,000 results, though not every time. It also depends on the time interval between each query, so just adjust the time delay and you can extend how much you get. A simple throttled loop is sketched below.
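    For example, a dead-simple throttled loop looks something like this (a sketch in Python to match the scraper in this thread; fetch_serp just stands in for whatever request-and-parse code you use):

    Code:
    import time

    DELAY_SECONDS = 60  # tune this: a longer delay means fewer results per day but less ban risk

    def fetch_serp(keyword):
        """Stand-in for the actual request/parse code (100 results per page via &num=100)."""
        raise NotImplementedError

    def scrape_throttled(keywords):
        results = {}
        for keyword in keywords:
            results[keyword] = fetch_serp(keyword)
            time.sleep(DELAY_SECONDS)  # fixed pause between queries to stay under the radar
        return results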
     
  7. 808080Hat

    808080Hat Registered Member

    Joined:
    Feb 1, 2014
    Messages:
    65
    Likes Received:
    24
    Occupation:
    Freelance Software Developer/Architect
    Location:
    Berlin, Germany
    Do you think the randomized user agent makes any difference? What's the optimal delay? I've been experimenting with 5 minutes on a server. That worked OK; at that rate I can scan (1440 minutes per day / 5 minutes) * 50 = 14,400 keywords per day per IP.
     
  8. matessim

    matessim Junior Member

    Joined:
    Nov 22, 2008
    Messages:
    164
    Likes Received:
    72
    Occupation:
    Being funny and kind to puppies
    Location:
    UT 2003
    I would doubt the helpfulness of randomizing the user agent within the same IP address, though it's worth experimenting with.
     
    • Thanks Thanks x 1
  9. akacash

    akacash Jr. VIP Jr. VIP

    Joined:
    Jan 16, 2010
    Messages:
    805
    Likes Received:
    575
    Location:
    The Beach, USA
    Hey, I have a general question for you guys. Are you searching/scraping over http or https? Or actually, what host are you using to scrape from? I can see the Ruby rank checker there is using https, but I'm curious whether that's the same thing you're using for scraping.
     
  10. 808080Hat

    808080Hat Registered Member

    Joined:
    Feb 1, 2014
    Messages:
    65
    Likes Received:
    24
    Occupation:
    Freelance Software Developer/Architect
    Location:
    Berlin, Germany
    I've been experimenting a little bit with http and got similar results; https was just the last one I used. Since my script isn't optimized for performance, I didn't give the http-vs-https question that much thought.
     
  11. akacash

    akacash Jr. VIP Jr. VIP

    Joined:
    Jan 16, 2010
    Messages:
    805
    Likes Received:
    575
    Location:
    The Beach, USA
    The reason I ask is that I found about 25,000 proxies that'll work against the regular http protocol for link scraping, and I believe for page rank checking as well, although I'm not as sure about that. Shoot me a PM, either one of you, and I'll send a few over for you to test.
     
  12. 808080Hat

    808080Hat Registered Member

    Joined:
    Feb 1, 2014
    Messages:
    65
    Likes Received:
    24
    Occupation:
    Freelance Software Developer/Architect
    Location:
    Berlin, Germany
    Thanks Aka, but if I need proxies I'll use the web service URL of the Firefox Stealthy extension. They're usually of very bad quality, but they're so easy to grab and you can choose the country.

    Code:
    http://rcp.stealthy.co/GetProxy?countryCode=DE
    
    Parameter is optional.

    Btw, does anybody else know a similar web service for getting proxies without much complication?
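    For the Python side of this thread, pulling a proxy from that endpoint and routing a request through it would look roughly like the sketch below. Note I'm assuming the service returns a bare ip:port string in the response body, which you'd want to verify first.

    Code:
    import urllib.request

    PROXY_SERVICE = "http://rcp.stealthy.co/GetProxy?countryCode=DE"  # countryCode is optional

    def get_proxy():
        """Fetch one proxy from the Stealthy endpoint (assumes a plain ip:port response)."""
        with urllib.request.urlopen(PROXY_SERVICE, timeout=10) as resp:
            return resp.read().decode().strip()

    def fetch_via_proxy(url):
        proxy = "http://" + get_proxy()
        # Route both http and https traffic through the fetched proxy.
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        return opener.open(req, timeout=20).read()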