Scraping Yellow Pages returns gibberish

Discussion in 'General Programming Chat' started by Nick1, Jul 5, 2012.

  1. Nick1

    Nick1 Junior Member

    Joined:
    Oct 16, 2009
    Messages:
    196
    Likes Received:
    45
     
    Last edited by a moderator: Jul 5, 2012
  2. Nick1

    Nick1 Junior Member

    Joined:
    Oct 16, 2009
    Messages:
    196
    Likes Received:
    45
I just use the common Ruby ways to access the page, such as open-uri and Net::HTTP. I really have no clue as to what is causing this.

    EDIT
    The mess above was prompted by this simple segment:

    Code:
    #!/usr/bin/ruby
    # Fetch a Yellow Pages results page, following redirects.

    require 'rubygems'
    require 'nokogiri'
    require 'net/http'
    require 'uri'
    require 'open-uri'

    # Fetch a link, following up to `limit` redirects.
    def fetch(uri_str, limit = 10)
      # You should choose a better exception class than ArgumentError.
      raise ArgumentError, 'HTTP redirect too deep' if limit == 0

      # Request headers (note the header name is spelled 'Referer' in HTTP).
      headers = {
        'Referer' => 'http://www.yellowpages.com',
        'User-Agent' => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/533.1 (KHTML, like Gecko) Chrome/14.0.835.187 Safari/535.1',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language' => 'en-us,en;q=0.5',
        'Accept-Encoding' => 'gzip, deflate',
        'Connection' => 'keep-alive'
      }

      url = URI.parse(URI.encode(uri_str.strip))
      puts url

      # Build the GET request; request_uri keeps the query string, unlike path.
      req = Net::HTTP::Get.new(url.request_uri, headers)
      # Open the TCP connection and send the request.
      response = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }

      case response
      when Net::HTTPSuccess
        puts "this is location " + uri_str
        puts "this is the host #{url.host}"
        puts "this is the path #{url.path}"
        response
      when Net::HTTPRedirection
        # On a 3xx response, follow the Location header.
        puts "this is redirect " + response['location']
        fetch(response['location'], limit - 1)
      else
        response.error!
      end
    end

    # `location`, `query`, and `i` are defined elsewhere in the script.
    html = fetch("http://www.yellowpages.com/g=#{location}&q=#{query}&page=#{i}/")

    aFile = File.new("#{query}#{location}#{i}", "w")
    aFile.write(html.body)
    aFile.close

    puts html.body
                    
    
    PS: Apologies about the mess. I'll keep it in mind in the future.
     
    Last edited: Jul 5, 2012
  3. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,814
    Likes Received:
    12,462
    Occupation:
    Potentate
    Location:
    Asuncion
    And we have no idea how you're trying to scrape it :D

    Edit: Oops, you just posted it, thanks

    Are you taking care of possible gzip compression in the server response?
     
    • Thanks Thanks x 1
    Last edited: Jul 5, 2012
  4. Nick1

    Nick1 Junior Member

    Joined:
    Oct 16, 2009
    Messages:
    196
    Likes Received:
    45
    No, this never crossed my mind actually. I'll look more into this.
     
  5. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,814
    Likes Received:
    12,462
    Occupation:
    Potentate
    Location:
    Asuncion
    In your code, you have this:
    Code:
     'Accept-Encoding' => 'gzip, deflate',
    
    This means you are specifically asking the server to send you the response compressed. I don't know Ruby, so I can't say whether it takes care of the decompression automatically or you have to instruct it to.

    Note that removing it does not guarantee you'll get a non-compressed response, as some servers always send compressed output.
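
    For reference, Ruby's standard library can handle the decompression. A minimal sketch (the `decompress_body` helper name is illustrative, not from this thread); it picks the decompression method based on the Content-Encoding header the server sets:

    Code:
    require 'zlib'
    require 'stringio'

    # Decompress an HTTP response body according to its
    # Content-Encoding header ('gzip', 'deflate', or nil for plain).
    def decompress_body(body, encoding)
      case encoding
      when 'gzip'
        Zlib::GzipReader.new(StringIO.new(body)).read
      when 'deflate'
        Zlib::Inflate.inflate(body)
      else
        body
      end
    end

    After a Net::HTTP request you would call it as `decompress_body(response.body, response['Content-Encoding'])`.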
     
    • Thanks Thanks x 1
  6. Nick1

    Nick1 Junior Member

    Joined:
    Oct 16, 2009
    Messages:
    196
    Likes Received:
    45
    OK, thanks a lot jazzc, this seems to have been the issue. :)
     
    Last edited: Jul 5, 2012
  7. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    347
    Likes Received:
    195
    Actually, you should always use gzip. The compressed page is 5-10x smaller, which is a huge performance boost and saves bandwidth.
    I don't know Ruby; however, this thread seems to be helpful:
    http://stackoverflow.com/questions/1361892/how-to-decompress-gzip-string-in-ruby
    Another thing: you have to prepare a huge number of proxies. Yellow Pages has anti-bot protection and will ban your IP after 1,000-2,000 queries.
    I used to use Google Translate as a proxy; most websites won't ban that IP, but Google will ban you after about 4,000 queries. The HTML returned from the translate tool has the original text in it, so it's a little bit harder to parse, but the structure, ids, and classes are preserved.
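
    A minimal sketch of the proxy-rotation idea in Ruby (the proxy addresses are placeholders you would replace with your own; Net::HTTP supports proxies natively by taking the proxy host and port as extra arguments to `Net::HTTP.new`):

    Code:
    require 'net/http'
    require 'uri'

    # Pick the proxy for the nth request, round-robin, so queries are
    # spread evenly and no single IP hits the ban threshold.
    def proxy_for(proxies, n)
      proxies[n % proxies.size]
    end

    # Fetch a URL through the nth proxy in the list.
    def fetch_via_proxy(uri_str, proxies, n)
      url = URI.parse(uri_str)
      proxy_host, proxy_port = proxy_for(proxies, n)
      http = Net::HTTP.new(url.host, url.port, proxy_host, proxy_port)
      http.request(Net::HTTP::Get.new(url.request_uri))
    end

    # Placeholder proxy list -- substitute real, working proxies.
    PROXIES = [['203.0.113.1', 8080], ['203.0.113.2', 3128]]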
     
    • Thanks Thanks x 1
    Last edited: Jul 5, 2012