
Scraping yellow pages returns gibberish

Discussion in 'General Programming Chat' started by Nick1, Jul 5, 2012.

  1. Nick1

    Nick1 Junior Member

    Joined:
    Oct 16, 2009
    Messages:
    196
    Likes Received:
    45
    I am trying to scrape the yellow pages, but instead of the HTML it should return I am getting complete gibberish.

    I have no idea why it is doing this.
     
    Last edited by a moderator: Jul 5, 2012
  2. Nick1

    Nick1 Junior Member

    Joined:
    Oct 16, 2009
    Messages:
    196
    Likes Received:
    45
    I just use the common Ruby ways to access the page, such as open-uri and net/http. I really have no clue as to what is causing this.

    EDIT
    The mess above was produced by this simple segment:

    Code:
    #!/usr/bin/ruby
    # fetch a Yellow Pages result page, following redirects, and save it

    require 'rubygems'
    require 'nokogiri'
    require 'net/http'
    require 'uri'
    require 'open-uri'

    # fetch a link, following up to `limit` redirects
    def fetch(uri_str, limit = 10)
      # raise rather than loop forever on a redirect chain
      raise ArgumentError, 'HTTP redirect too deep' if limit == 0

      # request headers (note: the header name is Referer, not Referrer)
      headers = {
        'Referer'         => 'http://www.yellowpages.com',
        'User-Agent'      => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/533.1 (KHTML, like Gecko) Chrome/14.0.835.187 Safari/535.1',
        'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language' => 'en-us,en;q=0.5',
        'Accept-Encoding' => 'gzip, deflate',
        'Connection'      => 'keep-alive'
      }

      url = URI.parse(URI.encode(uri_str.strip))
      puts url

      # request_uri keeps the query string; url.path alone would drop it
      req = Net::HTTP::Get.new(url.request_uri, headers)
      # open the TCP connection and issue the request
      response = Net::HTTP.start(url.host, url.port) { |http|
        http.request(req)
      }

      case response
      when Net::HTTPSuccess
        # print the final location details
        puts "this is location " + uri_str
        puts "this is the host #{url.host}"
        puts "this is the path #{url.path}"
        return response
      when Net::HTTPRedirection
        # got a 3xx response: follow the Location header, decrementing the limit
        puts "this is redirect " + response['location']
        return fetch(response['location'], limit - 1)
      else
        response.error!
      end
    end

    # location, query and i are set earlier in the script
    html = fetch("http://www.yellowpages.com/g=#{location}&q=#{query}&page=#{i}/")

    aFile = File.new("#{query}#{location}#{i}", "w")
    aFile.write(html.body)
    aFile.close

    puts html.body

    PS: Apologies about the mess. I'll keep it in mind in the future.
     
    Last edited: Jul 5, 2012
  3. jazzc

    jazzc Moderator Staff Member Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,468
    Likes Received:
    10,143
    And we have no idea how you're trying to scrape it :D

    Edit: Oops, you just posted it, thanks.

    Are you taking care of possible gzip compression of the server response?
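    A quick way to check: print the Content-Encoding header of the response you are getting back (a sketch; I don't know ruby, so verify it):
    Code:
    # if this prints "gzip", the body needs decompressing before parsing
    puts response['content-encoding']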
     
    • Thanks Thanks x 1
    Last edited: Jul 5, 2012
  4. Nick1

    Nick1 Junior Member

    Joined:
    Oct 16, 2009
    Messages:
    196
    Likes Received:
    45
    No, this never crossed my mind, actually. I'll look more into it.
     
  5. jazzc

    jazzc Moderator Staff Member Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,468
    Likes Received:
    10,143
    In your code, you have this:
    Code:
     'Accept-Encoding' => 'gzip, deflate',
    
    This means you are specifically asking the server to send you the response compressed. I do not know ruby, so I don't know if it automatically takes care of the decompression, or if you have to instruct it to.

    Note that removing it does not guarantee you'll get a non-compressed response, as some servers always send compressed output.
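    From a quick look at the Zlib docs, the decompression step would be something like this (an untested sketch, since I don't know ruby; Zlib and StringIO are in the standard library):
    Code:
    require 'zlib'
    require 'stringio'

    # sketch: inflate the body according to the Content-Encoding header
    def decode_body(response)
      case response['content-encoding']
      when 'gzip'
        Zlib::GzipReader.new(StringIO.new(response.body)).read
      when 'deflate'
        # handles the zlib-wrapped form; some servers send raw deflate instead
        Zlib::Inflate.inflate(response.body)
      else
        response.body
      end
    end

    Then write decode_body(html) to the file instead of html.body.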
     
    • Thanks Thanks x 1
  6. Nick1

    Nick1 Junior Member

    Joined:
    Oct 16, 2009
    Messages:
    196
    Likes Received:
    45
    Ok, thanks a lot jazzc, this seems to have been the issue. :)
     
    Last edited: Jul 5, 2012
  7. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    Actually, you should always use gzip. The compressed page is 5-10x smaller, which is a huge performance boost and saves bandwidth.
    I don't know ruby, but this thread seems helpful:
    http://stackoverflow.com/questions/1361892/how-to-decompress-gzip-string-in-ruby
    Another thing: you have to prepare a huge number of proxies. Yellow pages has anti-bot protection and will ban your IP after 1000-2000 queries.
    I used to use google translate as a proxy; most websites won't ban that IP, but google will ban you after about 4000 queries. The html returned from the translate tool has the original text in it and is a little harder to parse, but the structure, ids and classes are preserved.
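    In code the idea would be roughly this (an untested sketch; the translate URL parameters are from memory, so double-check them):
    Code:
    require 'open-uri'
    require 'cgi'

    # route the request through Google Translate so the target site
    # sees Google's IP instead of yours
    # NOTE: the translate URL format here is from memory -- verify it
    target  = "http://www.yellowpages.com/"
    proxied = "http://translate.google.com/translate?sl=auto&tl=en&u=#{CGI.escape(target)}"

    html = open(proxied, "User-Agent" => "Mozilla/5.0").read
    # the returned page keeps the original ids and classes, so the same
    # Nokogiri selectors should still work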
     
    • Thanks Thanks x 1
    Last edited: Jul 5, 2012