Python Encoding

Discussion in 'General Programming Chat' started by NetCrime, Jan 15, 2015.

  1. NetCrime

    NetCrime Regular Member

    Location:
    Lithuania
    I'm trying to print scraped data from a website that uses Latin letters with diacritics such as ĄČĘĖĮŠŲŪ.

    My code:
    Code:
    # -*- coding: utf-8 -*-
    
    import requests
    from bs4 import BeautifulSoup
    import codecs
    import sys
    
    
    def autoplus():
        url = "http://auto.plius.lt/skelbimai/krovininis-transportas/sunkvezimiai?make_date_from=1989&make_date_to=1997&make_id=4169"
        r = requests.get(url)
        plain_text = r.content
        soup = BeautifulSoup(plain_text)
        for link in soup.find_all('h2', {'class': 'title-list'}):
            print(link.a.text.encode(sys.stdout.encoding, 'replace'))
              
    autoplus()
    
    
    
    Output:

    Code:
    b'Mercedes-Benz, 308, savivar?iai'
    b'Mercedes-Benz, 609, kieta\x9aoniai'
    b'Mercedes-Benz, 308, bortiniai'
    b'Mercedes-Benz, 609, \x8aaldytuvai'
    b'Mercedes-Benz, 609, \x8aaldytuvai'
    b'Mercedes-Benz, 1114, va\x9eiuokl?s'
    b'Mercedes-Benz, 3344AK, savivar?iai'
    b'Mercedes-Benz, 814, va\x9eiuokl?s'
    b'Mercedes-Benz, 308, \x8aaldytuvai'
    b'Mercedes-Benz, 609, dviguba kabina'
    b'Mercedes-Benz, 609, \x8aaldytuvai'
    
    How do I make Python print the actual letters instead of these escaped bytes?
     
  2. xNotch

    xNotch Registered Member

    What are the letters actually for? Are they needed, or could you just delete them? If you need them, maybe try changing the default character encoding (i.e. it currently seems to be set to utf-8).
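
    To see what the console encoding does to the text, here is a minimal sketch. It assumes the terminal reported cp1252 as sys.stdout.encoding, which is only a guess based on the \x9a, \x8a and \x9e bytes in the output above. The helper name encode_for_console and the sample words are made up for illustration.

```python
def encode_for_console(text, encoding):
    """Encode text the way the original script does, replacing unmappable characters."""
    return text.encode(encoding, "replace")

# Sample words reconstructed from the output in the first post (assumed spellings).
# cp1252 has a byte for "š", so it survives as a single escaped byte.
print(encode_for_console("kietašoniai", "cp1252"))

# cp1252 has no byte for "č", so the 'replace' handler turns it into "?".
print(encode_for_console("savivarčiai", "cp1252"))

# UTF-8 can represent every character, so nothing is lost and decoding restores the text.
print(encode_for_console("savivarčiai", "utf-8").decode("utf-8"))
```

    In Python 3, print() on a bytes object always shows the b'...' form, so the escapes appear whichever encoding is chosen. Printing the str itself avoids that, provided the terminal can display the characters.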
     
  3. MrBlue

    MrBlue Senior Member

    This should fix your issue:
    Code:
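    # -*- coding: utf-8 -*-
    
    import sys
    
    import requests
    from bs4 import BeautifulSoup
    
    
    def autoplus():
        url = "http://auto.plius.lt/skelbimai/krovininis-transportas/sunkvezimiai?make_date_from=1989&make_date_to=1997&make_id=4169"
        r = requests.get(url)
        # Pass the raw bytes so BeautifulSoup can detect the page encoding itself.
        soup = BeautifulSoup(r.content, 'html.parser')
        for link in soup.find_all('h2', {'class': 'title-list'}):
            # Write UTF-8 bytes straight to stdout; in Python 3, print() on bytes
            # would show the b'...' form with escapes instead of the letters.
            sys.stdout.buffer.write(link.a.text.encode('utf8') + b'\n')
    
    autoplus()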
    
    Now you can just output the results directly to a text file like this:
    Code:
    $ python script.py > output.txt
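
    If shell redirection still produces the wrong characters, writing the file from Python with an explicit encoding is another option. This is only a sketch: save_titles is a hypothetical helper, and the titles list stands in for the scraped link.a.text values.

```python
import os
import tempfile


def save_titles(titles, path):
    """Write one title per line as UTF-8, independent of the console encoding."""
    with open(path, "w", encoding="utf-8") as f:
        for title in titles:
            f.write(title + "\n")


# Sample values standing in for the scraped strings (assumed spellings).
titles = [
    "Mercedes-Benz, 308, savivarčiai",
    "Mercedes-Benz, 609, šaldytuvai",
]

# The temp directory is used only so the example runs anywhere.
path = os.path.join(tempfile.gettempdir(), "output.txt")
save_titles(titles, path)
```

    Open the resulting file in an editor set to UTF-8 to see the letters.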
     
  4. NetCrime

    NetCrime Regular Member

    Actually, it was my Windows terminal that could not print the correct characters. I installed PyCharm and everything fixed itself.
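
    For anyone who wants to stay in a plain terminal, one possible approach is to switch stdout to UTF-8 from inside the script. This sketch assumes Python 3.7 or newer, where sys.stdout.reconfigure() is available. The helper name use_utf8_stdout is made up, and the terminal's font and code page still have to be able to display the characters.

```python
import sys


def use_utf8_stdout():
    """Switch stdout to UTF-8 where supported and return the resulting encoding."""
    # reconfigure() exists on the standard text stream in Python 3.7+;
    # replaced or captured streams may not have it, so check first.
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8", errors="replace")
    return sys.stdout.encoding


use_utf8_stdout()
# Print the str itself, not encoded bytes, so the letters are shown as text.
print("Mercedes-Benz, 609, šaldytuvai")
```

    Setting the PYTHONIOENCODING environment variable to utf-8 before starting Python has a similar effect without changing the script.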