UTF-8 encoding and getting a string slice

I'm parsing Twitter, and I need to encode the text because without encoding I get an exception. But when I use 'utf-8', it not only adds a b prefix to the console output, it also makes it impossible to access individual characters of the string. What can I do to fix this, or what other encoding should I try?

Here is an example of what happens.

>>> a="newyear"
>>> b=a.encode("utf-8")
>>> a
'newyear'
>>> b
b'newyear'
>>> a[0]
'n'
>>> b[0]
110

My parser code is the following:

tweets = soup.find_all("p", {"class": "TweetTextSize"})

for n, tweet in enumerate(tweets, start=1):
    print(n)
    a = tweet.text
    b = a.encode("utf-8")
    print(b)  # works, but prints a bytes object with a b prefix,
    # and b[0] is an integer rather than a character
    print(b.decode("utf-8"))  # doesn't work:
    # UnicodeEncodeError: 'charmap' codec can't encode character '\u2026'

    # The commented-out try block below works, but it replaces "bad" tweets
    # with "OPS", which I'd rather avoid.
    # try:
    #     print(tweet.text)
    # except:
    #     print("OPS")

So I can handle the exception with try, but I was wondering if there is some other way.

I'm using Python 3.

You are confusing when to encode and when to decode.

If you have a bytestring, you decode it into a Unicode string (str in Python 3):

a = b"a string"
b = a.decode("utf8")
# b is now a Unicode string (str)

If you have a Unicode string, you encode it into a bytestring:

a = "\u00b0C"
b = a.encode("utf8")
# b is now an encoded byte string (bytes)
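
To make the two directions concrete, here is a small sketch of a round trip in Python 3. The sample text is made up for illustration; it reuses the question's "newyear" plus the '\u2026' character from the error message. It also shows why b[0] in the question returned 110: indexing a bytes object yields an integer, while slicing it yields bytes.

```python
text = "newyear\u2026"          # str: a sequence of Unicode code points
data = text.encode("utf-8")     # bytes: the UTF-8 encoding of that text

first_byte = data[0]            # indexing bytes gives an int (the byte value of "n")
first_slice = data[:3]          # slicing bytes gives a bytes object
first_char = text[0]            # indexing str gives a one-character str

round_trip = data.decode("utf-8")  # decoding turns the bytes back into str
print(first_byte, first_slice, first_char, round_trip == text)
```

So if the goal is to slice characters, slice the str (before encoding or after decoding), not the bytes.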

I suspect you are getting a bytestring back from Twitter, so you probably need:

b = a.decode("utf8")
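
If the text turns out to be a str already, the 'charmap' error quoted in the question more likely comes from print() encoding the text for a console whose encoding lacks '\u2026'. The following is only a sketch under that assumption, not a definitive fix: console_safe is a hypothetical helper name, and it replaces any character the console encoding cannot represent with '?' so the rest of the tweet still prints.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'.

    console_safe is a hypothetical helper for illustration, not a library function.
    """
    # Fall back to UTF-8 if the output stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # Encode with errors="replace", then decode back so the result is still a str.
    return text.encode(encoding, errors="replace").decode(encoding)

tweet_text = "Happy new year\u2026"   # made-up sample text
print(console_safe(tweet_text))
```

Because the result is a str, it can still be indexed and sliced by character, unlike the bytes returned by encode() alone.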
