UTF-8 encoding and getting a string slice

Question

I'm parsing Twitter and need to encode the text, because without encoding I get an exception. But when I use 'utf-8', it not only adds a b prefix to the console output but also makes it impossible to access parts of the string. What can I do to fix this, or what other encoding should I try?

Here is an example of what happens.

>>> a="newyear"
>>> b=a.encode("utf-8")
>>> a
'newyear'
>>> b
b'newyear'
>>> a[0]
'n'
>>> b[0]
110

My parser code is the following:

tweets = soup.findAll("p", {"class": "TweetTextSize"})

n = 0
for tweet in tweets:
    n += 1
    print(n)
    a = tweet.text
    b = a.encode("utf-8")
    print(b)  # works fine, but returns a bytestring with an extra b prefix,
    # and I can't get b[0]
    print(b.decode("utf-8"))  # doesn't work:
    # UnicodeEncodeError: 'charmap' codec can't encode character '\u2026'

    # the try section below works when uncommented, but it replaces "bad"
    # tweets with "OPS", which I'd rather avoid
    # try:
    #     print(tweet.text)
    # except:
    #     print("OPS")

So I can handle the exception with try, but I was wondering if there is some other way.

I'm using Python 3.

Answer 1

You are confusing when to encode and when to decode.

If you have a bytestring, you decode it into Unicode (a str in Python 3):

a = b"a string"
b = a.decode('utf8')
# b is now Unicode (str)

If you have Unicode, you encode it into a bytestring:

a = u"\u00b0C"
b = a.encode('utf8')
# b is now an encoded bytestring (bytes)
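To connect this to the slicing problem in the question, here is a minimal sketch of the round trip in Python 3. The sample text is made up for illustration and is not taken from a real tweet. It shows that indexing bytes yields integers, while the decoded str can be indexed and sliced as text.

```python
# Illustrative value only: a stand-in for tweet text ending in an ellipsis.
text = "newyear\u2026"

data = text.encode("utf-8")       # str -> bytes
print(type(data).__name__)        # bytes
print(data[0])                    # 110: indexing bytes yields an int
print(data[0:1])                  # b'n': slicing bytes yields bytes

restored = data.decode("utf-8")   # bytes -> str
print(restored[0])                # n
print(restored[:3])               # new
print(restored == text)           # True
```

So if the goal is to take parts of the text, slice the str and leave the bytes form for writing to files or sockets.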

I suspect you are getting a bytestring back from Twitter; if so, you probably need:

b = a.decode('utf8')
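The UnicodeEncodeError quoted in the question names the 'charmap' codec, which suggests the failure may happen when print writes to a console whose encoding lacks the character '\u2026', rather than in encode or decode themselves. The following is only a sketch under that assumption: console_safe is a hypothetical helper, and cp437 is just an example of a console encoding without that character.

```python
import sys


def console_safe(text, encoding=None, errors="replace"):
    """Return text with characters the target encoding cannot represent substituted."""
    # Fall back to the current stdout encoding, then to UTF-8 (assumed defaults).
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors=errors).decode(encoding)


tweet_text = "newyear\u2026"  # made-up sample, not a real tweet

# Unsupported characters become '?' with errors="replace".
print(console_safe(tweet_text, encoding="cp437"))

# With errors="backslashreplace" they become a visible escape sequence instead.
print(console_safe(tweet_text, encoding="cp437", errors="backslashreplace"))

# Without an explicit encoding, the helper targets whatever the console uses.
print(console_safe(tweet_text))
```

The result is still a str, so it can be sliced normally, and tweets with unusual characters are printed with substitutions instead of being skipped.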

solution1
1 2016-08-05 23:35:14
