I'm parsing Twitter and need to encode the text, because without encoding I get an exception. But when I use 'utf-8', it not only adds a b prefix to the console output, it also makes it impossible to access individual characters of the string. What can I do to fix this, or what other encoding should I try?
Here is an example of what happens.
>>> a="newyear"
>>> b=a.encode("utf-8")
>>> a
'newyear'
>>> b
b'newyear'
>>> a[0]
'n'
>>> b[0]
110
My parser code is the following:
tweets=soup.findAll("p", {"class":"TweetTextSize"})
n=0
for tweet in tweets:
    n+=1
    print(n)
    a=tweet.text
    b=a.encode("utf-8")
    print(b) #works fine, but returns a bytestring with an extra b prefix,
             #and I can't get b[0]
    print(b.decode("utf-8")) #doesn't work -
    #UnicodeEncodeError: 'charmap' codec can't encode character '\u2026'
    #the try section below works when uncommented, but it replaces "bad" tweets with OPS,
    #which I'd rather avoid
    # try:
    #     print(tweet.text)
    # except:
    #     print("OPS")
So I can handle the exception with try, but I was wondering if there is some other way.
I'm using Python 3.
You are confused about when to encode and when to decode.
If you have a bytestring, then you decode it into unicode:
a=b"a string"
b = a.decode('utf8')
#b is now unicode (a str in Python 3)
If you have unicode, you encode it to an encoded bytestring:
a=u"\u00b0C"
b = a.encode('utf8')
#b is now encoded as a byte string
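To make the two directions concrete, here is a minimal Python 3 sketch. The sample text and variable names are made up for illustration; the ellipsis character is the one from the error message in the question. It also shows why b[0] returned 110: indexing a bytes object gives the integer value of the byte, while slicing gives bytes.

```python
# Assumed sample text; "\u2026" is the ellipsis character from the question's error.
text = "newyear\u2026"

# str -> bytes: encode
data = text.encode("utf-8")

# bytes -> str: decode
restored = data.decode("utf-8")

first_char = text[0]      # indexing a str gives a one-character str
first_byte = data[0]      # indexing bytes gives an int (the byte value)
first_slice = data[0:1]   # slicing bytes gives bytes

print(restored == text)
print(first_char, first_byte, first_slice)
```

So if you want to pick out characters, index the str, not the encoded bytes.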
I suspect you are getting a bytestring back from Twitter, so you probably need:
b = a.decode('utf8')
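If the text turns out to be a str already, the 'charmap' error may instead come from print itself, when the console encoding cannot represent a character such as '\u2026'. That is an assumption about your setup, not something I can confirm from the question. A rough sketch of one workaround is below; console_safe is a hypothetical helper name, and it replaces only the unencodable characters instead of the whole tweet.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Hypothetical helper: use the given encoding, else the stdout encoding, else ASCII.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Round-trip through the target encoding; unencodable characters become "?".
    return text.encode(encoding, errors="replace").decode(encoding)

# ASCII is passed explicitly here only to force the replacement for the demo.
print(console_safe("Happy new year\u2026", encoding="ascii"))
```

In the parser loop this would be used as print(console_safe(tweet.text)), which keeps the rest of each tweet readable.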