
How to convert utf-8 encoding to a string?

I am trying to preprocess some tweet text. The text is in a CSV file that was scraped with tweepy. I am using Jupyter Notebook; suppose the text is stored in the variable 'p'. When I display it as cell output, it looks like this:

"b'@sarahbea34343 \\\\xf0\\\\x9f\\\\x98\\\\x94 I\\\\xe2\\\\x80\\\\x99m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf '"

If I call print(p) in Jupyter instead, the output is:

"b'@sarahbea34343 \\xf0\\x9f\\x98\\x94 I\\xe2\\x80\\x99m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf '"

I searched online and it seems that this is UTF-8 encoded text in bytes form. So I tried to decode it with ".decode('utf-8')", which raised an error. The problem, as far as I can tell, is that because the text was stored in a CSV file, the bytes representation was saved as a string, so the whole tweet is a string in which even the backslashes are literal characters. How can I convert it so that I can remove the emojis and the escape sequences of the other characters?

I have tried several things that just returned the same string, such as:

p.encode('ascii','ignore').decode('ascii')

or p.encode('latin-1').decode('utf-8').encode('ascii', 'ignore')
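For context, a string of this shape typically appears when a bytes object is written out as its textual representation instead of being decoded first. The sketch below assumes that is what happened during scraping (the source does not confirm it); the names raw, stored and roundtrip are illustrative. It also shows why the first attempt above is a no-op: every character in the stored text is already ASCII.

```python
# Assumption: the scraper saved the repr of a bytes object (the same text
# that str(bytes_obj) produces) rather than the decoded tweet.
raw = "😔 I’m".encode("utf-8")   # real UTF-8 bytes
stored = repr(raw)                # the text that ends up in the CSV file

print(stored)                     # b'\xf0\x9f\x98\x94 I\xe2\x80\x99m'

# The stored text contains only ASCII characters (backslash, x, hex digits),
# so dropping non-ASCII characters leaves it unchanged.
roundtrip = stored.encode("ascii", "ignore").decode("ascii")
print(roundtrip == stored)        # True
```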

If the text really has been stored like this (so you are reading the file in text mode 'r') you can do this:

# Strip leading b and inner quotes
s = "b'@sarahbea34343 \\xf0\\x9f\\x98\\x94 I\\xe2\\x80\\x99m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf'"[2:-1]

# Encode as latin-1 to get bytes, decode from unicode-escape to unescape 
# the byte expressions (\\xhh -> \xhh), encode as latin-1 again to get 
# bytes again, then finally decode as UTF-8.

new_s = s.encode('latin-1').decode('unicode-escape').encode('latin-1').decode('utf-8')
print(new_s)
@sarahbea34343 😔 I’m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf
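An alternative worth considering is to let Python parse the stored text as a bytes literal with ast.literal_eval, then decode the result. The sketch below is a minimal illustration under the assumption that each value is a complete b'...' literal; the helper names decode_stored_bytes and strip_non_ascii are hypothetical, and the sample tweet is shortened. It also covers the original goal of removing the emojis once the text is decoded.

```python
import ast

def decode_stored_bytes(text: str) -> str:
    """Turn the text "b'...'" (a bytes literal saved as a string) into real text."""
    value = ast.literal_eval(text)  # safely parses the literal into bytes
    if not isinstance(value, bytes):
        raise ValueError("expected a bytes literal such as b'...'")
    return value.decode("utf-8")

def strip_non_ascii(text: str) -> str:
    """Drop every non-ASCII character, including emojis and curly quotes."""
    return text.encode("ascii", "ignore").decode("ascii")

# Backslashes are doubled here so the string holds literal "\x.." sequences,
# as it would after being read from the CSV file.
p = "b'@sarahbea34343 \\xf0\\x9f\\x98\\x94 I\\xe2\\x80\\x99m not going in'"

decoded = decode_stored_bytes(p)
print(decoded)                   # @sarahbea34343 😔 I’m not going in
print(strip_non_ascii(decoded))  # @sarahbea34343  Im not going in
```

Unlike slicing with [2:-1], literal_eval also copes with values stored as b"..." or containing escaped quotes, but it raises an exception for rows that are not valid literals, so those would need separate handling.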
