How to convert utf-8 encoding to a string?
I was trying to preprocess some tweet text. The text is in a CSV file that was scraped with tweepy. I am using Jupyter Notebook; suppose the text is stored in a variable 'p'. When I display it as cell output, it looks like this:
"b'@sarahbea34343 \\\\xf0\\\\x9f\\\\x98\\\\x94 I\\\\xe2\\\\x80\\\\x99m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf '"
If I instead run print(p) in Jupyter, the output is:
"b'@sarahbea34343 \\xf0\\x9f\\x98\\x94 I\\xe2\\x80\\x99m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf '"
I searched online, and this looks like the text form of a UTF-8 encoded bytes object. So I tried to decode it with ".decode('utf-8')", which gave an error. The problem, as far as I can tell, is that because the tweet was stored in a CSV file, the bytes representation was saved as text, so the whole tweet, including the b'' wrapper, is now an ordinary string. That means even the backslashes are literal characters in the string. How do I convert it so that I can remove these emojis and the escaped UTF-8 bytes of other characters?
I have tried several things that just gave back the same string, such as:
p.encode('ascii','ignore').decode('ascii')
or p.encode('latin-1').decode('utf-8').encode('ascii', 'ignore')
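A small sketch of why attempts like these hand back the same text. The sample value of p is a shortened, hypothetical stand-in that assumes each escape is a literal backslash followed by xhh, as in the printed output:

```python
# Hypothetical sample: the backslashes are literal characters, as read from the CSV.
p = r"b'@sarahbea34343 \xf0\x9f\x98\x94 I\xe2\x80\x99m not going'"

print(type(p).__name__)   # p is a str, so it has no .decode() method
print(p.isascii())        # every character is already plain ASCII

# Dropping non-ASCII characters therefore has nothing to drop.
first_attempt = p.encode('ascii', 'ignore').decode('ascii')
print(first_attempt == p)

# The second chain ends in .encode(), so it yields bytes holding the same characters.
second_attempt = p.encode('latin-1').decode('utf-8').encode('ascii', 'ignore')
print(second_attempt == p.encode('ascii'))
```

Because the emoji only exists as ASCII escape text at this point, the escapes have to be turned back into real bytes before any decoding or filtering can have an effect.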
If the text really has been stored like this (so you are reading the file in text mode 'r'), you can do this:
# Strip leading b and inner quotes
s = "b'@sarahbea34343 \\xf0\\x9f\\x98\\x94 I\\xe2\\x80\\x99m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf'"[2:-1]
# Encode as latin-1 to get bytes, decode from unicode-escape to unescape
# the byte expressions (\\xhh -> \xhh), encode as latin-1 again to get
# bytes again, then finally decode as UTF-8.
new_s = s.encode('latin-1').decode('unicode-escape').encode('latin-1').decode('utf-8')
print(new_s)
@sarahbea34343 😔 I’m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf
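As an alternative sketch, not part of the approach above: since the stored text has the shape of a Python bytes literal, ast.literal_eval from the standard library can parse it back into bytes, which can then be decoded as UTF-8. The helper names decode_bytes_repr and strip_non_ascii are made up for illustration, and the sample assumes each escape is a single literal backslash followed by xhh:

```python
import ast

def decode_bytes_repr(text, encoding="utf-8"):
    """Turn the text form of a bytes literal (e.g. "b'...'") into a str."""
    value = ast.literal_eval(text)  # evaluates literals only, never arbitrary code
    if not isinstance(value, bytes):
        raise ValueError("expected the text of a bytes literal")
    return value.decode(encoding)

def strip_non_ascii(text):
    """Drop every non-ASCII character, such as emojis and curly quotes."""
    return text.encode("ascii", "ignore").decode("ascii")

# Raw string, so the backslashes stay literal, as they would when read from the CSV.
raw = r"b'@sarahbea34343 \xf0\x9f\x98\x94 I\xe2\x80\x99m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf'"

decoded = decode_bytes_repr(raw)
print(decoded)                   # the tweet with the emoji and apostrophe restored
print(strip_non_ascii(decoded))  # the same tweet with non-ASCII characters removed
```

Note that stripping non-ASCII characters also removes the typographic apostrophe in "I’m", so it may be worth replacing such characters explicitly before filtering. If the file actually contains doubled backslashes, the unescaping step would need to be applied once more.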