Python 3.6 Messy String with Unicode characters and Bytes

I am taking article titles from the Common Crawl news repo using NewsPlease, but the titles come back as a mixture of normally encoded characters and Unicode bytes, and I am unable to get them decoded correctly. Taking one of the titles:

x = articles[800].title

If I evaluate x in the Spyder console, it returns:

'Las 10 canciones m\\xc3\\xa1s populares de la semana'

When I use print(x) I get:

Las 10 canciones m\xc3\xa1s populares de la semana

But if I try to fix the encoding the way other posts suggest, using:

x.encode('latin1').decode('utf8')

It returns:

'Las 10 canciones m\\xc3\\xa1s populares de la semana'

Which is obviously not correct.

Does anyone have any suggestions? I am using Python 3.6, by the way.
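
A quick check (not part of the original post) suggests why the latin1/utf8 round trip changes nothing. The title does not contain the bytes 0xC3 0xA1; it contains the literal ASCII characters backslash, "x", "c", "3" and so on. A string made only of ASCII characters encodes and decodes to itself. The variable names below are illustrative.

```python
# The title as it appears in the question: literal backslashes, not bytes.
x = 'Las 10 canciones m\\xc3\\xa1s populares de la semana'

# '\\xc3' is four separate characters: backslash, 'x', 'c', '3'.
escape_length = len('\\xc3')

# Every character in the title is plain ASCII.
only_ascii = all(ord(ch) < 128 for ch in x)

# So encoding to Latin-1 and decoding as UTF-8 gives back the same string.
roundtrip = x.encode('latin1').decode('utf8')

print(escape_length, only_ascii, roundtrip == x)
```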

Found a solution to this:

x = 'this is a test of the Spanish word m\\xc3\\xa1s'
x = x.encode('latin1').decode('unicode_escape').encode('latin1').decode('utf8')
print(x)
this is a test of the Spanish word más
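
For readers who want to see what each step of that one-liner does, here is the same chain split into named stages. This is only a sketch: the function name `fix_escaped_utf8` is made up for illustration, and it assumes the input contains nothing outside Latin-1 and that the escaped bytes form valid UTF-8.

```python
def fix_escaped_utf8(text):
    """Turn literal '\\xNN' escapes that spell UTF-8 bytes into real characters."""
    # 1. Get bytes that still contain the literal backslash sequences.
    raw = text.encode('latin1')
    # 2. Interpret the backslash escapes; each \xNN becomes one code point.
    unescaped = raw.decode('unicode_escape')
    # 3. Map those code points back to single bytes (Latin-1 is one-to-one for 0-255).
    utf8_bytes = unescaped.encode('latin1')
    # 4. Decode the resulting bytes as the UTF-8 they were meant to be.
    return utf8_bytes.decode('utf8')


fixed = fix_escaped_utf8('this is a test of the Spanish word m\\xc3\\xa1s')
print(fixed)
```

Two caveats with this approach: a title containing a character outside Latin-1 (for example a euro sign) raises UnicodeEncodeError at the first step, and any other backslash sequence in the text, such as \n, is interpreted as well.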
