在python中将拉丁字符串转换为unicode

Question

I am working o scrapy, I scraped some sites and stored the items from the scraped page in to json files, but some of them are containing the following format. 我正在使用scrapy，我抓了一些网站并将抓取页面中的项目存储到json文件中，但其中一些包含以下格式。

l = ["Holding it Together",
     "Fowler RV Trip",
     "S\u00e9n\u00e9gal - Mali - Niger","H\u00eatres et \u00e9tang",
     "Coll\u00e8ge marsan","N\u00b0one",
     "Lines through the days 1 (Arabic) \u0633\u0637\u0648\u0631 \u0639\u0628\u0631 \u0627\u0644\u0623\u064a\u0627\u0645 1",
     "\u00cdndia, Tail\u00e2ndia &amp; Cingapura"]

I can expect that the list consists of different format, but i want to convert that and store the strings in the list with their original names like below 我可以预期该列表包含不同的格式，但我想转换它并将列表中的字符串与其原始名称一起存储，如下所示

l = ["Holding it Together",
     "Fowler RV Trip",
     "Lines through the days 1 (Arabic) سطور عبر الأيام 1 | شمس الدين خ | Blogs"         ,
     "Índia, Tailândia & Cingapura "]

Thanks in advance........... 提前致谢...........

Answer 1

You have byte strings containing unicode escapes. 您有包含unicode转义的字节字符串。 You can convert them to unicode with the unicode_escape codec: 您可以使用unicode_escape编解码器将它们转换为unicode：

>>> print "H\u00eatres et \u00e9tang".decode("unicode_escape")
Hêtres et étang

And you can encode it back to byte strings: 您可以将其编码回字节字符串：

>>> s = "H\u00eatres et \u00e9tang".decode("unicode_escape")
>>> s.encode("latin1")
'H\xeatres et \xe9tang'

You can filter and decode the non-unicode strings like: 您可以过滤和解码非unicode字符串，如：

for s in l: 
    if not isinstance(s, unicode): 
        print s.decode('unicode_escape')

Answer 2

i want to convert that and store the strings in the list with their original names like below 我想转换它并将字符串存储在列表中，其原始名称如下所示

When you serialise to JSON, there may be a flag that allows you to turn off the escaping of non-ASCII characters to \\u\u003c/code> sequences. 当您序列化为JSON时，可能会有一个标志，允许您关闭非ASCII字符到\\u\u003c/code>序列的转义。 If you are using the standard library json module, it's ensure_ascii : 如果您使用的是标准库json模块，那就是ensure_ascii ：

>>> print json.dumps(u'Índia')
"\u00cdndia"
>>> print json.dumps(u'Índia', ensure_ascii= False)
"Índia"

However be aware that with that safety measure taken away you now have to be able to deal with non-ASCII characters in a correct way, or you'll get a bunch of UnicodeError s. 但请注意，随着安全措施的消失，您现在必须能够以正确的方式处理非ASCII字符，否则您将获得一堆UnicodeError 。 For example if you are writing the JSON to a file you must explicitly encode the Unicode string to the charset you want (for example UTF-8). 例如，如果要将JSON写入文件，则必须将Unicode字符串显式编码为所需的字符集（例如UTF-8）。

j= json.dumps(u'Índia', ensure_ascii= False)
open('file.json', 'wb').write(j.encode('utf-8'))

在python中将拉丁字符串转换为unicode

问题描述

2 个解决方案

解决方案1
7 已采纳 2012-05-25 07:54:31

解决方案2
1 2012-05-26 08:25:44

在python中将拉丁字符串转换为unicode

问题描述

2 个解决方案

解决方案1 7 已采纳 2012-05-25 07:54:31

解决方案2 1 2012-05-26 08:25:44

解决方案1
7 已采纳 2012-05-25 07:54:31

解决方案2
1 2012-05-26 08:25:44