How to convert a unicode string in Unicode escape format with Python?

I'm a student learning Python and Scrapy (a crawler).

I want to convert a unicode string to str in Python, but this is not an ordinary unicode string: its content is written in Unicode escape format. Please see the code below.

# python 2.7
...
print(type(name[0]))
print(name[0])
print(type(keyword_name_temp))
print(keyword_name_temp)
...

When I run the script above, I see the following on the console.

$ <type 'unicode'>
$ 서용교 ## these are Korean characters
$ <type 'unicode'>
$ u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'

I want to see "keyword_name_temp" displayed as Korean, but I don't know how to do it.

I got the name list and keyword_name_temp from HTML code via an HTTP request.

The name list was originally in string format.

keyword_name_temp was originally in Unicode escape format.

Can anybody please help me?

The simplest solution is to switch to Python 3, where strings are Unicode by default.

u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4' contains real backslashes followed by u and hex digits, not the literal Unicode characters U+C9C0 etc., which are commonly written using the \uXXXX escape sequence. (Backslash is the escape character in Python string literals, so the Python interpreter displays a real backslash inside a string as \\.) Would that string happen to come from some JSON object, perhaps?
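To make the difference concrete, here is a minimal sketch written for Python 3 (an assumption; the question itself uses Python 2.7). The variable names `escaped` and `real` are illustrative only. It compares the string that holds backslash sequences with the string that holds the actual Hangul characters.

```python
# Python 3 sketch; `escaped` and `real` are illustrative names, not from the question.

# Six groups of: one real backslash, the letter u, four hex digits.
escaped = "\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4"

# Six actual Hangul characters, written with \uXXXX escapes in the literal.
real = "\uc9c0\ubc29\uc790\uce58\ub2e8\uccb4"

print(len(escaped))   # 36: each group is six separate characters
print(len(real))      # 6: each escape became one character
print(escaped[:6])    # \uc9c0
print(real)
```

The first string is plain ASCII text that merely looks like escapes, which is why printing it never shows Korean.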

You can construct a JSON string out of it and use json.loads() to turn it into a unicode string:

Example in Python 2.7:

>>> s1 = u'서용교'
>>> type(s1)
<type 'unicode'>
>>> s1
u'\uc11c\uc6a9\uad50'
>>> print(s1)
서용교
>>> 
>>> 
>>> s2 = u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'
>>> type(s2)
<type 'unicode'>
>>>
>>> # put that unicode string between double-quotes
>>> # so that json module can interpret it
>>> ts2 = u'"%s"' % s2
>>> ts2
u'"\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4"'
>>>
>>> import json
>>> json.loads(ts2)
u'\uc9c0\ubc29\uc790\uce58\ub2e8\uccb4'
>>> print(json.loads(ts2))
지방자치단체
>>> 
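The same idea can be wrapped in a small function. This is only a sketch for Python 3 (an assumption, since the session above is Python 2.7), and `unescape_with_json` is a hypothetical helper name. It also assumes the input contains no unescaped double quotes or raw control characters, because those would make the wrapped text invalid JSON.

```python
import json


def unescape_with_json(escaped):
    # Hypothetical helper: wrap the text in double quotes so that the json
    # module reads it as a JSON string and resolves the \uXXXX sequences.
    # Assumption: `escaped` has no unescaped double quotes or control characters.
    return json.loads('"%s"' % escaped)


s2 = "\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4"
decoded = unescape_with_json(s2)

print(type(decoded))  # <class 'str'>
print(len(decoded))   # 6
print(decoded)
```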

Another option is to turn it into a string literal and evaluate it with ast.literal_eval():

>>> import ast
>>>
>>> # construct a string literal, with the 'u' prefix
>>> s2_literal = u'u"%s"' % s2
>>> s2_literal
u'u"\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4"'
>>> print(ast.literal_eval(s2_literal))
지방자치단체
>>> 
>>> # also works with single-quotes string literals
>>> s2_literal2 = u"u'%s'" % s2
>>> s2_literal2
u"u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'"
>>> 
>>> print(ast.literal_eval(s2_literal2))
지방자치단체
>>> 
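For comparison, a rough Python 3 version of the literal approach might look like the sketch below. It also shows the built-in unicode_escape codec, which is a different technique from the one in the answer above and is included only as an alternative. The names `via_literal` and `via_codec` are illustrative, and the codec route assumes the input is pure ASCII.

```python
import ast

s2 = "\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4"

# Python 3: plain string literals are already Unicode, so no u prefix is needed.
via_literal = ast.literal_eval('"%s"' % s2)

# Alternative technique (not from the answer above): the unicode_escape codec.
# encode("ascii") raises UnicodeEncodeError if s2 holds non-ASCII characters.
via_codec = s2.encode("ascii").decode("unicode_escape")

print(via_literal)
print(via_codec)
print(via_literal == via_codec)  # True
```

Like the JSON approach, the literal approach fails if the text contains an unescaped quote character of the kind used to wrap it.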

Your string is unicode, and if you know the encoding (for example, UTF-8), you can try:

print name[0].decode("utf-8")
