
How to convert a unicode string that contains Unicode escape sequences in Python?

I'm a student learning Python and Scrapy (a crawler).

I want to convert a unicode string to str in Python, but this is not an ordinary unicode string: its content is itself written as Unicode escape sequences. Please see the code below.

# python 2.7
...
print(type(name[0]))
print(name[0])
print(type(keyword_name_temp))
print(keyword_name_temp)
...

When I run the script above, the console shows the following:

$ <type 'unicode'>
$ 서용교 ## these are Korean characters
$ <type 'unicode'>
$ u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'

I want to see keyword_name_temp as Korean text, but I don't know how to do that.

I got both the name list and keyword_name_temp from HTML code fetched with an HTTP request.

The name list was originally in string format.

keyword_name_temp was originally in unicode (escaped) format.

Can anybody help me?

The simplest solution is to switch to Python 3, where strings are Unicode by default.
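Switching to Python 3 does not by itself undo the escaping, so the value still has to be decoded once. A minimal Python 3 sketch, assuming the scraped value is the same backslash-escaped text shown in the question; the variable names are illustrative, and the `unicode_escape` codec is a technique swapped in here rather than one named in the answer:

```python
# Python 3: str is already a Unicode type, no u'' prefix needed.
name = "\uc11c\uc6a9\uad50"  # real code points (the name from the question)

# Literal backslashes followed by 'u' and four hex digits, as scraped.
escaped = "\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4"

# The escaped text is 36 ASCII characters, not 6 Korean ones.
escaped_length = len(escaped)

# unicode_escape interprets \uXXXX sequences; it decodes bytes,
# so the pure-ASCII text is encoded first.
decoded = escaped.encode("ascii").decode("unicode_escape")
decoded_length = len(decoded)

# print(decoded) shows the Korean text on a UTF-8 capable terminal.
```

The `encode("ascii")` step fails loudly if the value contains non-ASCII characters, which is a useful signal that the input is not purely escaped text.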

u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4' contains real backslashes followed by u and hex digits, not the Unicode characters U+C9C0 etc. themselves, which are commonly written with the \uXXXX escape sequence. (Backslash is an escape character in Python string literals, so the interpreter prints a real backslash in a string's repr as \\.) Would that string happen to come from some JSON object, perhaps?

You can construct a JSON string out of it, and use json.loads() to transform to a unicode string:

Example in Python 2.7:

>>> s1 = u'서용교'
>>> type(s1)
<type 'unicode'>
>>> s1
u'\uc11c\uc6a9\uad50'
>>> print(s1)
서용교
>>> 
>>> 
>>> s2 = u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'
>>> type(s2)
<type 'unicode'>
>>>
>>> # put that unicode string between double-quotes
>>> # so that json module can interpret it
>>> ts2 = u'"%s"' % s2
>>> ts2
u'"\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4"'
>>>
>>> import json
>>> json.loads(ts2)
u'\uc9c0\ubc29\uc790\uce58\ub2e8\uccb4'
>>> print(json.loads(ts2))
지방자치단체
>>> 
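The json.loads() steps above can be wrapped in a small function. This is only a sketch: `decode_json_escapes` is a hypothetical helper name, and it assumes the input contains no unescaped double quotes or raw control characters, which would make the wrapped text invalid JSON.

```python
import json


def decode_json_escapes(text):
    """Interpret JSON-style \\uXXXX escapes in text (hypothetical helper)."""
    # Wrap the text in double quotes so it becomes a JSON string literal.
    # Assumption: text has no unescaped double quotes or control characters.
    return json.loads(u'"%s"' % text)


# The escaped value from the question: real backslashes, not Korean characters.
keyword_name_temp = u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'
keyword_name = decode_json_escapes(keyword_name_temp)
```

The function runs unchanged on Python 2.7 and Python 3, since both accept the u'' prefix and json.loads() returns text in either case.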

Another option is to turn it into a string literal and evaluate it with ast.literal_eval():

>>> import ast
>>>
>>> # construct a string literal, with the 'u' prefix
>>> s2_literal = u'u"%s"' % s2
>>> s2_literal
u'u"\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4"'
>>> print(ast.literal_eval(s2_literal))
지방자치단체
>>> 
>>> # also works with single-quotes string literals
>>> s2_literal2 = u"u'%s'" % s2
>>> s2_literal2
u"u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'"
>>> 
>>> print(ast.literal_eval(s2_literal2))
지방자치단체
>>> 

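If your string is a byte string and you know its encoding, for example utf-8, you can try decoding it (a value that is already unicode needs no decoding, and calling .decode on it can raise UnicodeEncodeError in Python 2):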

print name[0].decode("utf-8")
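Whether .decode() applies depends on the type of the value. A minimal sketch that decodes only byte strings and passes text through untouched; `to_text` is a hypothetical helper, and utf-8 is an assumed encoding that should match whatever the HTTP response actually used:

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding only byte strings (hypothetical helper)."""
    # bytes is an alias of str on Python 2 and a distinct type on Python 3.
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text (unicode on Python 2, str on Python 3): nothing to decode.
    return value


text = u'\uc11c\uc6a9\uad50'   # the name from the question, as text
raw = text.encode('utf-8')     # the same name as UTF-8 bytes

from_bytes = to_text(raw)
from_text = to_text(text)
```

Note that this does not help with keyword_name_temp, whose problem is escaped content rather than byte encoding.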


 