How can I convert strings like “\u5c0f\u738b\u5b50\u003a\u6c49\u6cd5\u82f1\u5bf9\u7167” to Chinese characters?
I am working on a small tool that requests and decodes a webpage, on which the Chinese characters are stored as strings like
\u5c0f\u738b\u5b50\u003a\u6c49\u6cd5\u82f1\u5bf9\u7167
in the source code, which looks like some form of Unicode. I want to convert it to Chinese characters.
I can do this with the website http://rishida.net/tools/conversion/. But how can I do it using Python?
Those are Unicode codepoints already. They represent Chinese characters, but using escape codes that are easier on the developer:
>>> print u'\u5c0f\u738b\u5b50\u003a\u6c49\u6cd5\u82f1\u5bf9\u7167'
小王子:汉法英对照
You do not have to do anything to convert those; the \uxxxx escape form is simply another way to express the same codepoint. See String Literals:
    \uxxxx      Character with 16-bit hex value xxxx (Unicode only)
    \Uxxxxxxxx  Character with 32-bit hex value xxxxxxxx (Unicode only)
Python interprets those escape codes when reading the source code to construct the unicode value.
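To make the point concrete, here is a minimal sketch. It assumes Python 3, where every string literal is Unicode and print is a function, unlike the Python 2 session shown above. The variable names are only illustrative.

```python
# Python 3 sketch: an escaped literal and a directly typed literal
# produce exactly the same string object contents.
escaped = '\u5c0f\u738b\u5b50\u003a\u6c49\u6cd5\u82f1\u5bf9\u7167'
direct = '小王子:汉法英对照'

print(escaped == direct)  # the two spellings compare equal
print(len(escaped))       # 9 characters, one per \uxxxx escape

# The 16-bit and 32-bit escape forms can name the same codepoint.
same_char = '\u5c0f' == '\U00005c0f' == '小'
print(same_char)
```

The escapes are resolved when the literal is compiled, so no conversion step exists at runtime.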
If the data does not come from Python source code but from the web, you have JSON data instead, which uses the same escape format:
>>> import json
>>> print json.loads('"\u5c0f\u738b\u5b50\u003a\u6c49\u6cd5\u82f1\u5bf9\u7167"')
小王子:汉法英对照
Note that the value then needs to be part of a larger string, one that at least includes quotes to mark this as a string.
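As a rough sketch of that quoting step for text fetched at runtime, assuming Python 3: the helper name decode_json_escapes is hypothetical, and it assumes the fragment contains no unescaped double quotes or control characters. If the escapes sit inside a complete JSON document, parsing the whole document with json.loads is the safer route.

```python
import json


def decode_json_escapes(raw):
    """Hypothetical helper: decode a bare fragment of JSON string content."""
    # Wrap the fragment in double quotes so it forms a valid JSON string,
    # then let the JSON parser resolve the escape sequences.
    return json.loads('"' + raw + '"')


# A raw literal keeps the backslashes as real characters, which is how
# the text would look after being read from a web response.
fetched = r'\u5c0f\u738b\u5b50\u003a\u6c49\u6cd5\u82f1\u5bf9\u7167'
decoded = decode_json_escapes(fetched)
print(decoded)
```

The raw-string prefix matters in Python 3: without it, Python itself would interpret the escapes before json.loads ever saw them.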
Also note that the JSON string escape format differs from Python's when it comes to non-BMP (supplementary) codepoints; JSON treats those like UTF-16 does, by creating a surrogate pair and using two \uxxxx sequences for such a codepoint. In Python you'd use a \Uhhhhhhhh 32-bit hex value.
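The difference can be seen with a supplementary CJK character. This is a small sketch assuming Python 3 and the standard json module with its default ensure_ascii=True setting; U+20000 is chosen only as an example codepoint outside the BMP.

```python
import json

# One supplementary codepoint, written with Python's 32-bit escape.
supplementary = '\U00020000'
print(len(supplementary))  # a single character in Python 3

# JSON output spells it as a UTF-16 surrogate pair of two \uxxxx escapes.
as_json = json.dumps(supplementary)
print(as_json)

# Parsing the surrogate pair yields the single codepoint again.
round_trip = json.loads(r'"\ud840\udc00"')
print(round_trip == supplementary)
```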