Representing non-English characters with Unicode (UTF-8)

I am working with an HTML string in Python that contains non-English characters, which are represented in the string by 16-bit Unicode hex escapes. The string reads:

"Skr\u00E4ddarev\u00E4gen"

The string, when properly converted, should read "Skräddarevägen". How do I ensure that the Unicode hex escapes are correctly decoded on output and display with the correct accents?

(Note: I'm using Requests and Pandas, and the encoding in both is set to utf-8.) Thanks in advance!

In Python 3, the following can happen:

  1. If you pick up your string from an HTML file, you have to read in the HTML file using the correct encoding.
  2. If you have your string in Python 3 code, it is already a Unicode string (a sequence of code points) in memory.

When you write the string out to a file, you have to specify the encoding you want when opening the file.
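As a minimal sketch of the read and write points above, the following round-trips the street name through a file with an explicit encoding. The file name `street.html` and the temporary directory are assumptions made purely for illustration.

```python
import os
import tempfile

text = "Skräddarevägen"

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "street.html")

# Writing: name the encoding explicitly instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading: use the same encoding the file was written with.
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

# Reading in binary mode shows the UTF-8 bytes actually stored on disk.
with open(path, "rb") as f:
    raw_bytes = f.read()

print(round_tripped == text)
print(raw_bytes)
```

Reading the same file with a different encoding (for example `latin-1`) would not raise an error here, but each "ä" would come back as two wrong characters, which is the usual symptom of a mismatched encoding.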

From your display, it is hard to be sure what is in the string. Assuming that it is the 24 characters displayed, I believe the last line of the following answers your question.

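s = "Skr\\u00E4ddarev\\u00E4gen"  # literal backslashes: 24 characters
print(len(s))
for c in s:
    print(c, end=' ')  # show each character separately
print()
# Evaluate s as a string literal so the \u escapes are interpreted.
# Note: eval is unsafe on untrusted input.
print(eval("'" + s + "'"))
print(eval("'" + s + "'").encode('utf-8'))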

This prints:

24
S k r \ u 0 0 E 4 d d a r e v \ u 0 0 E 4 g e n 
Skräddarevägen
b'Skr\xc3\xa4ddarev\xc3\xa4gen'
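Because `eval` executes arbitrary code, a more cautious variant of the same idea is sketched below using `ast.literal_eval` from the standard library, which only accepts Python literals. It assumes, as above, that the string holds literal backslash escapes and contains no single quotes.

```python
import ast

s = "Skr\\u00E4ddarev\\u00E4gen"  # same 24-character string as above

# Wrap s in quotes and parse it as a string literal; unlike eval,
# literal_eval refuses anything that is not a plain literal.
decoded = ast.literal_eval("'" + s + "'")

print(decoded)
print(decoded.encode("utf-8"))
```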

If you are using Python 3 and that is literally how the string is written in your code (as a string literal), it "just works":

>>> s = "Skr\u00E4ddarev\u00E4gen"
>>> s
'Skräddarevägen'

If you have that string as raw data (that is, with literal backslash-u sequences rather than the characters themselves), you have to decode it. If it is a Unicode string (str), you'll have to encode it to bytes first. The final result will be a Unicode string. If you already have a byte string, skip the encode step.

>>> s = r"Skr\u00E4ddarev\u00E4gen"
>>> s
'Skr\\u00E4ddarev\\u00E4gen'
>>> s.encode('ascii').decode('unicode_escape')
'Skräddarevägen'
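One caveat: `s.encode('ascii')` raises `UnicodeEncodeError` if the string already contains some real non-ASCII characters alongside the escapes. A rough sketch of an alternative for that mixed case is shown below; the helper name `decode_u_escapes` is hypothetical, and it only handles four-digit `\uXXXX` escapes (it does not combine surrogate pairs).

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_u_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Mixed input: two escaped characters plus an already-decoded "ö".
raw = r"Skr\u00E4ddarev\u00E4gen, Malmö"
decoded = decode_u_escapes(raw)
print(decoded)
```

Characters that are already decoded pass through untouched, so the function can be applied safely to strings that may or may not contain escapes.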

If you are on Python 2, you'll need to decode, plus print to see it properly:

>>> s = "Skr\u00E4ddarev\u00E4gen"
>>> s
'Skr\\u00E4ddarev\\u00E4gen'
>>> s.decode('unicode_escape')
u'Skr\xe4ddarev\xe4gen'
>>> print s.decode('unicode_escape')
Skräddarevägen
