
Python understanding unicode conversion

I have a text dataset which had some encoding issues. The author instructed me to run:

for line in fpointer:
    line.encode('latin-1').decode('utf-8')

to fix the issues.

I wanted to see why this was required, so I opened the file before fixing it and saw this line:

103 But in Imax 3-D , the clichÃ©s disappear into the vertiginous perspectives opened up by the photography .

After conversion it became:

103 But in Imax 3-D , the clichés disappear into the vertiginous perspectives opened up by the photography .

That makes sense.

But I could not understand what could have caused the original issue, or how the fix worked.

I referred to the Python Unicode HOWTO: https://docs.python.org/3/howto/unicode.html

I also checked the characters and their values:

The utf-8 encoding for é is c3a9, the iso-8859-1 encoding for Ã is c3, and for © it is a9.

It makes some sense, but I am not able to make the connection.

How exactly is the line stored in the original file, and how did the code snippet fix it?

So, what happened is that the text you had was "double-encoded" as utf-8.

So, at some point in the process that generated your data, text that already had the internal representation "\xc3\xa9" for "é" was interpreted as being in latin-1, and re-encoded from latin-1 (where "\xc3\xa9" represents the two characters "Ã©") to utf-8, so that each of those characters was expanded to two bytes, becoming "\xc3\x83" "\xc2\xa9" (the utf-8 for "Ã©"). As @Novoselov puts it in the other answer, this corruption likely came from opening the file to read as text on Windows without specifying an encoding: Python will think the file is "latin-1", the default Windows encoding (more precisely, usually cp1252 on Western-European setups, which agrees with latin-1 for these bytes), and will therefore read each byte of a utf-8 sequence as a single latin-1 character.

What the fix did: your system is already configured to read text as utf-8, so when you got the lines in the for loop, you got Python 3 strings (Python 2 unicode) with the UTF-8 sequences in the text file correctly decoded. So the 4-byte sequence became 2 text characters. Now, one characteristic of the "latin1" encoding is that it is "transparent": encoding with it is equivalent to performing no transformation at all on the text. In other words, each character whose code point fits in a single byte becomes that single byte in the encoded byte string. (And a character whose value does not fit in a byte can't be encoded as Latin-1 at all, which raises a UnicodeEncodeError.)
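The "transparent" property can be checked directly. The following is only a small sketch using built-in string and bytes methods; the variable names are illustrative.

```python
# Every byte value 0..255 decodes under latin-1 to the code point
# with the same number.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")
same_numbers = [ord(ch) for ch in text] == list(range(256))

# Encoding the result gives back exactly the same bytes.
round_trip_ok = text.encode("latin-1") == all_bytes

# A character above U+00FF (here the euro sign, U+20AC) has no latin-1 byte.
encode_failed = False
try:
    "\u20ac".encode("latin-1")
except UnicodeEncodeError:
    encode_failed = True

print(same_numbers, round_trip_ok, encode_failed)
```

This is why encode('latin-1') can be used to get back the exact bytes that were misread as characters.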

So, after the "transparent" encoding step, you have bytes that represent your text, this time with only "one pass" of utf-8 encoding. Decoding these bytes as "utf-8" then yields the correct text for the file.

Again:

This was the original text: "cliché". Encoded to UTF-8 it becomes b'clich\xc3\xa9'. But the process that created your file treated this sequence as latin-1, so it re-encoded both characters above 0x80 to utf-8: b'clich\xc3\x83\xc2\xa9'. And this is what prints as "clichÃ©".

On reading, Python 3 reads b'clich\xc3\x83\xc2\xa9' from the disk and returns "clichÃ©" to you as (unicode) text. You encode this to bytes and get b'clich\xc3\xa9' with the call to encode('latin-1'). Finally, you decode that as "utf-8", getting the text "cliché".

Python 3 does not easily allow one to spoil text like this. To go from the text to the incorrect version you had, one also has to use the "transparent" encoding "latin-1". Here is an example:
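The same steps can be followed in code. This is a minimal sketch of the round trip described above, not the exact process that produced your file; the variable names are made up for illustration, and escapes are used so the example does not depend on the source file's own encoding.

```python
original = "clich\u00e9"  # "cliché"

# How the damage arises: utf-8 bytes are misread as latin-1,
# then written out as utf-8 a second time.
on_disk = original.encode("utf-8").decode("latin-1").encode("utf-8")

# Reading the file as utf-8 yields the garbled text "clichÃ©".
garbled = on_disk.decode("utf-8")

# The fix: latin-1 turns each character back into the byte it came from...
raw = garbled.encode("latin-1")
# ...and those bytes are the single-pass utf-8 form of the original text.
fixed = raw.decode("utf-8")

print(on_disk)
print(repr(garbled), raw, repr(fixed))
```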

In [10]: a = "cliché"

In [11]: b = a.encode("utf-8")

In [12]: b
Out[12]: b'clich\xc3\xa9'

In [13]: c = b.decode("latin1").encode("utf-8")

In [14]: c
Out[14]: b'clich\xc3\x83\xc2\xa9'

The original text was encoded in utf-8, but some process decoded it as latin1 and then encoded it as utf-8 again.

So to get the original text, you have to reverse this process: you decode the text from the file as utf-8 (this is not included in your snippet, but I guess you open it with utf-8 encoding), then encode it as latin1, then decode it again as utf-8.

From your comment, you say that you are opening a text file in Python 3 without specifying any encoding. In that case, Python uses the system encoding, which on Windows is Latin1 (or, more precisely, usually the closely related cp1252 on Western-European setups).

That is enough to explain what you get if the file was originally utf8 encoded. But IMHO the correct way is to specify the file encoding in the open function:
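If some lines might already be correct, the reversal can be wrapped so that it leaves those lines alone. This is only a sketch under that assumption: the helper name undo_double_utf8 is hypothetical, and a line that is valid either way (for example, text that really contains "Ã©") would still be changed.

```python
def undo_double_utf8(text: str) -> str:
    """Reverse one extra latin-1 -> utf-8 pass, if the text allows it."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character does not fit in one byte, or the bytes are
        # not valid utf-8: the text was probably not double-encoded.
        return text


print(undo_double_utf8("clich\u00c3\u00a9s"))  # garbled input
print(undo_double_utf8("clich\u00e9s"))        # already correct input
```

In the second call, encoding gives the lone byte 0xe9 followed by "s", which is not a valid utf-8 sequence, so the original string is returned unchanged.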

fd = open(filename, encoding='utf8')

That way, you directly get the correct characters with no need for the encode-decode correction.
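To see the difference the encoding argument makes, here is a small self-contained sketch. It writes a throwaway file (the name sample.txt is arbitrary) and passes latin-1 explicitly to stand in for a non-utf-8 system default.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Write "cliché" to disk as utf-8 bytes.
    with open(path, "wb") as f:
        f.write("clich\u00e9".encode("utf-8"))

    # Reading with the matching encoding gives the intended text.
    with open(path, encoding="utf-8") as f:
        right = f.read()

    # Reading the same bytes as latin-1 gives one character per byte.
    with open(path, encoding="latin-1") as f:
        wrong = f.read()

print(repr(right), repr(wrong))
```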
