Python UTF8 encoding

I have looked at other questions around Python and encoding but have not quite found the solution to my problem. Here it is:

I have a small script which attempts to compare 2 lists of files:

  1. A list given in a text file, which is supposed to be encoded in UTF8 (at least Notepad++ detects it as such).

  2. A list from a directory which I build like this:

     local = [f.encode('utf-8') for f in listdir(dir) ] 

However, for some characters, I do not get the same representation: when looking in a hex editor, I find that in list 1 the character é is given by 65 cc, whereas in list 2 it is given by c3 a9 ...

What I would like is to have them in the same encoding, whatever it is.

Your first sequence is incomplete - cc is the prefix for a two-byte UTF-8 sequence. Most probably, the full sequence is 65 cc 81, which indeed is the character e (0x65) followed by a COMBINING ACUTE ACCENT (0x301, which in UTF-8 gets expressed as cc 81).

The other sequence instead is the precomposed LATIN SMALL LETTER E WITH ACUTE character (0xe9, expressed as c3 a9 in UTF-8). You'll notice in the linked page that its decomposition is exactly the first sequence.
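The two byte sequences discussed above can be inspected directly. This is a minimal sketch, assuming Python 3 (where decoding bytes yields a str); the variable names are only illustrative.

```python
import unicodedata

# The two byte sequences from the question, decoded from UTF-8.
decomposed = b"\x65\xcc\x81".decode("utf-8")   # e + combining accent
precomposed = b"\xc3\xa9".decode("utf-8")      # single precomposed character

for label, text in (("decomposed", decomposed), ("precomposed", precomposed)):
    code_points = [hex(ord(ch)) for ch in text]
    names = [unicodedata.name(ch) for ch in text]
    print(label, code_points, names)

# The canonical decomposition recorded for the precomposed character.
print(unicodedata.decomposition(precomposed))
```

Both strings render as é, yet they compare unequal, which is the mismatch seen in the hex editor.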

Unicode normalization

Now, in Unicode there are many instances of different sequences that graphically and/or semantically are the same, and while it's generally a good idea to treat a UTF-8 stream as an opaque binary sequence, this poses a problem if you want to do searching or indexing - looking for one sequence won't match the other, even if they are graphically and semantically the same thing. For this reason, Unicode defines four types of normalization, which can be used to "flatten" these kinds of differences and obtain the same code points from both the composed and decomposed forms. For example, the NFC and NFKC normalization forms in this case will give the 0xe9 code point for both your sequences, while NFD and NFKD will give the 0x65 0x301 decomposed form.
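To see what the four forms do to this particular pair, here is a small sketch, assuming Python 3 and the standard unicodedata module; it only covers the é example from the question.

```python
import unicodedata

decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE

for form in ("NFC", "NFKC", "NFD", "NFKD"):
    from_decomposed = unicodedata.normalize(form, decomposed)
    from_precomposed = unicodedata.normalize(form, precomposed)
    # Within one form, both inputs end up as the same code points.
    print(form,
          [hex(ord(ch)) for ch in from_decomposed],
          from_decomposed == from_precomposed)
```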

To do this in Python you'll first have to decode your UTF-8 str objects to unicode objects, and then use the unicodedata.normalize function.
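A sketch of that decode-then-normalize step follows. It assumes Python 3, where the UTF-8 data is a bytes object and decoding gives a str (the Python 2 equivalents are str and unicode). The helper name normalize_utf8 and the sample file names are hypothetical.

```python
import unicodedata

def normalize_utf8(raw, form="NFC"):
    """Decode UTF-8 bytes, normalize the text, and re-encode as UTF-8."""
    text = raw.decode("utf-8")
    return unicodedata.normalize(form, text).encode("utf-8")

# Hypothetical names: one as read from the text file, one as built from listdir.
from_text_file = b"cafe\xcc\x81.txt"
from_listdir = b"caf\xc3\xa9.txt"

print(from_text_file == from_listdir)
print(normalize_utf8(from_text_file) == normalize_utf8(from_listdir))
```

The raw byte strings differ, while their normalized forms compare equal regardless of which form is chosen, as long as the same form is used on both sides.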

Important note: don't normalize unless you are implementing "intelligent" indexing/searching, and use the normalized data only for this purpose - i.e. index and search normalized, but store/provide to the user the original form. Normalization is a lossy operation (some forms particularly so); applying it blindly over user data is like walking into a pottery shop with a sledgehammer.
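The following sketch illustrates both halves of that note, assuming Python 3. The first part shows a compatibility form discarding information; the second keeps the original strings and uses normalized text only as a lookup key. The search_key helper and the sample strings are hypothetical.

```python
import unicodedata
from collections import defaultdict

# Compatibility forms drop distinctions that cannot be recovered afterwards.
ligature = "\ufb01le"      # starts with LATIN SMALL LIGATURE FI
superscript = "x\u00b2"    # x followed by SUPERSCRIPT TWO
print(unicodedata.normalize("NFKC", ligature))
print(unicodedata.normalize("NFKC", superscript))

def search_key(text):
    """Normalized form used for lookups only, never stored as the data."""
    return unicodedata.normalize("NFKC", text)

# Index by normalized key, but keep every original string untouched.
index = defaultdict(list)
for original in ("caf\u00e9", "cafe\u0301", "\ufb01le"):
    index[search_key(original)].append(original)

hits = index[search_key("file")]
print(hits)
```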

File paths

Ok, this was about Unicode in general. Talking about filesystem paths is both simpler and more complicated.

In principle, virtually all common filesystems on Windows and Linux treat paths as opaque character [1] sequences (modulo the directory separator and possibly the NUL character), with no particular normalization form applied [2]. So, in a given directory you can have two file names that look the same but are indeed different:

[Image: two files named é in a terminal]
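The situation of two look-alike names living side by side can be reproduced with a short experiment. This is a sketch assuming Python 3 and a typical Linux filesystem; some other systems (macOS volumes, for instance) may normalize names or treat them as equivalent, in which case only one entry will appear. Byte paths are used so the result does not depend on the locale.

```python
import os
import tempfile
import unicodedata

precomposed = "\u00e9"                                    # one code point
decomposed = unicodedata.normalize("NFD", precomposed)    # e + combining accent

with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    # Create one empty file per spelling of the name.
    for name in (precomposed, decomposed):
        path = os.path.join(tmp_bytes, name.encode("utf-8"))
        with open(path, "wb"):
            pass
    # List the directory as raw bytes to see what was actually stored.
    entries = sorted(os.listdir(tmp_bytes))

print(entries)
```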

So, when dealing with file paths, in principle you should never normalize - again, file paths are an opaque sequence of code points (actually, an opaque sequence of bytes on Linux) which should not be messed with.

However, if the list you receive and have to deal with is normalized differently (which probably means either that it has been passed through broken software that "helpfully" normalizes composed/decomposed sequences, or that the name has been typed in by hand), you'll have to perform some normalized matching.

If I were to deal with a similar (broken by definition) scenario, I'd do something like this:

  • first try to match exactly;
  • if this fails, try to match the normalized file name against a set containing the normalized content of the directory; notice that, if multiple original names are mapped to the same normalized name and you don't match it exactly, you have no way to know which one is the "right one".
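The two steps above could be sketched roughly as follows, assuming Python 3 and names already decoded to str. The function names build_index and find_match, the result labels, and the choice of NFC are all hypothetical choices made for illustration.

```python
import unicodedata
from collections import defaultdict

def build_index(names, form="NFC"):
    """Map each normalized name to the original names that produce it."""
    index = defaultdict(list)
    for name in names:
        index[unicodedata.normalize(form, name)].append(name)
    return index

def find_match(wanted, names, index, form="NFC"):
    """Return (matched_name_or_None, how), trying an exact match first."""
    if wanted in names:
        return wanted, "exact"
    candidates = index.get(unicodedata.normalize(form, wanted), [])
    if len(candidates) == 1:
        return candidates[0], "normalized"
    if candidates:
        # Several originals collapse to the same normalized name.
        return None, "ambiguous"
    return None, "missing"

directory = {"caf\u00e9.txt", "plain.txt"}
index = build_index(directory)
print(find_match("cafe\u0301.txt", directory, index))
```

Keeping a list of originals per normalized key is what makes the ambiguous case detectable rather than silently picking one file.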

Footnotes

  1. Linux-native filesystems all use 8-bit byte-based paths - they may be in whatever encoding, the kernel doesn't care, although recent systems generally happen to use UTF-8; Windows-native filesystems will instead use 16-bit word-based paths, which nominally contain UTF-16 (originally UCS-2) values.
  2. On Windows it's a bit more complicated at the API level, since there's the whole ANSI API mess that performs codepage conversion, and case-insensitive matching for Win32 paths adds one more level of complication, but down at kernel and filesystem level it's all opaque 2-byte WCHAR strings.
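As a side note to footnote 1, Python 3 on POSIX systems exposes this byte-oriented view through os.fsencode and os.fsdecode. The sketch below assumes a POSIX system; the file name is hypothetical and deliberately not valid UTF-8.

```python
import os

# A name in Latin-1 bytes; the kernel accepts it even though it is not UTF-8.
raw_name = b"caf\xe9.txt"

# fsdecode keeps undecodable bytes as escape code points instead of failing,
# so that fsencode can reproduce the original bytes exactly.
as_text = os.fsdecode(raw_name)
round_tripped = os.fsencode(as_text)

print(repr(as_text), round_tripped == raw_name)
```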

At the top of your file add these:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

Hope this helps!
