How do I detect if a file is encoded using UTF-8?
Is there a way to recognize whether a text file is UTF-8 in Python?
I only need to know whether the file is UTF-8 or not. I don't need to detect other encodings.
You mentioned in a comment that you only need to detect UTF-8. If you know the alternative is always a single-byte encoding, there is a solution that often works.
If you know the file is either UTF-8 or a single-byte encoding such as latin-1, try opening it as UTF-8 first and fall back to the other encoding if decoding fails. If the file contains only ASCII characters, it will end up opened as UTF-8 even if it was intended as the other encoding; for ASCII-compatible encodings such as latin-1 the decoded text is identical either way. If it contains any non-ASCII characters, this will almost always pick the right character set of the two.
    try:
        # use codecs.open on Python <= 2.5,
        # or io.open on Python 2.6 and 2.7
        filedata = open(filename, encoding='UTF-8').read()
    except UnicodeDecodeError:
        # not valid UTF-8: fall back to the known single-byte encoding
        filedata = open(filename, encoding='other-single-byte-encoding').read()
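If all that is needed is a yes/no answer, the same idea can be applied to the raw bytes without choosing a fallback encoding. The following is a minimal sketch, assuming the file is small enough to read into memory; `is_utf8` is a hypothetical helper name, not part of any library.

```python
def is_utf8(path):
    """Return True if the file's bytes decode as strict UTF-8."""
    # Read raw bytes so no decoding happens implicitly on open().
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')  # error handling defaults to 'strict'
    except UnicodeDecodeError:
        return False
    return True
```

A pure-ASCII file also returns True here, because ASCII is a subset of UTF-8. The check says the bytes are valid UTF-8, not that the author intended UTF-8.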
Your best bet is to use the chardet package from PyPI, either directly or through UnicodeDammit from BeautifulSoup:
chardet 1.0.1
Universal encoding detector
Detects:
- ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
- Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
- EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
- EUC-KR, ISO-2022-KR (Korean)
- KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
- ISO-8859-2, windows-1250 (Hungarian)
- ISO-8859-5, windows-1251 (Bulgarian)
- windows-1252 (English)
- ISO-8859-7, windows-1253 (Greek)
- ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
- TIS-620 (Thai)
Requires Python 2.1 or later
However, some files will be valid in multiple encodings, so chardet is not a panacea.
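To see why the same bytes can be valid in several encodings, here is a small standard-library sketch. `valid_encodings` is a hypothetical helper, and the candidate list is an arbitrary choice for illustration.

```python
def valid_encodings(data, candidates=('utf-8', 'latin-1', 'cp1251')):
    """Return the candidates under which data decodes without error."""
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok

sample = 'é'.encode('utf-8')    # the two bytes 0xC3 0xA9
print(valid_encodings(sample))  # all three candidates accept these bytes
```

The bytes decode cleanly under every candidate but produce different text in each, which is why a detector has to rely on statistics rather than validity alone. Note that latin-1 accepts every possible byte sequence, so a successful latin-1 decode carries no information.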
Reliably? No.
In general, a byte sequence has no meaning unless you know how to interpret it. This goes for text files, but also for integers, floating point numbers, etc.
There are, however, ways of guessing the encoding of a file: look at the byte order mark (if there is one) and at the first chunk of the file (to see which encoding yields the most sensible characters). The chardet library is pretty good at this, but be aware that it is only a heuristic, albeit a rather powerful one.
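The byte order mark part of that guess can be sketched with the standard library alone. This is an illustration under assumptions, not how chardet is implemented; `bom_encoding` is a hypothetical helper name.

```python
import codecs

# Check the UTF-32 BOMs first: the UTF-32 LE BOM (FF FE 00 00)
# starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def bom_encoding(data):
    """Return an encoding name suggested by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A result of None does not rule out UTF-8, since many UTF-8 files are written without a BOM. In that case the content itself has to be examined.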