How can I detect whether a file is UTF-8 encoded?

Question

Is there a way to tell whether a text file is UTF-8 encoded in Python?

I only need to know whether the file is UTF-8 or not. I don't need to detect other encodings.

Answer 1

You mentioned in a comment that you only need to detect UTF-8. If you know the alternative consists only of single-byte encodings, then there is a solution that often works.

If you know the file is either UTF-8 or a single-byte encoding such as latin-1, try opening it as UTF-8 first and fall back to the other encoding if that fails. If the file contains only ASCII characters, it will end up opened as UTF-8 even if it was intended as the other encoding; this is harmless when the other encoding is ASCII-compatible, because the decoded text is the same either way. If it contains any non-ASCII characters, this will almost always pick the right character set of the two.

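try:
    # Use codecs.open on Python <= 2.5, or io.open on Python 2.6 and 2.7.
    with open(filename, encoding='UTF-8') as f:
        filedata = f.read()
except UnicodeDecodeError:
    # Not valid UTF-8: fall back to the known single-byte encoding.
    with open(filename, encoding='other-single-byte-encoding') as f:
        filedata = f.read()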
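If all you need is a yes/no answer, the same idea can be reduced to a strict decode of the raw bytes. The following is a minimal sketch, not part of the original answer; `is_utf8` and `chunk_size` are names chosen here for illustration. It reads the file in chunks through an incremental decoder so that large files need not fit in memory and multi-byte sequences split across chunk boundaries are still handled.

```python
import codecs


def is_utf8(path, chunk_size=64 * 1024):
    """Return True if the file at `path` decodes as strict UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        with open(path, "rb") as f:
            # Feed the decoder one chunk at a time until EOF (empty bytes).
            for chunk in iter(lambda: f.read(chunk_size), b""):
                decoder.decode(chunk)
            # final=True raises if the file ends in a truncated sequence.
            decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

Note that a pure-ASCII file also returns True, since ASCII is a subset of UTF-8, and that a True result only means the bytes are valid UTF-8, not that the author intended them as such.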

Your best bet is to use the chardet package from PyPI, either directly or through UnicodeDammit from BeautifulSoup:

chardet 1.0.1

Universal encoding detector

Detects:

ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)

Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)

EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)

EUC-KR, ISO-2022-KR (Korean)

KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)

ISO-8859-2, windows-1250 (Hungarian)

ISO-8859-5, windows-1251 (Bulgarian)

windows-1252 (English)

ISO-8859-7, windows-1253 (Greek)

ISO-8859-8, windows-1255 (Visual and Logical Hebrew)

TIS-620 (Thai)

Requires Python 2.1 or later

However, some files are valid in multiple encodings, so chardet is not a panacea.
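For reference, here is a minimal sketch of calling chardet directly, assuming the package is installed (for example with `pip install chardet`). `guess_encoding` and `sample_size` are hypothetical names used only for this illustration; `chardet.detect` is the library call, which takes bytes and returns a dict that includes `encoding` and `confidence` keys.

```python
try:
    import chardet  # third-party package from PyPI
except ImportError:  # keep this sketch importable when chardet is missing
    chardet = None


def guess_encoding(path, sample_size=64 * 1024):
    """Return (encoding, confidence) guessed from the start of the file."""
    if chardet is None:
        raise RuntimeError("chardet is not installed")
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    # detect() returns a dict; 'encoding' may be None if nothing matched.
    result = chardet.detect(sample)
    return result["encoding"], result["confidence"]
```

The confidence value is worth checking before trusting the guess, especially for short files, where there is little evidence to go on.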

Answer 2

Reliably? No.

In general, a byte sequence has no meaning unless you know how to interpret it. This goes for text files, but also for integers, floating point numbers, etc.
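A small illustration of that point, using only the standard library (the byte values are arbitrary and chosen for this example): the same bytes are equally valid as UTF-8 text, latin-1 text, an integer, or a float, and nothing in the bytes themselves says which reading was intended.

```python
import struct

raw = b"\xc3\xa9\x00\x00"

# The first two bytes read as text in two different encodings.
as_utf8 = raw[:2].decode("utf-8")      # 'é'
as_latin1 = raw[:2].decode("latin-1")  # 'Ã©'

# All four bytes read as a little-endian unsigned int and as a float.
as_uint = struct.unpack("<I", raw)[0]  # 43459
as_float = struct.unpack("<f", raw)[0]

print(as_utf8, as_latin1, as_uint, as_float)
```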

There are, however, ways of guessing the encoding of a file: look at the byte order mark (if there is one) and at the first chunk of the file (to see which encoding yields the most sensible characters). The chardet library is pretty good at this, but be aware that it is only a heuristic, albeit a rather powerful one.
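The byte order mark check can be sketched with the BOM constants from the standard `codecs` module. `sniff_bom` is a hypothetical helper name for this illustration; it returns a codec name that consumes the BOM when decoding, or None when no BOM is present.

```python
import codecs

# Check the UTF-32 BOMs first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name based on a leading BOM in `data`, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A None result says nothing either way: most UTF-8 files carry no BOM at all, so this check can only confirm an encoding, never rule one out.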

2 answers

Answer 1: score 19, accepted, 2012-04-14 18:19:47

Answer 2: score 3, 2012-04-14 18:20:38
