
How do I detect if a file is encoded using UTF-8?

Is there a way to recognize whether a text file is UTF-8 in Python?

I really just want to know whether the file is UTF-8 or not. I don't need to detect other encodings.

You mentioned in a comment that you only need to detect UTF-8. If you know the alternative consists only of single-byte encodings, there is a solution that often works.

If you know it's either UTF-8 or a single-byte encoding like latin-1, try opening it first as UTF-8 and then in the other encoding. If the file contains only ASCII characters, it will end up opened as UTF-8 even if it was intended as the other encoding; for ASCII-compatible encodings such as latin-1 this is harmless, because ASCII bytes decode to the same text either way. If it contains any non-ASCII characters, this will almost always pick the right character set of the two.

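try:
    # or codecs.open on Python <= 2.5
    # or io.open on Python > 2.5 and <= 2.7
    with open(filename, encoding='UTF-8') as f:
        filedata = f.read()
except UnicodeDecodeError:
    # Not valid UTF-8: fall back to the single-byte encoding you expect
    with open(filename, encoding='other-single-byte-encoding') as f:
        filedata = f.read()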
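If all you need is a yes/no answer, the same idea can be sketched as a strict decode of the raw bytes. This is only a minimal sketch for Python 3; the helper names `is_utf8` and `file_is_utf8` are made up for illustration and are not part of any library.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path) -> bool:
    """Hypothetical helper: read the whole file as bytes and validate it."""
    with open(path, "rb") as f:
        return is_utf8(f.read())


# Pure ASCII is valid UTF-8, so it cannot be told apart from latin-1.
ascii_ok = is_utf8(b"plain ascii")

# The same text gives different answers depending on how it was encoded.
utf8_ok = is_utf8("café".encode("utf-8"))
latin1_ok = is_utf8("café".encode("latin-1"))

# A false positive is possible: these two bytes are valid in both encodings.
ambiguous = b"\xc3\xa9"
as_latin1 = ambiguous.decode("latin-1")
```

Note that `file_is_utf8` reads the whole file into memory; for very large files you may prefer to decode incrementally.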

Your best bet is to use the chardet package from PyPI, either directly or through UnicodeDammit from BeautifulSoup:

chardet 1.0.1

Universal encoding detector

Detects:

  • ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  • Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
  • EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
  • EUC-KR, ISO-2022-KR (Korean)
  • KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
  • ISO-8859-2, windows-1250 (Hungarian)
  • ISO-8859-5, windows-1251 (Bulgarian)
  • windows-1252 (English)
  • ISO-8859-7, windows-1253 (Greek)
  • ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
  • TIS-620 (Thai)

Requires Python 2.1 or later

However, some files will be valid in multiple encodings, so chardet is not a panacea.
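For reference, using chardet directly might look like the sketch below. `chardet.detect` takes bytes and returns a dict that includes `encoding` and `confidence` keys; the wrapper name `guess_encoding` and the fallback for a missing install are assumptions added here for illustration.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # assumption: degrade gracefully if it is not installed


def guess_encoding(data: bytes):
    """Hypothetical wrapper: return (encoding, confidence) for raw bytes.

    The result is a statistical guess, not a guarantee.
    """
    if chardet is None:
        return None, 0.0
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]


sample = "Ein Text mit Umlauten: äöü und ß.".encode("utf-8")
encoding, confidence = guess_encoding(sample)
```

Treat a low confidence value as "unknown" rather than acting on the guess.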

Reliably? No.

In general, a byte sequence has no meaning unless you know how to interpret it. This goes for text files, but also for integers, floating-point numbers, etc.

However, there are ways of guessing a file's encoding: look at the byte order mark (if there is one) and at the first chunk of the file (to see which encoding yields the most sensible characters). The chardet library is pretty good at this, but be aware that it is only a heuristic, albeit a rather powerful one.
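The byte order mark check is easy to do by hand with the BOM constants in the standard `codecs` module. The following is a rough sketch; the function name `sniff_bom` and the choice of returned codec names are assumptions made for this example.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the UTF-32 marks must be tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head: bytes):
    """Return a codec name if the bytes start with a known BOM, else None.

    The returned codecs consume the BOM themselves when decoding.
    """
    for bom, codec_name in _BOMS:
        if head.startswith(bom):
            return codec_name
    return None


with_bom = sniff_bom(codecs.BOM_UTF8 + b"abc")
without_bom = sniff_bom(b"abc")
```

Most UTF-8 files have no BOM at all, so a `None` result says nothing either way; it only means you have to fall back to validating or guessing from the content.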


