
Reading Unicode file data with BOM chars in Python

I'm reading a series of source code files using Python and running into a unicode BOM error. Here's my code:

import os
import chardet

bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
result = chardet.detect(raw)
encoding = result['encoding']

infile = open(filename, mode, encoding=encoding)  # mode is 'r' here
data = infile.read()
infile.close()

print(data)

As you can see, I'm detecting the encoding using chardet, then reading the file into memory and attempting to print it. The print statement fails on Unicode files containing a BOM with the error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2:
character maps to <undefined>

I'm guessing it's trying to decode the BOM using the default character set and failing. How do I remove the BOM from the string to prevent this?

There is no reason to check whether a BOM exists or not: utf-8-sig manages that for you and behaves exactly like utf-8 if the BOM does not exist:

# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'

# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'

In the example above, you can see that utf-8-sig correctly decodes the given string regardless of whether a BOM is present. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig and don't worry about it.
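The same applies when reading files: passing `encoding='utf-8-sig'` to `open()` consumes a leading BOM if present and is a no-op otherwise. A minimal sketch (the temporary file is only there to make the example self-contained):

```python
import os
import tempfile

# Write a UTF-8 file that starts with a BOM, then read it back with
# utf-8-sig, which strips the BOM transparently.
with tempfile.NamedTemporaryFile(mode='wb', delete=False) as f:
    f.write(b'\xef\xbb\xbfhello')
    path = f.name

with open(path, encoding='utf-8-sig') as f:
    text = f.read()

print(repr(text))  # 'hello' -- no U+FEFF at the start
os.remove(path)
```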

BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this:

import io
import os
import chardet
import codecs

bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)

if raw.startswith(codecs.BOM_UTF8):
    encoding = 'utf-8-sig'
else:
    result = chardet.detect(raw)
    encoding = result['encoding']

infile = io.open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()

print(data)

I've composed a nifty BOM-based detector based on Chewie's answer. It's sufficient in the common use case where data can be either in a known local encoding or in Unicode with a BOM (which is what text editors typically produce). More importantly, unlike chardet, it doesn't do any guessing, so it gives predictable results:

import codecs

def detect_by_bom(path, default):
    with open(path, 'rb') as f:
        raw = f.read(4)    # will read less if the file is smaller
    # BOM_UTF32_LE's start is equal to BOM_UTF16_LE so need to try the former first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default
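The ordering in the loop above matters because the UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM, so testing UTF-16 first would misclassify UTF-32-LE files. A quick check of the codec constants shows why:

```python
import codecs

# BOM_UTF16_LE is b'\xff\xfe'; BOM_UTF32_LE is b'\xff\xfe\x00\x00'.
# A UTF-32-LE BOM therefore also startswith() the UTF-16-LE BOM,
# which is why UTF-32 must be tested first.
print(codecs.BOM_UTF16_LE)                                  # b'\xff\xfe'
print(codecs.BOM_UTF32_LE)                                  # b'\xff\xfe\x00\x00'
print(codecs.BOM_UTF32_LE.startswith(codecs.BOM_UTF16_LE))  # True
```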

chardet detects BOM_UTF8 automatically since version 2.3.0, released on Oct 7, 2014:

#!/usr/bin/env python
import chardet # $ pip install chardet

# detect file encoding
with open(filename, 'rb') as file:
    raw = file.read(32) # at most 32 bytes are returned
    encoding = chardet.detect(raw)['encoding']

with open(filename, encoding=encoding) as file:
    text = file.read()
print(text)

Note: chardet may return 'UTF-XXLE' or 'UTF-XXBE' encodings, which leave the BOM in the text. The 'LE'/'BE' suffix should be stripped to avoid this -- though at that point it is easier to detect the BOM yourself, e.g., as in @ivan_pozdeev's answer.
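One way to do that stripping is to map the endian-specific names onto the generic codecs, which consume the BOM during decoding. This is a sketch, not an exhaustive mapping of everything chardet can return, and it assumes a BOM is actually present (the generic 'utf-16'/'utf-32' codecs rely on it to pick the byte order):

```python
def normalize_encoding(encoding):
    """Map endian-specific UTF names (which leave the BOM in the
    decoded text) to the generic codecs that strip it."""
    if encoding is None:
        return 'utf-8'  # chardet returns None when detection fails
    enc = encoding.lower().replace('_', '-')
    if enc in ('utf-16le', 'utf-16be'):
        return 'utf-16'   # generic codec consumes the BOM
    if enc in ('utf-32le', 'utf-32be'):
        return 'utf-32'
    return encoding       # anything else passes through unchanged

print(normalize_encoding('UTF-16LE'))  # utf-16
```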

To avoid UnicodeEncodeError while printing Unicode text to the Windows console, see Python, Unicode, and the Windows console.

I find the other answers overly complex. There is a simpler way that doesn't need to drop down into the lower-level idiom of binary file I/O, doesn't rely on a character-set heuristic (chardet) that's not part of the Python standard library, and doesn't need a rarely-seen alternate encoding signature (utf-8-sig vs. the common utf-8) that doesn't seem to have an analog in the UTF-16 family.

The simplest approach I've found is dealing with BOM characters in Unicode, and letting the codecs do the heavy lifting. There is only one Unicode byte order mark, so once data is converted to Unicode characters, determining whether it's there and adding or removing it is easy. To read a file with a possible BOM:

BOM = '\ufeff'
with open(filepath, mode='r', encoding='utf-8') as f:
    text = f.read()
    if text.startswith(BOM):
        text = text[1:]

This works with all the interesting UTF codecs (e.g. utf-8, utf-16le, utf-16be, ...), doesn't require extra modules, and doesn't require dropping down into binary file processing or specific codec constants.
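For example, with UTF-16-LE, where the endian-specific codec does not strip the BOM itself, the same post-decode check handles it (a self-contained sketch using a temporary file):

```python
import os
import tempfile

BOM = '\ufeff'

# A UTF-16-LE file written with a BOM: the 'utf-16le' codec decodes
# the BOM as an ordinary U+FEFF character, so we strip it ourselves.
with tempfile.NamedTemporaryFile(mode='wb', delete=False) as f:
    f.write((BOM + 'hello').encode('utf-16le'))
    path = f.name

with open(path, mode='r', encoding='utf-16le') as f:
    text = f.read()
if text.startswith(BOM):
    text = text[1:]

print(repr(text))  # 'hello'
os.remove(path)
```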

To write a BOM:

text_with_BOM = text if text.startswith(BOM) else BOM + text
with open(filepath, mode='w', encoding='utf-16be') as f:
    f.write(text_with_BOM)

This works with any encoding; UTF-16 big-endian is just an example.
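To confirm the BOM actually landed in the file, the raw bytes can be inspected. A sketch assuming UTF-16-BE as in the snippet above (U+FEFF encoded as UTF-16-BE is exactly `codecs.BOM_UTF16_BE`):

```python
import codecs
import os
import tempfile

BOM = '\ufeff'
text = 'hello'

# Prepend the BOM (if not already there) and write in UTF-16-BE.
text_with_BOM = text if text.startswith(BOM) else BOM + text
with tempfile.NamedTemporaryFile(mode='w', encoding='utf-16be',
                                 delete=False) as f:
    f.write(text_with_BOM)
    path = f.name

# The first two raw bytes are now the UTF-16-BE BOM, b'\xfe\xff'.
with open(path, 'rb') as f:
    raw = f.read()
print(raw[:2] == codecs.BOM_UTF16_BE)  # True
os.remove(path)
```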

This is not, by the way, to dismiss chardet. It can help when you have no information about what encoding a file uses. It's just not needed for adding or removing BOMs.

A variant of @ivan_pozdeev's answer for strings/exceptions (rather than files). I'm dealing with Unicode HTML content that was stuffed into a Python exception (see http://bugs.python.org/issue2517):

import codecs

def detect_encoding(bytes_str):
  # Check UTF-32 before UTF-16: BOM_UTF32_LE starts with BOM_UTF16_LE
  for enc, boms in \
      ('utf-8-sig', (codecs.BOM_UTF8,)), \
      ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
      ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
    if any(bytes_str.startswith(bom) for bom in boms):
      return enc
  return 'utf-8'  # default

def safe_exc_to_str(exc):
  # Python 2: str() raises UnicodeEncodeError on non-ASCII content
  try:
    return str(exc)
  except UnicodeEncodeError:
    return unicode(exc).encode(detect_encoding(exc.content))
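The snippet above is Python 2 (`unicode` no longer exists in Python 3, and `str(exc)` no longer raises for non-ASCII content there). A hedged Python 3 adaptation of the BOM-sniffing part, with UTF-32 checked before UTF-16 since `BOM_UTF32_LE` begins with `BOM_UTF16_LE`:

```python
import codecs

def detect_encoding(bytes_str):
    # Check UTF-32 before UTF-16: BOM_UTF32_LE starts with BOM_UTF16_LE.
    for enc, boms in (('utf-8-sig', (codecs.BOM_UTF8,)),
                      ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)),
                      ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))):
        if any(bytes_str.startswith(bom) for bom in boms):
            return enc
    return 'utf-8'  # default when no BOM is found

data = codecs.BOM_UTF8 + 'héllo'.encode('utf-8')
enc = detect_encoding(data)
print(enc)               # utf-8-sig
print(data.decode(enc))  # héllo
```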

Alternatively, this much simpler code can delete non-ASCII characters without much fuss:

def just_ascii(s):
  return unicode(s).encode('ascii', 'ignore')

In case you want to edit the file, you will want to know which BOM was used. This version of @ivan_pozdeev's answer returns both the encoding and the optional BOM:

import codecs
from typing import Optional, Tuple

def encoding_by_bom(path, default='utf-8') -> Tuple[str, Optional[bytes]]:
    """Adapted from https://stackoverflow.com/questions/13590749/reading-unicode-file-data-with-bom-chars-in-python/24370596#24370596 """
    with open(path, 'rb') as f:
        raw = f.read(4)    # will read less if the file is smaller
    # BOM_UTF32_LE's start is equal to BOM_UTF16_LE so need to try the former first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        for bom in boms:
            if raw.startswith(bom):
                return enc, bom
    return default, None
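A hypothetical round-trip with the function above: detect the encoding and BOM, read, edit, then write the same BOM back so the file keeps its signature. The function is repeated here (in condensed form) only so the sketch runs standalone:

```python
import codecs
import os
import tempfile
from typing import Optional, Tuple

def encoding_by_bom(path, default='utf-8') -> Tuple[str, Optional[bytes]]:
    # Same logic as above, repeated so this example is self-contained.
    with open(path, 'rb') as f:
        raw = f.read(4)
    for enc, boms in (('utf-8-sig', (codecs.BOM_UTF8,)),
                      ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)),
                      ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))):
        for bom in boms:
            if raw.startswith(bom):
                return enc, bom
    return default, None

# Round-trip: write a UTF-8 file with a BOM, detect it, edit it,
# and write the original BOM bytes back in front of the new text.
with tempfile.NamedTemporaryFile(mode='wb', delete=False) as f:
    f.write(codecs.BOM_UTF8 + 'hello'.encode('utf-8'))
    path = f.name

enc, bom = encoding_by_bom(path)
print(enc, bom)  # utf-8-sig b'\xef\xbb\xbf'

with open(path, encoding=enc) as f:
    text = f.read()          # BOM already stripped by utf-8-sig
with open(path, 'wb') as f:
    f.write((bom or b'') + (text + ' world').encode('utf-8'))
os.remove(path)
```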
