
Reading Unicode file data with BOM chars in Python

I'm reading a series of source code files using Python and running into a unicode BOM error. Here's my code:

bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
result = chardet.detect(raw)
encoding = result['encoding']

infile = open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()

print(data)


As you can see, I'm detecting the encoding using chardet, then reading the file into memory and attempting to print it. The print statement fails on Unicode files containing a BOM with the error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2:
character maps to <undefined>

I'm guessing it's trying to decode the BOM using the default character set and it's failing. How do I remove the BOM from the string to prevent this?

There is no reason to check whether a BOM exists: utf-8-sig handles that for you and behaves exactly like utf-8 if the BOM is absent:

# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'

# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'

In the example above, you can see that utf-8-sig correctly decodes the given string whether or not a BOM is present. If you think there is even a small chance that a BOM might exist in the files you are reading, just use utf-8-sig and stop worrying about it.
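For files rather than byte strings, the same codec can be passed straight to open(). This is a minimal sketch, assuming the file is UTF-8 with or without a BOM; read_utf8_text and demo are names made up for illustration:

```python
import os
import tempfile

def read_utf8_text(path):
    """Read a UTF-8 file, dropping a leading BOM if one is present."""
    # 'utf-8-sig' skips a leading BOM and otherwise decodes like 'utf-8'.
    with open(path, encoding="utf-8-sig") as f:
        return f.read()

def demo():
    """Write one file with a BOM and one without, then read both back."""
    results = []
    for payload in (b"\xef\xbb\xbfhello", b"hello"):
        fd, path = tempfile.mkstemp()
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
            results.append(read_utf8_text(path))
        finally:
            os.remove(path)
    return results
```

Both files should come back as the same text, with no U+FEFF at the front.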

BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this:

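import io
import os
import chardet
import codecs

# Read at most the first 32 bytes for detection
size = min(32, os.path.getsize(filename))
with open(filename, 'rb') as f:
    raw = f.read(size)

if raw.startswith(codecs.BOM_UTF8):
    # utf-8-sig strips the BOM while decoding
    encoding = 'utf-8-sig'
else:
    result = chardet.detect(raw)
    encoding = result['encoding']

# filename and mode are the same variables as in the question
infile = io.open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()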


I've written a nifty BOM-based detector building on Chewie's answer. It's sufficient in the common use case where data can be either in a known local encoding or in Unicode with a BOM (that's what text editors typically produce). More importantly, unlike chardet, it doesn't do any guessing, so it gives predictable results:

import codecs

def detect_by_bom(path, default):
    with open(path, 'rb') as f:
        raw = f.read(4)    # will read less if the file is smaller
    # BOM_UTF32_LE's start is equal to BOM_UTF16_LE so need to try the former first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default
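The ordering comment in the function can be checked directly against the constants in the codecs module. A short sketch; the variable names are only for illustration:

```python
import codecs

# The UTF-32 LE BOM is the UTF-16 LE BOM followed by two zero bytes,
# so a UTF-32 LE file also "starts with" the UTF-16 LE BOM.
utf32_le_has_utf16_le_prefix = codecs.BOM_UTF32_LE.startswith(codecs.BOM_UTF16_LE)

# The big-endian BOMs do not overlap in this way.
utf32_be_has_utf16_be_prefix = codecs.BOM_UTF32_BE.startswith(codecs.BOM_UTF16_BE)
```

One consequence of testing UTF-32 first is that a UTF-16 LE file whose first character after the BOM is U+0000 would be taken for UTF-32. That should be rare in text files.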

chardet detects BOM_UTF8 automatically since version 2.3.0, released on Oct 7, 2014:

#!/usr/bin/env python
import chardet # $ pip install chardet

# detect file encoding
with open(filename, 'rb') as file:
    raw = file.read(32) # at most 32 bytes are returned
    encoding = chardet.detect(raw)['encoding']

with open(filename, encoding=encoding) as file:
    text = file.read()

Note: chardet may return 'UTF-XXLE' or 'UTF-XXBE' encodings, which leave the BOM in the text. The 'LE' or 'BE' suffix should be stripped to avoid this, though at that point it is easier to detect the BOM yourself, e.g. as in @ivan_pozdeev's answer.
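The difference between the endian-specific codec names and the generic ones shows up in the standard codecs. A small sketch using a hand-built UTF-16 LE byte string:

```python
import codecs

raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# An endian-specific codec treats the BOM as an ordinary character (U+FEFF).
kept = raw.decode("utf-16-le")

# The generic codec uses the BOM to pick the byte order and then drops it.
stripped = raw.decode("utf-16")
```

Here kept still begins with U+FEFF, while stripped is just the text.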

To avoid UnicodeEncodeError while printing Unicode text to the Windows console, see Python, Unicode, and the Windows console.
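Separately from the BOM, the UnicodeEncodeError in the question is raised when print encodes the text for the console. One possible workaround is sketched below, assuming it is acceptable to replace unencodable characters with escape sequences. console_safe is a hypothetical helper, and cp437 is only an example code page:

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent escaped."""
    # Fall back to ASCII if stdout has no usable encoding attribute.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # 'backslashreplace' turns unencodable characters into \uXXXX escapes
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

escaped = console_safe("\ufeffhello", encoding="cp437")
```

This only hides the symptom; stripping the BOM while decoding is still the cleaner fix.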

I find the other answers overly complex. There is a simpler way that doesn't require dropping down into the lower-level idiom of binary file I/O, doesn't rely on a character set heuristic (chardet) that's not part of the Python standard library, and doesn't need a rarely seen alternate encoding signature (utf-8-sig vs. the common utf-8) that doesn't seem to have an analog in the UTF-16 family.

The simplest approach I've found is to deal with the BOM at the level of Unicode characters and let the codecs do the heavy lifting. There is only one Unicode byte order mark, so once data is converted to Unicode characters, determining whether it's there and adding or removing it is easy. To read a file with a possible BOM:

BOM = '\ufeff'
with open(filepath, mode='r', encoding='utf-8') as f:
    text = f.read()
    if text.startswith(BOM):
        text = text[1:]

This works with all the interesting UTF codecs (e.g. utf-8, utf-16le, utf-16be, ...), doesn't require extra modules, and doesn't require dropping down into binary file processing or specific codec constants.

To write a BOM:

text_with_BOM = text if text.startswith(BOM) else BOM + text
with open(filepath, mode='w', encoding='utf-16be') as f:
    f.write(text_with_BOM)

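This works with any UTF encoding that does not add a BOM of its own (the generic utf-16, utf-32 and utf-8-sig codecs already write one). UTF-16 big endian is just an example.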
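The read and write halves can be checked against each other without touching the disk. A sketch, assuming the endian-specific codecs listed are the ones of interest; strip_bom and add_bom are illustrative names:

```python
BOM = "\ufeff"

def strip_bom(text):
    """Drop a single leading U+FEFF, if present."""
    return text[1:] if text.startswith(BOM) else text

def add_bom(text):
    """Prepend U+FEFF unless the text already starts with it."""
    return text if text.startswith(BOM) else BOM + text

# For each codec: encode text with a BOM, decode it again, and record
# whether the BOM survived decoding and what is left after stripping it.
round_trips = {}
for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    decoded = add_bom("hello").encode(enc).decode(enc)
    round_trips[enc] = (decoded.startswith(BOM), strip_bom(decoded))
```

For every codec in the list the BOM survives decoding as a character and is removed by the one-character slice.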

This is not, btw, to dismiss chardet. It can help when you have no information about what encoding a file uses. It's just not needed for adding or removing BOMs.

A variant of @ivan_pozdeev's answer for strings/exceptions (rather than files). I'm dealing with Unicode HTML content that was stuffed into a Python exception (see http://bugs.python.org/issue2517).

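import codecs

# Python 2 code: relies on the built-in unicode type
def detect_encoding(bytes_str):
  # UTF-32 is checked before UTF-16 because BOM_UTF32_LE starts with BOM_UTF16_LE
  for enc, boms in \
      ('utf-8-sig', (codecs.BOM_UTF8,)), \
      ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
      ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
    if any(bytes_str.startswith(bom) for bom in boms): return enc
  return 'utf-8' # default

def safe_exc_to_str(exc):
  try:
    return str(exc)
  except UnicodeEncodeError:
    return unicode(exc).encode(detect_encoding(exc.content))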

Alternatively, this much simpler code deletes non-ASCII characters without much fuss:

# Python 2 only: relies on the built-in unicode type
def just_ascii(s):
  return unicode(s).encode('ascii', 'ignore')

If you want to edit the file, you will need to know which BOM was used. This version of @ivan_pozdeev's answer returns both the encoding and the BOM (or None if there is none):

import codecs
from typing import Optional, Tuple

def encoding_by_bom(path, default='utf-8') -> Tuple[str, Optional[bytes]]:
    """Adapted from https://stackoverflow.com/questions/13590749/reading-unicode-file-data-with-bom-chars-in-python/24370596#24370596 """

    with open(path, 'rb') as f:
        raw = f.read(4)    # will read less if the file is smaller
    # BOM_UTF32_LE's start is equal to BOM_UTF16_LE so need to try the former first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        for bom in boms:
            if raw.startswith(bom):
                return enc, bom
    return default, None
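Once the BOM is known, one way to write edited text back with the same byte order is to encode with an endian-specific codec and prepend the original BOM by hand. This is only a sketch: encode_preserving_bom and the _CODEC_FOR_BOM table are hypothetical names, not part of the answer above:

```python
import codecs

# Map each BOM to the codec that encodes the text following it.
# These codecs do not emit a BOM of their own.
_CODEC_FOR_BOM = {
    codecs.BOM_UTF8: "utf-8",
    codecs.BOM_UTF16_LE: "utf-16-le",
    codecs.BOM_UTF16_BE: "utf-16-be",
    codecs.BOM_UTF32_LE: "utf-32-le",
    codecs.BOM_UTF32_BE: "utf-32-be",
}

def encode_preserving_bom(text, bom, default="utf-8"):
    """Encode text so the output starts with the same BOM the file had."""
    if bom is None:
        # No BOM was found, so fall back to the default encoding.
        return text.encode(default)
    return bom + text.encode(_CODEC_FOR_BOM[bom])

sample = encode_preserving_bom("hi", codecs.BOM_UTF16_BE)
```

The resulting bytes would then be written with the file opened in 'wb' mode, so that no further encoding or BOM handling takes place.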

