I'm reading a series of source code files using Python and running into a unicode BOM error. Here's my code:
bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
result = chardet.detect(raw)
encoding = result['encoding']
infile = open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()
print(data)
As you can see, I'm detecting the encoding using chardet
, then reading the file in memory and attempting to print it. The print statement fails on Unicode files containing a BOM with the error:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2:
character maps to <undefined>
I'm guessing it's trying to decode the BOM using the default character set and it's failing. How do I remove the BOM from the string to prevent this?
There is no reason to check if a BOM exists or not, utf-8-sig
manages that for you and behaves exactly as utf-8
if the BOM does not exist:
# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
In the example above, you can see utf-8-sig
correctly decodes the given string regardless of the existence of BOM. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig
and not worry about it
BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig
encoding. You could try something like this:
import io
import chardet
import codecs
bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
if raw.startswith(codecs.BOM_UTF8):
encoding = 'utf-8-sig'
else:
result = chardet.detect(raw)
encoding = result['encoding']
infile = io.open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()
print(data)
I've composed a nifty BOM-based detector based on Chewie's answer. It's sufficient in the common use case where data can be either in a known local encoding or Unicode with BOM (that's what text editors typically produce). More importantly, unlike chardet
, it doesn't do any random guessing, so it gives predictable results:
def detect_by_bom(path, default):
with open(path, 'rb') as f:
raw = f.read(4) # will read less if the file is smaller
# BOM_UTF32_LE's start is equal to BOM_UTF16_LE so need to try the former first
for enc, boms in \
('utf-8-sig', (codecs.BOM_UTF8,)), \
('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
if any(raw.startswith(bom) for bom in boms):
return enc
return default
chardet
detects BOM_UTF8 automatically since 2.3.0 version released on Oct 7, 2014 :
#!/usr/bin/env python
import chardet # $ pip install chardet
# detect file encoding
with open(filename, 'rb') as file:
raw = file.read(32) # at most 32 bytes are returned
encoding = chardet.detect(raw)['encoding']
with open(filename, encoding=encoding) as file:
text = file.read()
print(text)
Note: chardet
may return 'UTF-XXLE'
, 'UTF-XXBE'
encodings that leave the BOM in the text. 'LE'
, 'BE'
should be stripped to avoid it -- though it is easier to detect BOM yourself at this point eg, as in @ivan_pozdeev's answer .
To avoid UnicodeEncodeError
while printing Unicode text to Windows console, see Python, Unicode, and the Windows console .
I find the other answers overly complex. There is a simpler way that doesn't need dropping down into the lower-level idiom of binary file I/O, doesn't rely on a character set heuristic ( chardet
) that's not part of the Python standard library, and doesn't need a rarely-seen alternate encoding signature ( utf-8-sig
vs. the common utf-8
) that doesn't seem to have an analog in the UTF-16 family.
The simplest approach I've found is dealing with BOM characters in Unicode, and letting the codecs do the heavy lifting. There is only one Unicode byte order mark , so once data is converted to Unicode characters, determining if it's there and/or adding/removing it is easy. To read a file with a possible BOM:
BOM = '\ufeff'
with open(filepath, mode='r', encoding='utf-8') as f:
text = f.read()
if text.startswith(BOM):
text = text[1:]
This works with all the interesting UTF codecs (eg utf-8
, utf-16le
, utf-16be
, ...), doesn't require extra modules, and doesn't require dropping down into binary file processing or specific codec
constants.
To write a BOM:
text_with_BOM = text if text.startswith(BOM) else BOM + text
with open(filepath, mode='w', encoding='utf-16be') as f:
f.write(text_with_BOM)
This works with any encoding. UTF-16 big endian is just an example.
This is not, btw, to dismiss chardet
. It can help when you have no information what encoding a file uses. It's just not needed for adding / removing BOMs.
A variant of @ivan_pozdeev's answer for strings/exceptions (rather than files). I'm dealing with unicode HTML content that was stuffed in a python exception (see http://bugs.python.org/issue2517 )
def detect_encoding(bytes_str):
for enc, boms in \
('utf-8-sig',(codecs.BOM_UTF8,)),\
('utf-16',(codecs.BOM_UTF16_LE,codecs.BOM_UTF16_BE)),\
('utf-32',(codecs.BOM_UTF32_LE,codecs.BOM_UTF32_BE)):
if (any(bytes_str.startswith(bom) for bom in boms): return enc
return 'utf-8' # default
def safe_exc_to_str(exc):
try:
return str(exc)
except UnicodeEncodeError:
return unicode(exc).encode(detect_encoding(exc.content))
Alternatively, this much simpler code is able to delete non-ascii characters without much fuss:
def just_ascii(str):
return unicode(str).encode('ascii', 'ignore')
In case you want to edit the file, you will want to know which BOM was used. This version of @ivan_pozdeev answer returns both encoding and optional BOM:
def encoding_by_bom(path, default='utf-8') -> Tuple[str, Optional[bytes]]:
"""Adapted from https://stackoverflow.com/questions/13590749/reading-unicode-file-data-with-bom-chars-in-python/24370596#24370596 """
with open(path, 'rb') as f:
raw = f.read(4) # will read less if the file is smaller
# BOM_UTF32_LE's start is equal to BOM_UTF16_LE so need to try the former first
for enc, boms in \
('utf-8-sig', (codecs.BOM_UTF8,)), \
('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
for bom in boms:
if raw.startswith(bom):
return enc, bom
return default, None
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.