Reading UTF-8 file returns unexpected chars

Question

Running Windows 8 64-bit. I have a file where I store some data, saved with the UTF-8 encoding using Windows notepad. Supposing this is the content of the file:

1,some,data,here,0,-1

I'm reading it like this:

f = open("file.txt", "rb")
f.read()
f.close()

And f.read() returns this:

u"\xef\xbb\xbf1,some,data,here,0,-1"

I can just use f.read()[3:] but that's not a clean solution.

What are those characters at the beginning of the file?

Answer 1

Those first 3 bytes are the UTF-8 BOM, or Byte Order Mark. UTF-8 doesn't need the BOM (it has a fixed byte order unlike UTF-16 and UTF-32), but many tools (mostly Microsoft's) add it anyway to aid in file-encoding detection.

You can test for it and skip it safely; codecs.BOM_UTF8 holds the byte sequence to check for:

import codecs

data = f.read()
if data.startswith(codecs.BOM_UTF8):
    data = data[3:]
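
The check above can be wrapped in a small helper. This is only a sketch: the name strip_utf8_bom and the sample bytes are made up for illustration, and it assumes the data was read in binary mode. It slices by the length of codecs.BOM_UTF8 rather than a hard-coded 3:

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if one is present.

    strip_utf8_bom is a hypothetical helper name; data must be bytes.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Slice by the constant's length instead of a magic number.
        return data[len(codecs.BOM_UTF8):]
    return data


# Sample bytes standing in for the result of f.read() in "rb" mode.
raw = codecs.BOM_UTF8 + b"1,some,data,here,0,-1"
cleaned = strip_utf8_bom(raw)
print(cleaned)
```

Data that does not start with the BOM is returned unchanged, so the helper can be applied to any file read this way.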

You could also use the io.open() function to open the file and have Python decode it to Unicode for you. Tell it to use the utf_8_sig codec, which skips the BOM if one is present:

import io

with io.open('file.txt', encoding='utf_8_sig') as f:
    data = f.read()
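
To see the difference between the two codecs, here is a rough sketch that writes a BOM-prefixed file to a temporary directory (a stand-in for file.txt; the path and variable names are assumptions) and reads it back both ways:

```python
import io
import os
import tempfile

# Hypothetical temporary file standing in for file.txt.
path = os.path.join(tempfile.mkdtemp(), "file.txt")

# Write the same bytes Notepad would produce: BOM followed by the data.
with io.open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf1,some,data,here,0,-1")

# Plain utf_8 keeps the BOM as the character U+FEFF at the start.
with io.open(path, encoding="utf_8") as f:
    plain = f.read()

# utf_8_sig drops the BOM while decoding.
with io.open(path, encoding="utf_8_sig") as f:
    stripped = f.read()

print(repr(plain))
print(repr(stripped))
```

With utf_8 the decoded text starts with U+FEFF; with utf_8_sig it starts directly with the data. utf_8_sig also reads files that have no BOM, so it should be safe to use either way.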

Answer 2

That's the BOM (byte order mark).
UTF-8 has only one valid byte order,
but this 3-byte sequence can still appear
at the beginning of a file (or of data in general).

-> If the first 3 bytes have exactly these values, ignore them.

2 answers

solution1
2 ACCEPTED 2014-03-07 18:35:38

solution2
1 2014-03-07 18:29:33