Reading UTF-8 file returns unexpected chars

I'm running Windows 8 64-bit. I have a file where I store some data, saved with UTF-8 encoding using Windows Notepad. Suppose this is the content of the file:

1,some,data,here,0,-1

I'm reading it like this:

f = open("file.txt", "rb")
f.read()
f.close()

And f.read() returns this:

u"\xef\xbb\xbf1,some,data,here,0,-1"

I can just use f.read()[3:] but that's not a clean solution.

What are those characters at the beginning of the file?

Those first 3 bytes are the UTF-8 BOM, or Byte Order Mark. UTF-8 doesn't need the BOM (it has a fixed byte order unlike UTF-16 and UTF-32), but many tools (mostly Microsoft's) add it anyway to aid in file-encoding detection.
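To make the connection between those three bytes and the byte order mark concrete, here is a minimal sketch, assuming Python 3. It encodes the single code point U+FEFF and compares the result with the constants in the codecs module; the variable names are only illustrative.

```python
import codecs

# The BOM is the single code point U+FEFF; encoded as UTF-8 it becomes three bytes.
bom_bytes = "\ufeff".encode("utf-8")
matches_constant = bom_bytes == codecs.BOM_UTF8

# In UTF-16 the same code point reveals the byte order, which is its original purpose:
# the two encodings produce the same two bytes in opposite order.
bom_le = "\ufeff".encode("utf-16-le")
bom_be = "\ufeff".encode("utf-16-be")

print(repr(bom_bytes), matches_constant)
print(repr(bom_le), repr(bom_be))
```

In UTF-8 the byte sequence is always the same, so the mark carries no byte-order information and only serves as an encoding signature.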

You can test for it and skip it safely; use codecs.BOM_UTF8 to detect it:

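import codecs

# Read the raw bytes, then drop the BOM only if the file starts with one.
with open("file.txt", "rb") as f:
    data = f.read()
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]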

You could also use the io.open() function to open the file and have Python decode the file for you to Unicode, and tell it to use the utf_8_sig codec:

import io

with io.open('file.txt', encoding='utf_8_sig') as f:
    data = f.read()
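The difference between the plain utf-8 codec and utf-8-sig can be checked with a small round trip. This is a sketch assuming Python 3; the temporary file and its content stand in for the Notepad-saved file from the question, and the variable names are hypothetical.

```python
import codecs
import io
import os
import tempfile

# Hypothetical sample: the line from the question, preceded by a BOM
# the way Notepad would write it.
raw = codecs.BOM_UTF8 + b"1,some,data,here,0,-1"

fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(raw)

try:
    # Plain UTF-8 decoding keeps the BOM as a leading U+FEFF character.
    with io.open(path, encoding="utf-8") as f:
        plain = f.read()
    # utf-8-sig drops a leading BOM if one is present.
    with io.open(path, encoding="utf-8-sig") as f:
        clean = f.read()
finally:
    os.remove(path)

fields = clean.split(",")
print(repr(plain))
print(repr(clean))
```

Because utf-8-sig only skips the BOM when it is present, the same call should also work for files that were saved without one.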

That's the BOM (byte order mark).
UTF-8 has only one valid byte order,
but this 3-byte sequence may still appear
at the beginning of a file (or of data in general).

-> If the first 3 bytes are exactly these values, ignore them.
