I have a text file which is UTF-8 encoded with the byte order mark present - that is, the first few bytes are EF BB BF 0D 0A 4D...
(it's a Visual Studio solution file produced by VS 2013).
I'm trying to parse this with PyParsing, using the parseFile() method and Python 3. In Python 2, I could do this:
import pyparsing as pp
bom = pp.Optional(unicode(unichr(0xfeff)).encode('utf-8')).suppress()
to get an optional byte order mark. But in Python 3, the unicode and unichr functions have gone away because all strings are Unicode. So I tried this:
bom = pp.Optional(chr(0xfeff)).suppress()
and this:
bom = pp.Optional('\ufeff').suppress()
but neither matches the start of the file. I've googled for a while but can't seem to turn up anything relevant.
How can I match (or just ignore!) the Unicode byte order mark?
It seems that the problem here is that when parseFile() is given a filename, the file is read with the platform's default encoding rather than UTF-8 (often a single-byte code page such as cp1252 on Windows). The UTF-8-encoded byte order mark therefore doesn't end up as the single character U+FEFF; it ends up as three separate characters, one for each of the bytes EF BB BF. To work around this, you can open the file explicitly and specify the encoding. Instead of this:
p.parseFile('filename.sln')
do this:
p.parseFile(open('filename.sln', encoding='utf-8'))
Then the byte order mark can be skipped with the following parser:
bom = pp.Optional(chr(0xfeff)).suppress()
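To see why the encoding matters, here is a minimal sketch using only the standard library. It decodes the first bytes of the file from the question in two ways. The choice of cp1252 is an assumption standing in for whatever single-byte default your platform uses; the variable names are illustrative only.

```python
import codecs

# First bytes of the file from the question: UTF-8 BOM, CR LF, then "M".
raw = codecs.BOM_UTF8 + b"\r\nM"

# Assumed platform default (cp1252): each BOM byte becomes its own character,
# so a parser looking for U+FEFF never sees it.
as_cp1252 = raw.decode("cp1252")

# Decoded as UTF-8, the three BOM bytes collapse into the single character U+FEFF,
# which is what pp.Optional(chr(0xfeff)) expects to match.
as_utf8 = raw.decode("utf-8")

print(len(as_cp1252), as_cp1252.startswith(chr(0xFEFF)))
print(len(as_utf8), as_utf8.startswith(chr(0xFEFF)))
```

Only the UTF-8 decoding yields a string that starts with U+FEFF, which is consistent with the parser above matching once the encoding is given explicitly.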
Open the file using the utf_8_sig encoding:
p.parseFile(open('filename.sln', encoding='utf_8_sig'))
The BOM will be suppressed if it's present.
From the codecs module:
On encoding, a UTF-8 encoded BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this is only done once (on the first write to the byte stream). For decoding, an optional UTF-8 encoded BOM at the start of the data will be skipped.
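The decoding behaviour described in that passage can be checked with a short standard-library sketch. The helper read_text and the sample file contents are hypothetical, made up for illustration; only open(), the codecs module, and the utf_8_sig codec come from Python itself.

```python
import codecs
import os
import tempfile


def read_text(data: bytes, encoding: str) -> str:
    """Write data to a temporary file, then read it back with the given encoding."""
    fd, path = tempfile.mkstemp(suffix=".sln")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # newline="" keeps the CR LF line endings exactly as written.
        with open(path, encoding=encoding, newline="") as f:
            return f.read()
    finally:
        os.remove(path)


# Hypothetical sample content; a real solution file would be longer.
body = b"\r\nMicrosoft Visual Studio Solution File\r\n"

with_bom = read_text(codecs.BOM_UTF8 + body, "utf_8_sig")   # BOM present, skipped
without_bom = read_text(body, "utf_8_sig")                   # no BOM, nothing to skip
kept_bom = read_text(codecs.BOM_UTF8 + body, "utf-8")        # plain UTF-8 keeps U+FEFF

print(with_bom == without_bom)
print(kept_bom.startswith("\ufeff"))
```

With utf_8_sig the text is the same whether or not the file starts with a BOM, so the grammar needs no BOM rule at all; with plain utf-8 the U+FEFF character stays in the text and has to be matched or skipped by the parser.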