
Pyparsing, Python 3 and the Unicode byte order mark

I have a text file which is UTF-8 encoded with the byte order mark present - that is, the first few bytes are EF BB BF 0D 0A 4D... (it's a Visual Studio solution file produced by VS 2013).

I'm trying to parse this with PyParsing, using the parseFile() method and Python 3. In Python 2, I could do this:

import pyparsing as pp
bom = pp.Optional(unicode(unichr(0xfeff)).encode('utf-8')).suppress()

to get an optional byte order mark. But in Python 3, the unicode and unichr functions have gone away because all strings are Unicode. So I tried this:

bom = pp.Optional(chr(0xfeff)).suppress()

and this:

bom = pp.Optional('\ufeff').suppress()

but neither matches the start of the file. I've googled for a while but can't seem to turn up anything relevant.

How can I match (or just ignore!) the Unicode byte order mark?

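It seems that the problem here is that parseFile(), when given a filename, opens the file without specifying an encoding, so Python falls back to the platform's default encoding (often a legacy code page such as cp1252 on Windows) rather than UTF-8. The UTF-8-encoded byte order mark therefore doesn't end up as the single character U+FEFF; the bytes EF BB BF are each decoded as a separate character. To work around this, you can open the file explicitly and specify the encoding. Instead of this: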

p.parseFile('filename.sln')

do this:

p.parseFile(open('filename.sln', encoding='utf-8'))

Then the byte order mark can be skipped with the following parser:

bom = pp.Optional(chr(0xfeff)).suppress()
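To see why the encoding matters, here is a minimal sketch using only the standard library. It decodes the first bytes quoted in the question two ways; cp1252 is an assumed stand-in for whatever the platform default happens to be, and the variable names are illustrative only.

```python
# The first bytes of the file from the question: UTF-8 BOM, CR LF, then "M".
raw = bytes([0xEF, 0xBB, 0xBF, 0x0D, 0x0A, 0x4D])

# Decoded as UTF-8, the three BOM bytes become the single character U+FEFF.
as_utf8 = raw.decode("utf-8")

# Decoded with a legacy single-byte code page (cp1252 is an assumption here),
# the same three bytes become three unrelated characters.
as_cp1252 = raw.decode("cp1252")

# Only the UTF-8 decoding starts with the character that the
# pp.Optional(chr(0xfeff)) element is looking for.
print(as_utf8.startswith(chr(0xFEFF)))
print(as_cp1252.startswith(chr(0xFEFF)))
print(len(as_utf8), len(as_cp1252))
```

With the wrong encoding the text handed to the parser simply never contains U+FEFF, which would explain why neither chr(0xfeff) nor '\ufeff' matched.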

Open the file using the utf_8_sig encoding:

p.parseFile(open('filename.sln', encoding='utf_8_sig'))

The BOM will be skipped during decoding if it's present, so the parser never sees it and no bom element is needed in the grammar.

From the codecs module documentation:

On encoding a UTF-8 encoded BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this is only done once (on the first write to the byte stream). For decoding an optional UTF-8 encoded BOM at the start of the data will be skipped.
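The behaviour described in that quote can be checked with a small round trip through a temporary file. This is only a sketch: the file name sample.sln and its contents are made up for illustration and are not taken from a real solution file.

```python
import os
import tempfile

# Hypothetical sample content standing in for a real .sln file.
text = "\r\nMicrosoft Visual Studio Solution File\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.sln")

    # Writing with utf_8_sig prepends the BOM bytes EF BB BF.
    # newline="" keeps the CR LF pairs exactly as written.
    with open(path, "w", encoding="utf_8_sig", newline="") as f:
        f.write(text)

    with open(path, "rb") as f:
        raw = f.read()

    # Reading with utf_8_sig skips the BOM; plain utf-8 keeps it as U+FEFF.
    with open(path, encoding="utf_8_sig", newline="") as f:
        without_bom = f.read()
    with open(path, encoding="utf-8", newline="") as f:
        with_bom = f.read()

# utf_8_sig also decodes data that has no BOM at all, so it is safe
# to use when the mark is only sometimes present.
no_bom_input = "plain".encode("utf-8").decode("utf_8_sig")

print(raw[:3].hex())
print(without_bom == text)
print(with_bom == chr(0xFEFF) + text)
```

Because the decoder treats the BOM as optional, the same open() call works for files saved with or without it.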
