
How to effectively slice a UTF-8 encoded file

I'm having trouble slicing a UTF-8 encoded file. After opening it with codecs, slicing a portion becomes difficult because the byte order mark (BOM) at the beginning of the file causes a shift.

See details of my attempts below.

import codecs

def readfiles(filepaf):
    with codecs.open(filepaf,'r', 'utf-8') as f:
        g=f.read()
        q=' '.join(g.split())
        return q

q=readfiles('c:xxx')

q=Katharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shutting of a door...

>>> q[0:100]
u'\ufeffKatharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin'


>>> q[0:100].encode('utf-8')
'\xef\xbb\xbfKatharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin'

The only accurate result comes by directly printing a sliced portion, but my program makes use of sliced portions rather than printing, and most often the sliced portions are inaccurate due to the shift at the beginning.

Ideal output

Katharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin

Any suggestions on how to slice without having BOM characters at the beginning?

Discard bytes that start with bits 10 from the beginning of the slice until you find a byte that doesn't. That one will start a new character. You'll have to skip at most 3 bytes.
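As a rough sketch of that idea (the helper name `skip_continuation_bytes` and the sample string are made up for illustration), a continuation byte can be detected by masking its top two bits. This only repairs the start of a byte slice; it does not remove a BOM, whose first byte 0xEF is a lead byte rather than a continuation byte.

```python
def skip_continuation_bytes(chunk):
    """Drop leading UTF-8 continuation bytes (bit pattern 10xxxxxx) from a byte slice."""
    raw = bytearray(chunk)  # indexing a bytearray yields ints on Python 2 and 3
    start = 0
    # A continuation byte has its top two bits set to 10: (byte & 0xC0) == 0x80.
    # A UTF-8 character has at most 3 continuation bytes, so stop after 3.
    while start < len(raw) and start < 3 and (raw[start] & 0xC0) == 0x80:
        start += 1
    return bytes(raw[start:])


data = u"caf\u00e9 au lait".encode("utf-8")  # the e-acute is two bytes: 0xC3 0xA9
broken = data[4:]                             # slice begins in the middle of that character
fixed = skip_continuation_bytes(broken)
print(fixed.decode("utf-8"))
```

The end of a byte slice can be cut mid-character in the same way, so a full solution would need a similar check there.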

Alternatively, you can slice the Unicode string; that will not give you broken characters.
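A small illustration of the difference, using a made-up sample string: slicing the encoded bytes can cut a multi-byte character in half, while slicing the decoded string works in whole characters.

```python
text = u"na\u00efve caf\u00e9"
data = text.encode("utf-8")

# Slicing the bytes cuts the two-byte i-diaeresis (0xC3 0xAF) in half,
# so the fragment can no longer be decoded.
try:
    data[:3].decode("utf-8")
    byte_slice_ok = True
except UnicodeDecodeError:
    byte_slice_ok = False

# Slicing the decoded string counts characters, so nothing is split.
char_slice = text[:3]
print(byte_slice_ok, char_slice == u"na\u00ef")
```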

Note that \ufeff is a valid character: it's the zero-width no-break space, which some broken text editors insert at the beginning of UTF-8 files to identify them. If you want to skip it, use the utf-8-sig encoding.
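For example, the reader function from the question could be adapted along these lines. This is only a sketch: the name `read_normalized` and the temporary demo file are assumptions for illustration, and `io.open` is used here in place of `codecs.open` so the snippet runs on both Python 2 and 3.

```python
import io
import os
import tempfile


def read_normalized(path):
    """Read a UTF-8 file, drop a leading BOM if present, and collapse whitespace."""
    # 'utf-8-sig' decodes like 'utf-8' but silently skips a BOM at the start.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        return u" ".join(f.read().split())


# Build a small demo file that starts with the UTF-8 BOM (EF BB BF).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbfKatharine opened her lips\nand drew in her breath")

q = read_normalized(demo_path)
os.remove(demo_path)
print(q[0:9])
```

With the BOM gone, slices such as q[0:100] start at the first real character of the text.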


 