How to effectively slice a UTF-8 encoded file

Question

I'm having trouble slicing a UTF-8 encoded file. After opening it with codecs, every slice is shifted by the byte order mark (BOM) character at the beginning of the text.

See details of my attempts below.

import codecs

def readfiles(filepaf):
    with codecs.open(filepaf, 'r', 'utf-8') as f:
        g = f.read()
        # Collapse every run of whitespace into a single space.
        q = ' '.join(g.split())
        return q

q = readfiles('c:xxx')

q=Katharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shutting of a door...

>>> q[0:100]
u'\ufeffKatharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin'


>>> q[0:100].encode('utf-8')
'\xef\xbb\xbfKatharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin'

Only printing a slice directly gives the right result. My program uses the slices themselves rather than printing them, and most of them are off because of the shift at the beginning.

Ideal output

Katharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin

Any suggestions on how to slice without the BOM character at the beginning?

Answer 1

If you slice the encoded byte string, discard bytes that start with the bits 10 from the beginning of the slice until you find a byte that doesn't. That byte starts a new character. You'll have to skip at most 3 bytes.
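The byte-skipping rule above can be sketched roughly as follows. The function name trim_leading_continuation_bytes and the sample string are made up for illustration and are not from the question; this is a minimal sketch, assuming the slice was cut out of valid UTF-8 data.

```python
def trim_leading_continuation_bytes(chunk):
    """Drop leading UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    data = bytearray(chunk)  # bytearray yields ints on Python 2 and 3
    start = 0
    # A UTF-8 character has at most 3 continuation bytes,
    # so no more than 3 bytes ever need to be skipped.
    while start < len(data) and start < 3 and (data[start] & 0xC0) == 0x80:
        start += 1
    return bytes(data[start:])


# Hypothetical sample text: "cafe naive" with two accented letters.
encoded = u"caf\u00e9 na\u00efve".encode("utf-8")

# Byte offset 4 falls inside the two-byte sequence for U+00E9,
# so the raw slice starts with a stray continuation byte.
broken = encoded[4:]
fixed = trim_leading_continuation_bytes(broken)

print(repr(fixed.decode("utf-8")))
```

The same check applied in reverse (walking back from the end of the slice to the last lead byte) would be needed to repair a slice that ends in the middle of a character; that part is not shown here.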

Alternatively, you can slice the Unicode string; that will not give you broken characters.

Note that \ufeff is a valid character: it's the zero width no-break space, which some broken text editors insert at the beginning of UTF-8 files to identify them. If you want to skip it, use the utf-8-sig encoding.
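As a rough illustration of the utf-8-sig suggestion, the sketch below reads a file with that codec and then slices the resulting Unicode string. The helper name read_normalized and the temporary demo file are assumptions made for this example; the whitespace normalization mirrors the readfiles function in the question.

```python
import codecs
import io
import os
import tempfile


def read_normalized(path):
    """Read a UTF-8 file, dropping a leading BOM and collapsing whitespace."""
    # 'utf-8-sig' decodes like 'utf-8' but strips a BOM if one is present.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        return u" ".join(f.read().split())


# Build a small demo file that starts with a UTF-8 BOM (hypothetical content).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"Katharine  opened\nher lips".encode("utf-8"))

text = read_normalized(demo_path)
os.remove(demo_path)

# Slicing the decoded string works on characters, not bytes,
# and the slice no longer starts with u'\ufeff'.
print(repr(text[0:9]))
```

Files without a BOM decode the same way under utf-8-sig, so the codec should be safe to use when it is not known in advance whether the BOM is present.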

solution1
1 2014-04-11 06:19:02
