Parsing UTF-8 XML-Files with Python

Question

i've got an XML-File with contains some german umlauts. My goal is to read in the file and store the results into a database. For testing I got two different files. The first is according to chardet UTF-8-SIG the other one is UTF-8 .

Preprocessing the data is done by unicode(field[0]) after reading the file with lxml

Parsing the first file works fine, but processing the other results in an encoding error: UnicodeEncodeError: 'ascii' codec can't encode characters in position: ordinal not in range(128)

For example such string can be u'Zubeh\\xf6r' ( print(field[0] ).

Using print (field[0].encode("utf-8")) results in the right string, but the type is str instead of unicode

Answer 1

Try

from lxml import etree
parser=etree.HTMLParser(encoding='utf-8')

when you read the data with lxml.

Parsing UTF-8 XML-Files with Python

Question

1 answers

solution1
0 2015-08-25 03:41:25

Parsing UTF-8 XML-Files with Python

Question

1 answers

solution1 0 2015-08-25 03:41:25

solution1
0 2015-08-25 03:41:25