Parsing UTF-8 XML-Files with Python

I've got an XML file which contains some German umlauts. My goal is to read the file and store the results in a database. For testing I have two different files. According to chardet, the first is UTF-8-SIG and the other is UTF-8.

The data is preprocessed with unicode(field[0]) after reading the file with lxml.

Parsing the first file works fine, but processing the other results in an encoding error: UnicodeEncodeError: 'ascii' codec can't encode characters in position: ordinal not in range(128)

For example, such a string can be u'Zubeh\xf6r' (from print(field[0])).

Using print(field[0].encode("utf-8")) prints the right string, but the type is then str instead of unicode.
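To illustrate what the error message is saying, here is a minimal sketch, assuming Python 3 (the question itself uses Python 2, where the same failure is triggered by an implicit ASCII conversion). The variable names are hypothetical; the sample string is the one from the question.

```python
# Sample value from the question: "Zubehör" with the umlaut as U+00F6.
text = u'Zubeh\xf6r'

# Encoding to ASCII fails because U+00F6 is outside the 0-127 range.
try:
    text.encode('ascii')
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__

# Encoding to UTF-8 works and yields bytes (Python 2: str), not text.
utf8_bytes = text.encode('utf-8')

# Decoding those bytes as UTF-8 gives the original text back.
round_trip = utf8_bytes.decode('utf-8')
```

This is why the encode("utf-8") call in the question prints correctly but changes the type: it turns the text into a byte string.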

Try passing an explicit encoding to the parser when you read the data with lxml:

from lxml import etree
parser = etree.HTMLParser(encoding='utf-8')
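As a rough sketch of the same idea applied to XML input, the example below uses etree.XMLParser instead of the HTMLParser from the answer, since the file in the question is XML; that swap is an assumption on my part, not something the answer states. The element names and the sample document are made up for illustration, and the code assumes Python 3. The import falls back to the standard library's ElementTree, which offers the same calls used here, in case lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Standard-library fallback with a compatible API for this example.
    import xml.etree.ElementTree as etree

# Hypothetical sample document, encoded as UTF-8 bytes like a file on disk.
xml_bytes = u'<items><field>Zubeh\xf6r</field></items>'.encode('utf-8')

# Tell the parser explicitly that the input is UTF-8.
parser = etree.XMLParser(encoding='utf-8')
root = etree.fromstring(xml_bytes, parser)

# The element text comes back as decoded text, not as raw bytes.
field_text = root.findtext('field')
```

For a file on disk, etree.parse(path, parser) can take the place of etree.fromstring, with the same parser object.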
