I have an XML file that contains some German umlauts. My goal is to read the file and store the results in a database. For testing I have two different files: according to chardet, the first is UTF-8-SIG and the other is UTF-8.
After reading the file with lxml, I preprocess the data with
unicode(field[0])
Parsing the first file works fine, but processing the other one results in an encoding error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position: ordinal not in range(128)
For example, such a string can be u'Zubeh\\xf6r' (the result of print(field[0])).
Using print(field[0].encode("utf-8")) results in the right string, but the type is str instead of unicode.
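To illustrate the difference between text and encoded bytes described above, here is a minimal sketch using only the standard library. It assumes the sample value is the word "Zubehör"; the variable names are made up for illustration. In Python 2 the text type is unicode and the byte type is str, while in Python 3 they are str and bytes respectively. The last part shows how a UTF-8-SIG byte order mark behaves when decoding, which may or may not be related to the difference between the two files.

```python
import codecs

# Hypothetical sample value: the text "Zubehör" as a unicode string.
text = u'Zubeh\xf6r'

# Encoding to ASCII fails because "ö" has no ASCII representation.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 succeeds, but the result is a byte string, not text.
encoded = text.encode('utf-8')

# Decoding the bytes again gives back the original text.
decoded = encoded.decode('utf-8')

# A UTF-8-SIG file starts with a byte order mark (BOM).
with_bom = codecs.BOM_UTF8 + encoded
kept_bom = with_bom.decode('utf-8')          # BOM survives as u'\ufeff'
stripped_bom = with_bom.decode('utf-8-sig')  # BOM is removed
```

So calling .encode("utf-8") is expected to change the type: it converts text into bytes, and .decode("utf-8") converts it back.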
Try creating the parser with an explicit encoding:
from lxml import etree
parser = etree.HTMLParser(encoding='utf-8')
and pass it in when you read the data with lxml.
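A rough sketch of how such a parser might be used is shown below. Since the input in the question is an XML file, this example swaps in etree.XMLParser, which also accepts an encoding argument, instead of the HTMLParser suggested above. The sample document and the element names items and field are invented for illustration.

```python
from lxml import etree

# Hypothetical UTF-8 encoded XML content standing in for the real file.
xml_bytes = u'<items><field>Zubeh\xf6r</field></items>'.encode('utf-8')

# Tell the parser explicitly which encoding the input bytes use.
parser = etree.XMLParser(encoding='utf-8')

# Parse the bytes with that parser and read the text of one element.
root = etree.fromstring(xml_bytes, parser=parser)
field_text = root.findtext('field')
```

For a file on disk, the same parser can be passed as etree.parse(path, parser), where path is the location of your file.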