GAE Python: Importing UTF-8 Characters from an XML file to a database model

I am parsing an XML file from an online source but am having trouble reading UTF-8 characters. I have read through some of the other questions that deal with a similar problem, but none of the solutions has worked so far. Currently the code looks like this:

class XMLParser(webapp2.RequestHandler):

    def get(self):
        url = fetch('some.xml.online')
        xml = parseString(url.content)
        vouchers = xml.getElementsByTagName("VoucherCode")

        for voucher in vouchers:
            if voucher.getElementsByTagName("ActivePartnership")[0].firstChild.data == "true":
                coupon = Coupon()
                coupon.description = str(voucher.getElementsByTagName("Description")[0].firstChild.data.decode('utf-8'))
                coupon.prov_key = str(voucher.getElementsByTagName("Id")[0].firstChild.data)
                coupon.put()
                self.redirect('/admin/coupon')

The error that I get from this is displayed below. It is caused by a "ü" in the description field, which I will also need to display later on when using the data.

  File "C:\Users\Vincent\Documents\www\Sparkompass\Website\main.py", line 217, in get
    coupon.description = str(voucher.getElementsByTagName("Description")[0].firstChild.data.decode('utf-8'))
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)

If I take out the description everything works as it should. In the database model definition I have defined the description as follows:

description = db.StringProperty(multiline=True)

Attempt 2

I have also tried to do it like this:

coupon.description = str(voucher.getElementsByTagName("Description")[0].firstChild.data).decode('utf-8')

Which also gave me:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)

Any help would be very much appreciated!

UPDATE

The XML file contains German text, so many more of its characters are non-ASCII. Ideally, therefore, I am now thinking that it might be better to do the decoding at a higher level, e.g. at

xml = parseString(url.content)

However, so far I haven't got that to work either. The aim is to get the characters into ASCII because, as far as I understand, this is what GAE requires to register the value as a string in the database model.

>>> u"ü".decode("utf-8")
UnicodeEncodeError
>>> u"ü".encode("utf-8")
'\xc3\xbc'
>>> u"ü".encode("utf-8").decode("utf-8")
u'\xfc'
>>> str(u"ü".encode("utf-8").decode("utf-8"))
UnicodeEncodeError
>>> str(u"ü".encode("utf-8"))
'\xc3\xbc'

Which encoding do you need?
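Applied to the parsing code in the question, the session above suggests that the values coming out of the parser are already decoded text, so wrapping them in str() or calling .decode() on them is what triggers the implicit ASCII encode. Here is a minimal sketch of that idea. The sample XML and the helper names child_text and active_vouchers are made up for illustration, and the datastore calls are left out.

```python
from xml.dom.minidom import parseString

# Hypothetical sample standing in for the fetched feed; \xfc is "ü".
SAMPLE_XML = (
    u'<?xml version="1.0" encoding="utf-8"?>'
    u'<Vouchers><VoucherCode>'
    u'<Id>42</Id>'
    u'<ActivePartnership>true</ActivePartnership>'
    u'<Description>Gutschein f\xfcr B\xfccher</Description>'
    u'</VoucherCode></Vouchers>'
).encode('utf-8')


def child_text(node, tag):
    """Return the text of the first <tag> descendant as decoded text."""
    return node.getElementsByTagName(tag)[0].firstChild.data


def active_vouchers(raw_xml):
    """Return (id, description) pairs for vouchers with an active partnership."""
    dom = parseString(raw_xml)  # the parser decodes the UTF-8 bytes itself
    pairs = []
    for voucher in dom.getElementsByTagName('VoucherCode'):
        if child_text(voucher, 'ActivePartnership') == 'true':
            # No str() and no .decode(): the values are already text.
            pairs.append((child_text(voucher, 'Id'),
                          child_text(voucher, 'Description')))
    return pairs


vouchers = active_vouchers(SAMPLE_XML)
# Encode explicitly only where bytes are really needed:
description_bytes = vouchers[0][1].encode('utf-8')
```

In the handler from the question this would correspond to assigning firstChild.data to coupon.description directly. Whether the property accepts the value unchanged is best checked against the datastore documentation.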

You could also use:

string2 = cgi.escape(string).encode("latin-1", "xmlcharrefreplace") 

This replaces all non-Latin-1 characters with XML character references.
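A small sketch of what that error handler does, under two assumptions. First, xml.sax.saxutils.escape is swapped in for cgi.escape here because cgi.escape is not available in newer Python versions. Second, the function name to_char_refs and the sample string are invented for illustration. Note that "ü" itself is a Latin-1 character, so with "latin-1" it stays a raw byte; only encoding to "ascii" turns it into a character reference.

```python
from xml.sax.saxutils import escape  # stand-in for cgi.escape (assumption)


def to_char_refs(text, encoding="latin-1"):
    """Escape &, <, > and replace characters the target encoding cannot
    represent with numeric XML character references."""
    return escape(text).encode(encoding, "xmlcharrefreplace")


# Hypothetical sample: \xfc is "ü" (in Latin-1), \u20ac is the euro sign (not in Latin-1).
sample = u"f\xfcr 5 \u20ac & mehr"

latin1_bytes = to_char_refs(sample)          # "ü" kept as the single byte 0xfc
ascii_bytes = to_char_refs(sample, "ascii")  # "ü" becomes &#252;
```

The ASCII variant yields a pure-ASCII byte string, but the stored value then contains markup such as &#252; that has to be interpreted again when it is displayed.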

I solved the problem for now by changing the description to a TextProperty, which didn't give any errors. I am aware that I won't be able to sort or filter on it, for example, but for the description this should be OK.

Background info: https://developers.google.com/appengine/docs/python/datastore/typesandpropertyclasses#TextProperty
