Google App Engine: UnicodeDecodeError in bulk data upload

I'm getting an odd error with Google App Engine devserver 1.3.5, and Python 2.5.4, on Windows.

A sample row in the CSV:

EQS,550,foobar,"<some><html><garbage /></html></some>",odp,Ti4=,http://url.com,success

The error:

..................................................................................................................[ERROR   ] [Thread-1] WorkerThread:
Traceback (most recent call last):
  File "C:\Program Files\Google\google_appengine\google\appengine\tools\adaptive_thread_pool.py", line 150, in WorkOnItems
    status, instruction = item.PerformWork(self.__thread_pool)
  File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkloader.py", line 695, in PerformWork
    transfer_time = self._TransferItem(thread_pool)
  File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkloader.py", line 852, in _TransferItem
    self.request_manager.PostEntities(self.content)
  File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkloader.py", line 1296, in PostEntities
    datastore.Put(entities)
  File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore.py", line 282, in Put
    req.entity_list().extend([e._ToPb() for e in entities])
  File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore.py", line 687, in _ToPb
    properties = datastore_types.ToPropertyPb(name, values)
  File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore_types.py", line 1499, in ToPropertyPb
    pbvalue = pack_prop(name, v, pb.mutable_value())
  File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore_types.py", line 1322, in PackString
    pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 36: ordinal not in range(128)
[INFO    ] Unexpected thread death: Thread-1
[INFO    ] An error occurred. Shutting down...
..[ERROR   ] Error in Thread-1: 'ascii' codec can't decode byte 0xe8 in position 36: ordinal not in range(128)

Is the error being generated by an issue with a base64 string, of which there is one in every row?

KGxwMAoobHAxCihTJ0JJT0VFJwpwMgpJMjYxMAp0cDMKYWEu

KGxwMAoobHAxCihTJ01BVEgnCnAyCkkyOTQwCnRwMwphYS4=

The data loader:

import base64

from google.appengine.tools import bulkloader


class CourseLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'Course',
                                   [('dept_code', str),
                                    ('number', int),
                                    ('title', str),
                                    ('full_description', str),
                                    ('unparsed_pre_reqs', str),
                                    ('pickled_pre_reqs', lambda x: base64.b64decode(x)),
                                    ('course_catalog_url', str),
                                    ('parse_succeeded', lambda x: x == 'success')
                                   ])

loaders = [CourseLoader]

Is there a way to tell from the traceback which row caused the error?

UPDATE: It looks like two characters are causing the errors: è and ®. How can I get Google App Engine to handle them?

It looks like some row of the CSV contains non-ASCII data (perhaps a LATIN SMALL LETTER E WITH GRAVE, which is what 0xe8 is in ISO-8859-1, for example), and yet you're mapping it to str. The text columns should be mapped to unicode instead, and I believe the CSV should be in UTF-8.
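A minimal sketch of both halves of that suggestion, assuming the file was saved in a single-byte encoding such as Latin-1 (an assumption; check how the CSV was exported). `transcode_to_utf8` and `decode_utf8` are hypothetical helper names, not part of the App Engine SDK. The code is written for a current Python 3 interpreter run outside App Engine, so it works on bytes; under Python 2.5 the same `.decode('utf-8')` call works on a str.

```python
import io


def transcode_to_utf8(src_path, dst_path, source_encoding='latin-1'):
    """Rewrite a text file as UTF-8. source_encoding is a guess; adjust it."""
    # newline='' keeps the line endings exactly as they are in the source file.
    with io.open(src_path, 'r', encoding=source_encoding, newline='') as src:
        text = src.read()
    with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        dst.write(text)


def decode_utf8(raw):
    """Converter for a text column: UTF-8 encoded bytes in, unicode text out."""
    return raw.decode('utf-8')
```

In the loader, a text column would then use the converter in place of str, for example `('title', decode_utf8)`. This is untested against SDK 1.3.5.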

To find out whether any line of a text file contains non-ASCII data, a simple Python snippet will help, e.g.:

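>>> f = open('thefile.csv', 'rb')
>>> prob = []
>>> for i, line in enumerate(f):
...   try:
...     unicode(line)  # implicit ASCII decode; fails on any byte >= 0x80
...   except UnicodeDecodeError:
...     prob.append(i)  # i is 0-based, so the first line of the file is 0
...
>>> f.close()
>>> print 'Problems in %d lines:' % len(prob)
>>> print prob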
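To also see which byte is at fault and where it sits, the exception itself carries that information. The sketch below is a variation on the same idea for a current Python 3 interpreter (it reads the file as bytes); `find_non_ascii_lines` is a hypothetical helper name. Line numbers are 1-based and count physical lines, so a quoted CSV field that contains a newline spans more than one of them.

```python
def find_non_ascii_lines(path):
    """Return (line_number, offset, bad_byte) for each line that is not pure ASCII."""
    problems = []
    with open(path, 'rb') as f:
        for line_number, line in enumerate(f, 1):
            try:
                line.decode('ascii')
            except UnicodeDecodeError as err:
                # err.start is the offset of the first undecodable byte in this line.
                bad_byte = line[err.start:err.start + 1]
                problems.append((line_number, err.start, bad_byte))
    return problems
```

Calling `find_non_ascii_lines('thefile.csv')` returns one tuple per offending line. Note that the position in the traceback (36) is an offset within the property value being packed, not within the CSV line, so the two numbers will generally differ.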
