I want to parse an XML document. I have stored the XML documents in the datastore using the model below:
from google.appengine.ext import db

class XMLdocs(db.Expando):
    id = db.IntegerProperty()
    name = db.StringProperty()
    content = db.BlobProperty()
Below is my code:
import StringIO
from xml.sax import make_parser

parser = make_parser()
curHandler = BasketBallHandler()
parser.setContentHandler(curHandler)
for q in XMLdocs.all():
    parser.parse(StringIO.StringIO(q.content))
I am getting the error below:
'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 517, in __call__
handler.post(*groups)
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/base_handler.py", line 59, in post
self.handle()
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 168, in handle
scan_aborted = not self.process_entity(entity, ctx)
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 233, in process_entity
handler(entity)
File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 71, in process
parser.parse(StringIO.StringIO(q.content))
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 136, in characters
print ch
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
The best answer to this problem depends on your environment, specifically on what encoding your terminal expects.
The quickest one-line solution is to encode everything you print to ASCII, which your terminal is almost certain to accept, while discarding characters that you cannot print:
print ch  # fails with UnicodeEncodeError
print ch.encode('ascii', 'ignore')  # works, but silently drops non-ASCII characters
The better solution is to change your terminal's encoding to utf-8, and encode everything as utf-8 before printing. You should get in the habit of thinking about your unicode encoding EVERY time you print or read a string.
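To make the trade-off concrete, here is a minimal sketch comparing the error handlers of encode() on the character from the traceback. The variable names are only illustrative, and the snippet is written so that it runs on both Python 2 and Python 3.

```python
# Compare the encode() error handlers on the character from the traceback.
ch = u'\xef'  # the character the traceback complains about

ascii_ignored = ch.encode('ascii', 'ignore')    # the character is dropped
ascii_replaced = ch.encode('ascii', 'replace')  # the character becomes '?'
utf8_bytes = ch.encode('utf-8')                 # lossless, encoded as two bytes

print(repr(ascii_ignored))
print(repr(ascii_replaced))
print(repr(utf8_bytes))
```

Only the UTF-8 version keeps the original character, which is why it is the better choice when the terminal can display UTF-8.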
Simply putting .encode('utf-8') at the end of the object does the job in recent versions of Python.
It seems you are hitting a UTF-8 byte order mark (BOM). Try decoding the content to a unicode string with the BOM stripped out:
import codecs
content = unicode(q.content.strip(codecs.BOM_UTF8), 'utf-8')
parser.parse(StringIO.StringIO(content))
I used strip instead of lstrip because in your case you had multiple occurrences of the BOM, possibly due to concatenated file contents.
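As a rough sketch of the same idea, the helper below removes whole BOM sequences from the start of the data before decoding. The name decode_without_bom and the sample XML are made up for illustration and are not part of the original code. Note that strip(codecs.BOM_UTF8) treats its argument as a set of individual bytes and trims them from both ends, whereas this loop only removes complete BOMs at the start.

```python
import codecs

def decode_without_bom(raw):
    """Decode UTF-8 bytes, dropping any BOMs at the start of the data."""
    # Hypothetical helper: remove every complete leading BOM, one at a time.
    while raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode('utf-8')

# Made-up sample: a BOM followed by UTF-8 encoded XML.
sample = codecs.BOM_UTF8 + u'<team>Caf\xe9</team>'.encode('utf-8')
text = decode_without_bom(sample)
print(repr(text))
```

Neither approach removes a BOM that sits in the middle of concatenated content, so that case would still need separate handling.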
This worked for me:
from django.utils.encoding import smart_str
content = smart_str(content)
According to your traceback, the problem is the print statement on line 136 of parseXML.py. Unfortunately you didn't see fit to post that part of your code, but I'm going to guess it is just there for debugging. If you change it to:
print repr(ch)
then you should at least see what you are trying to print.
The problem is that you're trying to print a unicode character to a possibly non-unicode terminal. You need to encode it with the 'replace' option before printing it, e.g. print ch.encode(sys.stdout.encoding, 'replace').
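One caveat is that sys.stdout.encoding can be None, for example when output is not going to a terminal, so a fallback is useful. The following is a minimal sketch under that assumption. The helper name printable and the sample string are hypothetical.

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Fall back to ASCII when stdout does not report an encoding (assumption).
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'replace').decode(encoding)

# Uses whatever encoding stdout reports, so this print should not raise.
print(printable(u'\xefnside'))

# Forcing ASCII shows the replacement explicitly.
shown_on_ascii = printable(u'\xefnside', 'ascii')
```

Round-tripping through the target encoding keeps the value a text string, so the same helper can be used for debugging output regardless of where stdout points.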
An easy way to overcome this problem is to set your default encoding to utf8. The following is an example:
import sys
reload(sys)
sys.setdefaultencoding('utf8')