
Parsing Stackoverflow Posts.xml data dump file crashes the program, gives ascii encoding error

I have downloaded the Stack Overflow June 2013 data dump and am now parsing the XML files and storing them in a MySQL database. I am using Python's ElementTree to do this, but it keeps crashing with encoding errors.

Snippet of parse code:

post = open('a.xml', 'r')
a = post.read()  
tree = xml.parse((a).encode('ascii', 'ignore')) # I also tried .encode('utf-8').strip() it doesn't work

#Get the root node

row = tree.findall("row")

It gives me the following error:

'ascii' codec can't encode character u'\u2019' in position 248: ordinal not in range(128)

I also tried using the following but the problem persists.

.encode('ascii', 'ignore')

Any advice on fixing the problem would be appreciated. A link to a clean copy of the data would also help.

My final goal is to convert the data into RDF, so if anyone has the Stack Overflow data dump in RDF format, I'd be grateful.

Thanks in advance!

P.S. This is the XML row that causes the problem and crashes the program:

<row Id="99" PostTypeId="2" ParentId="88" CreationDate="2008-08-01T14:55:08.477" Score="2" Body="&lt;blockquote&gt;&#xD;&#xA;  &lt;p&gt;The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds. Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz. In such cases, the time resolution will be less accurate. &lt;/p&gt;&#xD;&#xA;&lt;/blockquote&gt;&#xD;&#xA;&#xD;&#xA;&lt;p&gt;I obtained this answer from &lt;a href=&quot;http://www.informit.com/guides/content.aspx?g=cplusplus&amp;amp;seqNum=272&quot; rel=&quot;nofollow&quot;&gt;High Resolution Time Measurement and Timers, Part I&lt;/a&gt;&lt;/p&gt;" OwnerUserId="25" LastActivityDate="2008-08-01T14:55:08.477" />

Edit: @Arjan the solution you mentioned here doesn't work for me.

You didn't mention which version of Python you are using, and Python 2 and Python 3 handle Unicode differently, so that may be a factor. Since you are having trouble, my guess is that you are using 2.x; version 3 typically handles Unicode more gracefully.
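To make the error message from the question concrete, here is a minimal sketch of what the 'ascii' codec does with U+2019 (the curly apostrophe in "system’s"). The variable names are illustrative, and the snippet only reproduces the kind of error reported; it does not claim to show where in your program the error was raised. It should run under both Python 2.7 and Python 3.

```python
text = u"the system\u2019s timer"  # contains U+2019, a right single quotation mark

# Strict ASCII encoding fails, because U+2019 has no ASCII equivalent.
try:
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# 'ignore' silently drops the character instead of raising.
lossy = text.encode("ascii", "ignore")

# UTF-8 can represent the character, and decoding returns the original text.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(error_message)
print(lossy)
```

Note that 'ignore' loses data (the apostrophe disappears), so encoding to UTF-8 is usually the safer choice when the text has to be turned into bytes.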

ElementTree can parse an XML file (or a string) containing Unicode without any call to str.encode(). Assuming Python 2.7, the code below parses an XML file containing the row with the Unicode character from your question:

First, here are the contents of an xml file called 'test.xml', created for testing, which includes your problematic row:

<?xml version="1.0"?>
<rows>
    <row Id="99" PostTypeId="2" ParentId="88" CreationDate="2008-08-01T14:55:08.477" Score="2" Body="&lt;blockquote&gt;&#xD;&#xA;  &lt;p&gt;The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds. Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz. In such cases, the time resolution will be less accurate. &lt;/p&gt;&#xD;&#xA;&lt;/blockquote&gt;&#xD;&#xA;&#xD;&#xA;&lt;p&gt;I obtained this answer from &lt;a href=&quot;http://www.informit.com/guides/content.aspx?g=cplusplus&amp;amp;seqNum=272&quot; rel=&quot;nofollow&quot;&gt;High Resolution Time Measurement and Timers, Part I&lt;/a&gt;&lt;/p&gt;" OwnerUserId="25" LastActivityDate="2008-08-01T14:55:08.477" />
</rows>

Code to parse the above file:

>>> import xml.etree.ElementTree as xml
>>> tree = xml.parse('test.xml') # Assuming code lives in same directory as file
>>> # File is now parsed into variable 'tree',
>>> # and we can check the problematic unicode character is in there
>>> body = tree.find('row').attrib['Body']
>>> # We can look at the escaped unicode character...
>>> body[238:256]
u'the system\u2019s timer'
>>> # Or we can view it represented as we would expect to read it
>>> print body[238:256]
the system’s timer
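The same check can be written as a standalone script. This is a rough sketch under a few assumptions: the sample document is a hypothetical miniature of Posts.xml with a shortened Body attribute, the file is written to a temporary directory, and the alias ET is used instead of xml. It also shows that parse() expects a file name (or file object) rather than the file's contents, while fromstring() is the call that accepts content already held in memory.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical miniature of Posts.xml with one non-ASCII character (U+2019).
SAMPLE = u'''<?xml version="1.0" encoding="utf-8"?>
<rows>
    <row Id="99" PostTypeId="2" Body="the system\u2019s timer" />
</rows>
'''

# Write the sample to a temporary file as UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), "test.xml")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(SAMPLE)

# Option 1: hand the parser a path; it reads and decodes the bytes itself.
tree = ET.parse(path)
body_from_file = tree.find("row").attrib["Body"]

# Option 2: if the content is already in memory, keep it as raw bytes and
# use fromstring(), which returns the root element directly.
with io.open(path, "rb") as handle:
    raw = handle.read()
root = ET.fromstring(raw)
body_from_bytes = root.find("row").attrib["Body"]

print(repr(body_from_file))
```

In both cases no manual encode() or decode() call is involved; the parser works from the raw bytes and the encoding named in the XML declaration.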

If using this as an example still produces an error for you, perhaps you can provide some additional information about your problem.
