
Trying to parse large xml file in Python - Memory Errors

So I'm a beginner 'scraper' with not a whole truckload of programming experience.

I'm using Python, in the Canopy environment, to scrape some downloaded XML files with the xml.dom parser. For now I only want to read the tags from the very first us-bibliographic-data-grant element (which is why I'm using [0]), just to work out how I want to parse and store the entire dataset, rather than doing it all at once. An excerpt from the XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0606726-20091229.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20091214" date-publ="20091229">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0606726</doc-number>
<kind>S1</kind>
<date>20091229</date>
</document-id>
</publication-reference>
<application-reference appl-type="design">
<document-id>
<country>US</country>
<doc-number>29299001</doc-number>
<date>20071217</date>

My code so far looks like this:

from xml.dom import minidom

filename = "C:/Users/SMOLENSK/Documents/Inventor Research/xml_2009/ipg091229.xml"

f = open(filename, 'r')
doc = f.read()
f.close()

xmldata = '<root>' + doc + '</root>'
data = minidom.parse(xmldata)

US_Biblio = xmldata.getElementsByTagName("us-bibliographic-data-grant")[0]
pat_num = US_Biblio.getElementsByTagName("doc-number")[0]
dates = pat_num.getElementsByTagName("date")

for date in dates:
    print(date)

The one time the code ran to completion I got some MemoryError messages, but unfortunately I was unable to jot down exactly what they said. Because of the sheer amount of data (this file alone is 4.6 million lines), the operation crashes almost every time and I'm unable to reproduce the errors.

Is there anything anyone can see wrong with the code? My code is parsing the entire dataset before it starts storing each tag name but might there be a way to parse only a certain amount? Perhaps just make a new xml file with the first set.

If you're wondering, I used the <root> tags to bypass the issue of the

ExpatError: junk after line xxx

I was getting beforehand. I know my coding skills aren't amazing, so hopefully I did not make a simple and disgusting programming error.

Try:

with open(filename, 'r') as f:
    data = minidom.parse(f)

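If you really need the <root> tag you may need to mess around a bit. minidom.parse only accepts a filename or a file object, so build the string and hand it to minidom.parseString instead. Note that this still reads the whole file into memory, and that it only parses if the wrapped text contains no <?xml ...?> declaration or DOCTYPE, since those are not allowed inside an element:

with open(filename, 'r') as f:
    data = minidom.parseString('<root>' + f.read() + '</root>')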
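Another option, if the goal is only to look at the first patent: the excerpt suggests the file may be many complete XML documents concatenated together, which would also explain the "junk after" error. Under that assumption, a rough sketch is to read lines until the next <?xml declaration and parse just that first chunk. The helper names iter_documents and first_text are made up for illustration, and the in-memory sample stands in for the real file:

```python
import io
from xml.dom import minidom


def iter_documents(lines):
    """Yield each concatenated XML document as one string.

    Assumption: every document starts on a line beginning with '<?xml'.
    """
    buffer = []
    for line in lines:
        if line.startswith('<?xml') and buffer:
            yield ''.join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield ''.join(buffer)


def first_text(node, tag):
    """Return the text inside the first <tag> element below node."""
    element = node.getElementsByTagName(tag)[0]
    return element.firstChild.data


# Small stand-in for the real file: two documents back to back.
sample = io.StringIO(
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant><us-bibliographic-data-grant><publication-reference>'
    '<document-id><doc-number>D0606726</doc-number><date>20091229</date>'
    '</document-id></publication-reference></us-bibliographic-data-grant>'
    '</us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant/>\n'
)

# Only the first document is read and parsed; the rest is never loaded.
first_doc = next(iter_documents(sample))
dom = minidom.parseString(first_doc.encode('utf-8'))

biblio = dom.getElementsByTagName('us-bibliographic-data-grant')[0]
doc_number = first_text(biblio, 'doc-number')
pub_date = first_text(biblio, 'date')
print(doc_number, pub_date)
```

With the real data you would pass the open file object in place of sample, for example inside with open(filename, 'r', encoding='utf-8') as f. Because iter_documents is a generator, looping over it handles one patent at a time, so memory use stays close to the size of a single document rather than the whole file.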
