
Memory error while parsing large XML files

I have to read large XML files and extract the XML info into matrices arranged column-wise.

The XML structure is as follows (the file starts with several unstructured header lines):

<TimeStep TS="1">
<Particle PT="1">
<![CDATA[100,1000]]>
</Particle>
<Particle PT="2">
<![CDATA[200,2000]]>
</Particle>
</TimeStep>

<TimeStep TS="2">
<Particle PT="1">
<![CDATA[101,1001]]>
</Particle>
<Particle PT="2">
<![CDATA[202,2002]]>
</Particle>
</TimeStep>

and so on

The target matrix structure is column-wise, as follows:
1st column = TimeStep TS
2nd column = Particle PT
3rd & 4th columns = the two values inside the square brackets (CDATA)

1 1 100 1000
1 2 200 2000
2 1 101 1001
2 2 202 2002

So far I have managed to do this as below:

import numpy as np
import xml.etree.ElementTree as ET

filename = 'ParticleTrack.xml'

xmlfile = ET.parse(filename)

# Pick only TimeSteps that contain Particles (there may be TimeSteps with no Particles)
tt = xmlfile.findall(".//Particle/../../[@TS]")

data = []
for jj in tt:
    ts = jj.get('TS')
    pt = jj.findall(".//Particle[@PT]")
    for ii in range(len(pt)):
        x, y = pt[ii].text.split(",")
        data.append([ts, pt[ii].get('PT'), x, y])

data = np.array(data).astype(float)  # np.float was removed in NumPy 1.24; use float

My computer has 64 GB of RAM, and when XML files are somewhat larger than 10 GB I run out of memory, since I am loading the whole XML file at once while also building the output matrix.

I have read about time- and memory-efficient stream parsing of large XML files with lxml, iterparse, etc., but I do not know how to apply it to my data.

Thanks, I would appreciate any help.

As you mention, for large XML files consider iterparse for fast stream processing: it reads the tree incrementally rather than all at once. In each iteration, extract from the attribute dictionary or text of the TimeStep and Particle elements:

import numpy as np
from xml.etree.ElementTree import iterparse
# xml.etree.cElementTree was removed in Python 3.9; plain ElementTree
# now uses the C accelerator automatically when available

filename = 'ParticleTrack.xml'
data = []

for event, elem in iterparse(filename, events=("start", "end")):
    if elem.tag == "TimeStep" and event == 'start':
        TS = elem.attrib['TS']          # attributes are already complete at "start"

    if elem.tag == "Particle" and event == 'end':
        cdata = elem.text.split(',')    # text (CDATA) is only guaranteed at "end"
        data.append([TS, elem.attrib['PT'], cdata[0], cdata[1]])

    if elem.tag == "TimeStep" and event == 'end':
        elem.clear()                    # release the processed subtree to keep memory flat

mat = np.array(data).astype(float)      # np.float was removed in NumPy 1.24
print(mat)

# [[1.000e+00 1.000e+00 1.000e+02 1.000e+03]
#  [1.000e+00 2.000e+00 2.000e+02 2.000e+03]
#  [2.000e+00 1.000e+00 1.010e+02 1.001e+03]
#  [2.000e+00 2.000e+00 2.020e+02 2.002e+03]]
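If even the assembled `data` list outgrows RAM, one option (a sketch, not part of the answer above; the tiny inline XML and the temp-file name are stand-ins) is to stream each row straight to a CSV file during parsing, so neither the XML tree nor the full Python list lives in memory, and only load the finished file with NumPy at the end:

```python
import csv
import io
import tempfile
from xml.etree.ElementTree import iterparse

import numpy as np

# Tiny stand-in for ParticleTrack.xml (the real file is ~10 GB and has a root element).
XML = b"""<Root>
<TimeStep TS="1"><Particle PT="1"><![CDATA[100,1000]]></Particle>
<Particle PT="2"><![CDATA[200,2000]]></Particle></TimeStep>
<TimeStep TS="2"><Particle PT="1"><![CDATA[101,1001]]></Particle>
<Particle PT="2"><![CDATA[202,2002]]></Particle></TimeStep>
</Root>"""

out = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
writer = csv.writer(out)

ts = None
for event, elem in iterparse(io.BytesIO(XML), events=("start", "end")):
    if event == "start" and elem.tag == "TimeStep":
        ts = elem.attrib["TS"]           # attributes are complete at "start"
    elif event == "end" and elem.tag == "Particle":
        x, y = elem.text.split(",")      # CDATA text is only reliable at "end"
        writer.writerow([ts, elem.attrib["PT"], x, y])  # row goes to disk, not RAM
    elif event == "end" and elem.tag == "TimeStep":
        elem.clear()                     # free the subtree we just processed
out.close()

mat = np.loadtxt(out.name, delimiter=",")
print(mat)
```

For a 10 GB input this keeps peak memory at roughly one TimeStep subtree; the final `np.loadtxt` still needs the whole matrix to fit, but `np.memmap` or chunked reads could replace it if it does not.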

cElementTree is quite slow; working with lxml, XPath and iter() makes it a lot faster. Something like this:

import lxml.etree as et
import numpy as np
import pandas as pd

def process_xml(filename):
    # parse from the file on disk; et.fromstring() expects an XML string, not a filename
    parse_xml = et.parse(filename).getroot()
    items = []

    for node in parse_xml.iter('ARTIKEL'):   # example tags from the answerer's own data
        ean = node.xpath('.//ARTIKELEAN/text()')
        stock1 = node.xpath('.//INSTOCK/text()')
        items.append([ean, stock1])

    dfcols_stock = ['ean', 'stock1']
    items = pd.DataFrame(items, columns=dfcols_stock)
    items = items.applymap(lambda x: x if not isinstance(x, list) else x[0] if len(x) else '')
    return items

data = process_xml('ParticleTrack.xml')
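The snippet above targets a different schema (ARTIKEL/INSTOCK from the answerer's own data). Adapted to the question's TimeStep/Particle layout, the same iter-based pattern might look roughly like this (a sketch using the stdlib ElementTree so it runs without lxml installed; with lxml, `import lxml.etree as ET` is a near drop-in replacement and considerably faster — note this variant still parses the whole tree, so it trades memory for speed):

```python
import io
import xml.etree.ElementTree as ET

import numpy as np

# Tiny inline stand-in for ParticleTrack.xml.
XML = b"""<Root>
<TimeStep TS="1"><Particle PT="1"><![CDATA[100,1000]]></Particle>
<Particle PT="2"><![CDATA[200,2000]]></Particle></TimeStep>
<TimeStep TS="2"><Particle PT="1"><![CDATA[101,1001]]></Particle>
<Particle PT="2"><![CDATA[202,2002]]></Particle></TimeStep>
</Root>"""

root = ET.parse(io.BytesIO(XML)).getroot()

rows = []
for step in root.iter("TimeStep"):           # iter() walks matching descendants
    ts = step.get("TS")
    for particle in step.iter("Particle"):
        x, y = particle.text.split(",")      # CDATA content arrives as plain text
        rows.append([ts, particle.get("PT"), x, y])

mat = np.array(rows, dtype=float)            # NumPy converts the strings to floats
print(mat)
```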
