How do I parse some data from a large XML file?

Question

I need to extract the location and radius data from a large XML file that is formatted as below and store the data in a 2-dimensional ndarray. This is my first time using Python and I can't find anything about the best way to do this.

<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">
0,0,0,0.0,0.0,0.0,77.0645361927206,-0.1001871531330136,-0.0013358287084401814,4.523853439106942,234.14575280979898,123.92820420047076,0.0,0.6259920275663835;
0,0,0,0.0,0.0,0.0,108.5705297969604,-0.1411462759900182,-0.001881950346533576,1.0429122163754276,144.1066875513379,72.24884428367467,0.0,0.7017581019907897;
.
.
.
</species>

Edit: I mean "large" by human standards. I am not having any memory issues with it.

Answer 1

You essentially have CSV data in the XML text value.

Use ElementTree to parse the XML, then use numpy.genfromtxt() to load that text into an array:

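import numpy
from xml.etree import ElementTree as ET

tree = ET.parse('yourxmlfilename.xml')
# Locate the <species> element by its name attribute
species = tree.find(".//species[@name='MyHeterotrophEPS']")
# The header attribute is a comma-separated list of column names
names = species.attrib['header']
# Strip the trailing ';' from each line before handing it to genfromtxt
array = numpy.genfromtxt((line.rstrip(';') for line in species.text.splitlines()),
    delimiter=',', names=names)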

Note the generator expression with its str.splitlines() call; it turns the text of the XML element into a sequence of lines, which .genfromtxt() is quite happy to receive. The trailing ; character is stripped from each line along the way.

For your sample input (minus the . lines), this results in:

array([ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 77.0645361927206, -0.1001871531330136, -0.0013358287084401814, 4.523853439106942, 234.14575280979898, 123.92820420047076, 0.0, 0.6259920275663835),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 108.5705297969604, -0.1411462759900182, -0.001881950346533576, 1.0429122163754276, 144.1066875513379, 72.24884428367467, 0.0, 0.7017581019907897)], 
      dtype=[('family', '<f8'), ('genealogy', '<f8'), ('generation', '<f8'), ('birthday', '<f8'), ('biomass', '<f8'), ('inert', '<f8'), ('capsule', '<f8'), ('growthRate', '<f8'), ('volumeRate', '<f8'), ('locationX', '<f8'), ('locationY', '<f8'), ('locationZ', '<f8'), ('radius', '<f8'), ('totalRadius', '<f8')])
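
The result above is a 1-D structured array, while the question asks for a 2-dimensional ndarray holding only the location and radius columns. A minimal sketch of that last step, assuming the two sample rows from the question are embedded as a string (the names XML, wanted and data are illustrative, not from the original answer):

```python
import numpy as np
from xml.etree import ElementTree as ET

# Two rows copied from the question's sample; the "." placeholder lines are omitted.
XML = """<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">
0,0,0,0.0,0.0,0.0,77.0645361927206,-0.1001871531330136,-0.0013358287084401814,4.523853439106942,234.14575280979898,123.92820420047076,0.0,0.6259920275663835;
0,0,0,0.0,0.0,0.0,108.5705297969604,-0.1411462759900182,-0.001881950346533576,1.0429122163754276,144.1066875513379,72.24884428367467,0.0,0.7017581019907897;
</species>"""

species = ET.fromstring(XML)
lines = (line.rstrip(';') for line in species.text.splitlines())
records = np.genfromtxt(lines, delimiter=',', names=species.attrib['header'])

# Pick the named fields and stack them side by side as columns of a plain 2-D array.
wanted = ['locationX', 'locationY', 'locationZ', 'radius']
data = np.column_stack([records[name] for name in wanted])
print(data.shape)
```

Each row of data is one individual and the columns follow the order given in wanted, so data[:, 3] is the radius column.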

Answer 2

If your XML is just that species node, it's pretty simple, and Martijn Pieters has already explained it better than I can.

But if you've got a ton of species nodes in the document, and it's too large to fit the whole thing into memory, you can use iterparse instead of parse:

import numpy as np
import xml.etree.ElementTree as ET

# iterparse yields each element once its closing tag has been read
for event, node in ET.iterparse('species.xml'):
    if node.tag == 'species':
        name = node.attrib['name']
        names = node.attrib['header']
        csvdata = (line.rstrip(';') for line in node.text.splitlines())
        array = np.genfromtxt(csvdata, delimiter=',', names=names)
        # do something with the array.
        # Drop the node's text and children so memory does not keep growing
        node.clear()

This won't help if you just have one super-gigantic species node, because even iterparse (or similar solutions like a SAX parser) parses one entire node at a time. You'd need to find an XML library that lets you stream the text of large nodes, and off the top of my head, I can't think of any stdlib or popular third-party parsers that can do that.
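
For the many-nodes case, the loop can be wrapped in a generator so the caller receives one 2-D array per species node. This is only a sketch under assumptions: the function name iter_species, the simulation root tag, the second species name and the shortened four-column header are all made up for the demo, and the document is read from an in-memory buffer rather than a real file.

```python
import io
import numpy as np
import xml.etree.ElementTree as ET

def iter_species(source, columns):
    """Yield (name, 2-D array of the requested columns) per <species> element."""
    for event, node in ET.iterparse(source):  # default: 'end' events only
        if node.tag != 'species':
            continue
        rows = (line.rstrip(';') for line in node.text.splitlines())
        records = np.genfromtxt(rows, delimiter=',', names=node.attrib['header'])
        # A node with a single row gives a 0-d array; make it 1-D for uniform handling.
        records = np.atleast_1d(records)
        yield node.attrib['name'], np.column_stack([records[c] for c in columns])
        # Runs when the caller asks for the next item: free this node's text.
        node.clear()

# Hypothetical miniature document standing in for a large file.
DOC = b"""<simulation>
<species name="MyHeterotrophEPS" header="locationX,locationY,locationZ,radius">
1.0,2.0,3.0,0.5;
4.0,5.0,6.0,0.7;
</species>
<species name="OtherSpecies" header="locationX,locationY,locationZ,radius">
7.0,8.0,9.0,0.9;
</species>
</simulation>"""

columns = ['locationX', 'locationY', 'locationZ', 'radius']
results = dict(iter_species(io.BytesIO(DOC), columns))
for name, arr in results.items():
    print(name, arr.shape)
```

Collecting everything into a dict, as the demo does, keeps all arrays in memory; with a genuinely large file you would process each array inside the loop instead.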

Answer 3

If the file is really large, use ElementTree or SAX.

If the file is not that large (i.e. it fits into memory), minidom might be easier to work with.

Each line seems to be a simple string of comma-separated numbers, so you can simply do line.split(',').
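
A rough sketch of the minidom plus line.split(',') route, assuming the file fits in memory. The inline XML string and its shortened four-column header are stand-ins for the real data, and the variable names are illustrative.

```python
from xml.dom import minidom
import numpy as np

# Stand-in for the real file; the header is shortened for the example.
XML = """<species name="MyHeterotrophEPS" header="locationX,locationY,locationZ,radius">
1.0,2.0,3.0,0.5;
4.0,5.0,6.0,0.7;
</species>"""

dom = minidom.parseString(XML)
species = dom.getElementsByTagName('species')[0]
header = species.getAttribute('header').split(',')

# Join all text children of the element in case the text is split across nodes.
text = ''.join(child.data for child in species.childNodes
               if child.nodeType == child.TEXT_NODE)

rows = []
for line in text.splitlines():
    line = line.strip().rstrip(';')  # drop whitespace and the trailing ';'
    if not line:
        continue                     # skip the blank lines around the data
    rows.append([float(value) for value in line.split(',')])

table = np.array(rows)                      # 2-D array, one row per line
radius = table[:, header.index('radius')]   # look a column up by its header name
print(table.shape, radius)
```

Compared with numpy.genfromtxt(), this does the splitting and float conversion by hand, which is more code but makes each step explicit.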

3 answers

Answer 1: 4, accepted, 2013-06-06 21:17:18

Answer 2: 2, 2013-06-06 21:21:49

Answer 3: 0, 2013-06-06 21:13:48
