How do I parse some of the data from a large XML file?

I need to extract the location and radius data from a large XML file that is formatted as below and store the data in a 2-dimensional ndarray. This is my first time using Python, and I can't find anything about the best way to do this.

<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">
0,0,0,0.0,0.0,0.0,77.0645361927206,-0.1001871531330136,-0.0013358287084401814,4.523853439106942,234.14575280979898,123.92820420047076,0.0,0.6259920275663835;
0,0,0,0.0,0.0,0.0,108.5705297969604,-0.1411462759900182,-0.001881950346533576,1.0429122163754276,144.1066875513379,72.24884428367467,0.0,0.7017581019907897;
.
.
.
</species>

Edit: I mean "large" by human standards. I am not having any memory issues with it.

You essentially have CSV data in the XML text value.

Use ElementTree to parse the XML, then use numpy.genfromtxt() to load that text into an array:

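import numpy
from xml.etree import ElementTree as ET

tree = ET.parse('yourxmlfilename.xml')
species = tree.find(".//species[@name='MyHeterotrophEPS']")
# The header attribute is a comma-separated list of column names
names = species.attrib['header']
# Strip the trailing ';' from each line before handing the lines to genfromtxt
array = numpy.genfromtxt((line.rstrip(';') for line in species.text.splitlines()),
    delimiter=',', names=names)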

Note the generator expression with a str.splitlines() call; this turns the text of the XML element into a sequence of lines, which .genfromtxt() is quite happy to receive. The generator also removes the trailing ; character from each line.

For your sample input (minus the . lines), this results in:

array([ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 77.0645361927206, -0.1001871531330136, -0.0013358287084401814, 4.523853439106942, 234.14575280979898, 123.92820420047076, 0.0, 0.6259920275663835),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 108.5705297969604, -0.1411462759900182, -0.001881950346533576, 1.0429122163754276, 144.1066875513379, 72.24884428367467, 0.0, 0.7017581019907897)], 
      dtype=[('family', '<f8'), ('genealogy', '<f8'), ('generation', '<f8'), ('birthday', '<f8'), ('biomass', '<f8'), ('inert', '<f8'), ('capsule', '<f8'), ('growthRate', '<f8'), ('volumeRate', '<f8'), ('locationX', '<f8'), ('locationY', '<f8'), ('locationZ', '<f8'), ('radius', '<f8'), ('totalRadius', '<f8')])
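
The result above is a structured array with one named field per column, while the question asks for a plain 2-dimensional ndarray of the location and radius data. One possible way to get there is to stack the wanted fields as columns. This is only a sketch: the helper name species_columns, the <root> wrapper element and the inline sample string are assumptions made so the example runs on its own.

```python
import numpy as np
from xml.etree import ElementTree as ET

# Inline copy of the sample data; the <root> wrapper is an assumption.
SAMPLE_XML = """<root>
<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">
0,0,0,0.0,0.0,0.0,77.0645361927206,-0.1001871531330136,-0.0013358287084401814,4.523853439106942,234.14575280979898,123.92820420047076,0.0,0.6259920275663835;
0,0,0,0.0,0.0,0.0,108.5705297969604,-0.1411462759900182,-0.001881950346533576,1.0429122163754276,144.1066875513379,72.24884428367467,0.0,0.7017581019907897;
</species>
</root>"""


def species_columns(species, columns):
    """Hypothetical helper: return the named columns as a 2-D float array."""
    lines = (line.rstrip(';') for line in species.text.splitlines())
    # atleast_1d keeps the single-row case indexable like the multi-row case
    records = np.atleast_1d(
        np.genfromtxt(lines, delimiter=',', names=species.attrib['header']))
    # Each field is a 1-D array; column_stack puts them side by side
    return np.column_stack([records[name] for name in columns])


root = ET.fromstring(SAMPLE_XML)
species = root.find(".//species[@name='MyHeterotrophEPS']")
coords = species_columns(
    species, ['locationX', 'locationY', 'locationZ', 'radius'])
print(coords.shape)
```

Each row of coords then holds locationX, locationY, locationZ and radius for one record, in that order.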

If your XML is just that species node, it's pretty simple, and Martijn Pieters has already explained it better than I can.

But if you've got a ton of species nodes in the document, and it's too large to fit the whole thing into memory, you can use iterparse instead of parse:

import numpy as np
import xml.etree.ElementTree as ET

for event, node in ET.iterparse('species.xml'):
    if node.tag == 'species':
        # Element attributes live in the .attrib dict
        name = node.attrib['name']
        names = node.attrib['header']
        csvdata = (line.rstrip(';') for line in node.text.splitlines())
        array = np.genfromtxt(csvdata, delimiter=',', names=names)
        # do something with the array.
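
As a rough illustration of the loop above, the sketch below collects one array per species node and clears each node once it has been processed, so its text does not stay in memory. The function name load_species_arrays, the <simulation> root element and the species names A and B are made up for the example; a file name can be passed in place of the BytesIO object.

```python
import io
import numpy as np
import xml.etree.ElementTree as ET

# Tiny made-up document with two species nodes (names are hypothetical).
SAMPLE = b"""<simulation>
<species name="A" header="locationX,locationY,radius">
1.0,2.0,0.5;
3.0,4.0,0.25;
</species>
<species name="B" header="locationX,locationY,radius">
5.0,6.0,0.75;
</species>
</simulation>"""


def load_species_arrays(source):
    """Hypothetical helper: map each species name to its structured array."""
    arrays = {}
    # 'end' events fire once an element, including its text, is complete
    for event, node in ET.iterparse(source, events=('end',)):
        if node.tag != 'species' or not node.text:
            continue
        lines = [line.rstrip(';') for line in node.text.splitlines()]
        arrays[node.attrib['name']] = np.atleast_1d(
            np.genfromtxt(lines, delimiter=',', names=node.attrib['header']))
        # Drop the element's text and children now that they are parsed
        node.clear()
    return arrays


arrays = load_species_arrays(io.BytesIO(SAMPLE))
print(sorted(arrays))
```

Keeping every array in a dict only makes sense if the parsed numbers fit in memory; otherwise each array would be processed and discarded inside the loop.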

This won't help if you just have one super-gigantic species node, because even iterparse (or similar solutions like a SAX parser) parses one entire node at a time. You'd need to find an XML library that lets you stream the text of large nodes, and off the top of my head, I can't think of any stdlib or popular third-party parsers that can do that.

If the file is really large, use ElementTree or SAX.

If the file is not that large (i.e. it fits into memory), minidom might be easier to work with.

Each line seems to be a simple string of comma-separated numbers, so you can simply do line.split(',').
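
A minimal sketch of the minidom plus line.split(',') approach might look like the following. The function name rows_from_species and the small sample document are assumptions for illustration, and the trailing ; is stripped before splitting because float() would not accept it.

```python
from xml.dom import minidom

# Small made-up document in the same shape as the question's data.
SAMPLE = """<root>
<species name="MyHeterotrophEPS" header="locationX,locationY,radius">
1.0,2.0,0.5;
3.0,4.0,0.25;
</species>
</root>"""


def rows_from_species(xml_text):
    """Hypothetical helper: map species name to a list of float rows."""
    dom = minidom.parseString(xml_text)
    result = {}
    for species in dom.getElementsByTagName('species'):
        # Join all text children in case the text is split across nodes
        text = ''.join(child.data for child in species.childNodes
                       if child.nodeType == child.TEXT_NODE)
        rows = []
        for line in text.splitlines():
            line = line.strip().rstrip(';')
            if line:  # skip the blank lines around the data
                rows.append([float(value) for value in line.split(',')])
        result[species.getAttribute('name')] = rows
    return result


rows = rows_from_species(SAMPLE)
print(rows['MyHeterotrophEPS'])
```

The nested lists can be passed to numpy.array() if an ndarray is needed.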
