
Parse large split XML file(s) with Python

I have very large XML log files that auto-split at a fixed size (~200MB). There can be many parts (usually fewer than 10). When a file splits, it doesn't do so neatly at the end of a record, or even at the end of the current line; it simply splits as soon as it hits the target size.

Basically I need to parse these files for 'record' elements and then pull out the time child from each, among other things.

Since these log files split at a random location and don't necessarily have a root, I was using Python 3 and lxml's etree.iterparse with html=True. This handles the lack of a root node caused by the split files. However, I am not sure how to handle the records that end up being split between the end of one file and the start of the next.
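As a minimal sketch of that approach: html=True makes lxml fall back to its forgiving HTML parser, so a rootless fragment still parses. The fragment bytes below are made up for illustration:

```python
from io import BytesIO
from lxml import etree

# A rootless fragment, like the tail of a split log file (made-up bytes).
fragment = (b"<record><data>5</data><time>1</time></record>"
            b"<record><data>5</data><time>2</time></record>")

times = []
# html=True tolerates the missing root element; tag="record" filters events.
for event, elem in etree.iterparse(BytesIO(fragment), events=("end",),
                                   tag="record", html=True):
    times.append(int(elem.findtext("time")))

print(times)  # [1, 2]
```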

Here is a small sample of what a split file might look like.

FILE: test.001.txt

<records>
<record>
    <data>5</data>
    <time>1</time>
</record>
<record>
    <data>5</data>
    <time>2</time>
</record>
<record>
    <data>5</data>
    <ti

FILE: test.002.txt

me>3</time>
</record>
<record>
    <data>6</data>
    <time>4</time>
</record>
<record>
    <data>6</data>
    <time>5</time>
</record>
</records>

Here is what I have tried, which I know doesn't work correctly:

from lxml import etree
xmlFiles      = []
xmlFiles.append('test.001.txt')
xmlFiles.append('test.002.txt')
timeStamps = []
for xmlF in xmlFiles:
    for event, elem in etree.iterparse(xmlF, events=("end",), tag='record', html=True):
        tElem = elem.find('time')
        if tElem is not None:
            timeStamps.append(int(tElem.text))

Output:

In[20] : timeStamps
Out[20]: [1, 2, 4, 5]

So is there an easy way to capture the 3rd record, which is split between files? I don't really want to merge the files ahead of time since there can be lots of them and they are pretty large. Also, any other speed/memory management tips besides those in Using Python Iterparse For Large XML Files would be welcome... I'll figure out how to do that next. The appending to timeStamps seems like it might be problematic since there could be lots of them... but I can't really preallocate since I have no idea how many there are ahead of time.

Sure. Create a class that acts like a file (by providing a read method), but that actually takes input from multiple files while hiding this fact from the caller. Something like:

class Reader(object):
    """File-like object that reads sequentially across several sources."""

    def __init__(self):
        self.files = []

    def add(self, src):
        self.files.append(src)

    def read(self, nbytes=0):
        if not self.files:
            return bytes()

        data = bytes()
        while True:
            # Pull whatever is still needed from the current ("topmost") file.
            data = data + self.files[0].read(nbytes - len(data))
            if len(data) == nbytes:
                break

            # The current file is exhausted: close it and fall through
            # to the next one (if any).
            self.files[0].close()
            self.files.pop(0)
            if not self.files:
                break

        return data

This class maintains a list of open files. If a read request can't be satisfied by the "topmost" file, that file is closed and a read is attempted from the subsequent file. This continues until we read enough bytes or we run out of files.

Given the above, if we do this:

r = Reader()
for path in ['file1.txt', 'file2.txt']:
    r.add(open(path, 'rb'))

for event, elem in etree.iterparse(r):
    print(event, elem.tag)

We get (using your sample input):

end data
end time
end record
end data
end time
end record
end data
end time
end record
end data
end time
end record
end data
end time
end record
end records
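For completeness, here is the whole pipeline stitched together as a self-contained sketch: the Reader from above feeding iterparse, with two BytesIO objects standing in for the real files and holding the same bytes as the sample parts. Since the stitched stream is well-formed XML again, html=True is no longer needed; the elem.clear() call is the usual iterparse memory-management tip for large inputs:

```python
from io import BytesIO
from lxml import etree

class Reader(object):
    """File-like object that reads sequentially across several sources."""
    def __init__(self):
        self.files = []

    def add(self, src):
        self.files.append(src)

    def read(self, nbytes=0):
        if not self.files:
            return bytes()
        data = bytes()
        while True:
            data = data + self.files[0].read(nbytes - len(data))
            if len(data) == nbytes:
                break
            self.files[0].close()
            self.files.pop(0)
            if not self.files:
                break
        return data

# In-memory stand-ins for test.001.txt / test.002.txt; note the split
# lands mid-tag ("<ti" / "me>"), just like the sample above.
part1 = BytesIO(b"<records><record><data>5</data><time>1</time></record>"
                b"<record><data>5</data><time>2</time></record>"
                b"<record><data>5</data><ti")
part2 = BytesIO(b"me>3</time></record>"
                b"<record><data>6</data><time>4</time></record>"
                b"<record><data>6</data><time>5</time></record></records>")

r = Reader()
r.add(part1)
r.add(part2)

timeStamps = []
# The stitched stream is well-formed XML, so html=True is unnecessary here.
for event, elem in etree.iterparse(r, events=("end",), tag="record"):
    timeStamps.append(int(elem.findtext("time")))
    elem.clear()  # release the finished record to keep memory usage flat

print(timeStamps)  # [1, 2, 3, 4, 5]
```

The third record, split across the file boundary, is now recovered because the parser only ever sees one continuous byte stream.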
