Efficiently compare two XML files in Python

I'm trying to find an efficient approach to compare two XML files and handle the differences in a Python script. The scenario is that I have two XML files similar to the following one:

<?xml version="1.0" encoding="UTF-8"?> 
<garage> 
    <car> 
        <color>red</color> 
        <size>big</size> 
        <price>10000</price>
    </car> 
    <car> 
        <color>blue</color> 
        <size>big</size> 
        <price>10000</price>
    </car>

    <!-- [...] -->

    <car>
        <color>red</color>
        <size>big</size>
        <price>11000</price>
    </car>
</garage>

Those XML files contain thousands of small objects and are about 5 MB each. The tricky thing is that only very few entries differ between the two files, and I only need to handle the information that differs. In other words: I need to efficiently (!) find out which entries changed or were added. Unfortunately, the XML files also contain some optional entries that I don't care about at all.

I considered the following solutions:

  1. Parse both files into a DOM tree and compare them in a loop
  2. Parse both files into sets and use operators like set.difference
  3. Try to hand some of the processing over to Linux tools like grep and diff
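
For illustration, option 2 might look roughly like the sketch below. It assumes that each car can be reduced to a tuple of the three tags shown in the sample (the `FIELDS` tuple and both function names are made up for this example) and that duplicate entries do not matter, since a set collapses them.

```python
import xml.etree.ElementTree as ET

# Assumption: only these child tags matter; optional entries are ignored.
FIELDS = ("color", "size", "price")


def car_set(xml_text):
    """Return a set of (color, size, price) tuples, one per <car>."""
    root = ET.fromstring(xml_text)
    return {
        tuple((car.findtext(field) or "").strip() for field in FIELDS)
        for car in root.iter("car")
    }


def changed_or_added(old_xml, new_xml):
    """Entries present in the new document but not in the old one."""
    return car_set(new_xml) - car_set(old_xml)
```

A changed entry shows up here as a new tuple, so telling "changed" apart from "added" would need some kind of identifying key per car, which the sample does not show.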

Does anybody here have experience with the performance of such approaches and can point me in the right direction?

Create a cached intermediate format that contains only the data you care about comparing. When comparing two files, A.xml and B.xml, compare their A.cached and B.cached counterparts instead, generating them if they are missing and removing them when the source file changes (or regenerating them based on timestamps, etc.). The generation cost is amortized over multiple comparisons, and you no longer iterate over unnecessary entries.
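
A minimal sketch of this idea, under a few assumptions: the cache is a JSON file stored next to the XML file with a `.cached` extension, staleness is decided by comparing modification times, and only the `color`, `size` and `price` tags matter. The helper names `load_cached` and `diff_files` are made up for this example.

```python
import json
import os
import xml.etree.ElementTree as ET

FIELDS = ("color", "size", "price")  # assumption: the tags worth comparing


def load_cached(xml_path):
    """Return the reduced entries for xml_path, rebuilding a stale cache."""
    cache_path = os.path.splitext(xml_path)[0] + ".cached"
    fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(xml_path)
    )
    if fresh:
        with open(cache_path, encoding="utf-8") as fh:
            return {tuple(entry) for entry in json.load(fh)}

    entries = set()
    # Stream the file and clear each <car> once it has been reduced,
    # which keeps memory use down on large inputs.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "car":
            entries.add(
                tuple((elem.findtext(field) or "").strip() for field in FIELDS)
            )
            elem.clear()

    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(sorted(entries), fh)
    return entries


def diff_files(a_path, b_path):
    """Entries in B that are not in A, compared via the cached forms."""
    return load_cached(b_path) - load_cached(a_path)
```

Only the first comparison of a given file pays the parsing cost; later calls read the much smaller cache as long as the XML file has not been modified since.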

The format of ".cached" really depends on what you care about and how much information/context you need. It could even be a binary representation.
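
As one possible binary layout (purely an assumption, not something prescribed above), each entry could be reduced to a fixed-size digest, and the cache could simply be the sorted digests concatenated. This is compact and quick to compare, but it only tells you that an entry is new, not what it contains, so it fits only when little context is needed. The names `entry_digest`, `pack` and `unpack` are hypothetical.

```python
import hashlib

DIGEST_SIZE = 16  # bytes per entry in the packed cache


def entry_digest(fields):
    """Hash one entry (a tuple of strings) to a fixed-size digest."""
    joined = "\x1f".join(fields).encode("utf-8")  # unit separator between fields
    return hashlib.blake2b(joined, digest_size=DIGEST_SIZE).digest()


def pack(entries):
    """Serialize entries as sorted, concatenated digests."""
    return b"".join(sorted(entry_digest(entry) for entry in entries))


def unpack(blob):
    """Split a packed cache back into a set of digests."""
    return {blob[i:i + DIGEST_SIZE] for i in range(0, len(blob), DIGEST_SIZE)}
```

Two packed caches can then be compared with `unpack(new_blob) - unpack(old_blob)`; mapping a digest back to its entry would require keeping the original fields somewhere else.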
