Speeding up merging multiple XML files in Python

Question

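I am merging multiple XML files using an XSL file. There are about 100 files, each with 4000 nodes. The sample XML and XSL are available in this SO question.

My xmlmerge.py is as follows: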

import lxml.etree as ET
import argparse
import os
ap = argparse.ArgumentParser()
ap.add_argument("-x", "--xmlreffile", required=True, help="Path to list of xmls")
ap.add_argument("-s", "--xslfile", required=True, help="Path to the xslfile")
args = vars(ap.parse_args())    
dom = ET.parse(args["xmlreffile"])
xslt = ET.parse(args["xslfile"])
transform = ET.XSLT(xslt)
newdom = transform(dom)
print(ET.tostring(newdom, pretty_print=True))

I want to write the Python output to an XML file. I run the Python script like this:

python xmlmerge.py --xmlreffile ~/Documents/listofxmls.xml --xslfile ~/Documents/xslfile.xsl

When I print the output to the console, it takes about 120 minutes for 100 files. If I instead try to save the same output to an XML file:

python xmlmerge.py --xmlreffile ~/Documents/listofxmls.xml --xslfile ~/Documents/xslfile.xsl >> ~/Documents/mergedxml.xml

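This has been running for about 3 days and the merge has still not finished. I was not sure whether the machine had hung, so I tried it on another machine with only 8 files; that took more than 4 hours and the merge still had not completed. I don't understand why writing to a file takes so much longer than printing to the console. Can someone guide me?

I am using Ubuntu 14.04 and Python 2.7.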

Answer 1

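Why not make a multiprocessing version of your script? There are several ways to do it, but as far as I understand your problem, this is what I would do: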

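import lxml.etree as ET
from multiprocessing.dummy import Pool as ThreadPool

# Assuming this gives you a list of files, one path per line (adapt if necessary)
with open("listofxmls.xml", "r") as f:
    xml_files = [line.strip() for line in f if line.strip()]

def yourFunction(xml):
    # Steps of your parse function, taken here from xmlmerge.py in the question.
    # The XSLT object is built inside the function so each call has its own.
    transform = ET.XSLT(ET.parse("xslfile.xsl"))
    newdom = transform(ET.parse(xml))
    return ET.tostring(newdom, pretty_print=True)

pool = ThreadPool(4)  # number of threads (adapt depending on the task and your CPU)
mergedXML = pool.map(yourFunction, xml_files)  # run in parallel; results keep input order
pool.close()
pool.join()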
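
To try the pattern without any files on disk, here is a minimal self-contained sketch. It swaps lxml and the XSLT step for the standard library's xml.etree.ElementTree, and the sample documents, `extract_nodes`, and the `merged` root element are hypothetical names made up for illustration. Note that multiprocessing.dummy provides a thread-backed Pool with the same API as multiprocessing.Pool, so whether it actually speeds up your merge is something you would have to measure.

```python
import xml.etree.ElementTree as ET
from multiprocessing.dummy import Pool as ThreadPool

# Hypothetical in-memory documents standing in for the XML files on disk.
documents = [
    "<root><node id='a1'/><node id='a2'/></root>",
    "<root><node id='b1'/></root>",
    "<root><node id='c1'/><node id='c2'/></root>",
]


def extract_nodes(xml_text):
    """Parse one document and return its top-level child elements."""
    return list(ET.fromstring(xml_text))


pool = ThreadPool(2)  # thread-backed pool; adapt the size to your machine
try:
    # map() returns one result per input, in the same order as the inputs.
    per_document = pool.map(extract_nodes, documents)
finally:
    pool.close()
    pool.join()

# Combine the per-document results under a single root element.
merged_root = ET.Element("merged")
for nodes in per_document:
    merged_root.extend(nodes)

merged_ids = [node.get("id") for node in merged_root]
```

Because `pool.map` preserves input order, the merged nodes appear in the same order as the source documents regardless of which thread finishes first.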

Then save mergedXML however you need.

Hope it helps, or at least points you in the right direction.
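
One possible way to do the saving, sketched under assumptions: the serialized results are byte strings (as ET.tostring returns), and the file is written from Python rather than through the shell's `>>`, which appends to an existing file instead of replacing it. `save_chunks`, the sample chunks, and the temporary output path are hypothetical names for illustration; nothing here establishes that this changes the run time you are seeing.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write each serialized chunk to path in order; return the bytes written."""
    total = 0
    # "wb" replaces any existing file, unlike the shell's ">>" which appends.
    with open(path, "wb") as handle:
        for chunk in chunks:
            handle.write(chunk)
            total += len(chunk)
    return total


# Hypothetical stand-in for the list of serialized results from pool.map.
merged_chunks = [b"<a/>\n", b"<b/>\n"]
out_path = os.path.join(tempfile.mkdtemp(), "mergedxml.xml")
bytes_written = save_chunks(merged_chunks, out_path)
```

Keep in mind that concatenating several serialized documents this way does not give a single well-formed XML document with one root; if you need that, combine the elements under one root before serializing.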

Answer 1
0 2017-12-28 13:19:58
