How to speed up the transformation of large XML files with XSLT using Python lxml

Below is my source code:

from lxml import etree as ET

# Both documents are parsed fully into memory
tree = ET.parse("test.xml")
xslt = ET.parse("test.xsl")
transform = ET.XSLT(xslt)

print("before transform")
newTree = transform(tree)
print("after transform")
print(str(newTree))

When test.xml is small, the script works well. When test.xml is big (more than 100 MB, or in the GB range), the script runs for a long time.

I found that the bottleneck is "newTree = transform(tree)".

Are there any other methods to transform XML files with XSLT in Python lxml?

If you found that the bottleneck is

newTree = transform(tree)

then your question is not about how to speed up parsing XML. The parsing is done beforehand; the documents are read into memory (as an ElementTree-like structure) here:

tree = ET.parse("test.xml")
xslt = ET.parse("test.xsl")
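To check where the time actually goes, each stage can be timed separately. The following is only a minimal sketch: the in-memory XML and stylesheet are made-up stand-ins for test.xml and test.xsl, and the timed helper is a hypothetical name introduced for illustration.

```python
import io
import time
from lxml import etree as ET

# Hypothetical in-memory stand-ins for test.xml and test.xsl
XML = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(1000)) + b"</root>"
XSL = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/root">
    <out><xsl:for-each select="item"><v><xsl:value-of select="."/></v></xsl:for-each></out>
  </xsl:template>
</xsl:stylesheet>"""

timings = {}

def timed(label, func, *args):
    """Call func(*args) and record the elapsed wall-clock time under label."""
    start = time.perf_counter()
    result = func(*args)
    timings[label] = time.perf_counter() - start
    return result

tree = timed("parse", ET.parse, io.BytesIO(XML))
transform = timed("compile", ET.XSLT, ET.parse(io.BytesIO(XSL)))
new_tree = timed("transform", transform, tree)
text = timed("serialize", str, new_tree)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f}s")
```

With real files, replacing the io.BytesIO objects with file paths should show whether parsing, the transformation, or serialization dominates.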

So, perhaps you meant to ask:

Can I speed up the transformation of large input files?

The answer depends on the kind of operations present in your code. lxml is not a Swiss army knife (and neither is any other piece of software, for that matter). There are operations where lxml is virtually unbeatable and others where it is clearly outperformed by similar libraries like cElementTree.

For example, tree traversal (think of it as changing the context node) is said to be very fast, whereas generating new elements is costly when compared to cElementTree (cET). Considering parsing, serialization and the size of documents:

whenever the input documents are not considerably bigger than the output, lxml is the clear winner.

This is taken from here, where you will find a great deal of information on the subject.

If by "transformation" you mean "applying XSLT stylesheets", the considerations above will not be of much use. lxml uses libxslt for this - which is a library in its own right.


Are there any other methods to parse xml files with xsl in python lxml?

There are other libraries like cElementTree. However, I have only used it to handle XML input, and it would probably be cumbersome to apply XSLT stylesheets with it.

But before you jump to conclusions you should identify the operations that are present in your stylesheet, compare input and output sizes and study lxml performance or the performance of your stylesheet.
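One way to study the performance of a stylesheet is the profiling support that lxml exposes from libxslt: passing profile_run=True to the transform call makes the result carry an xslt_profile tree with per-template statistics. The sketch below uses a tiny made-up document and stylesheet purely for illustration; the numbers only become meaningful with your real input.

```python
from lxml import etree as ET

# Hypothetical stylesheet with two templates, so the profile has more than one entry
XSL = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/root">
    <out><xsl:apply-templates select="item"/></out>
  </xsl:template>
  <xsl:template match="item">
    <v><xsl:value-of select="."/></v>
  </xsl:template>
</xsl:stylesheet>"""

doc = ET.XML(b"<root><item>1</item><item>2</item></root>")
transform = ET.XSLT(ET.XML(XSL))

# profile_run=True asks libxslt to collect per-template call counts and timings
result = transform(doc, profile_run=True)
profile = result.xslt_profile

if profile is not None:
    print(ET.tostring(profile, pretty_print=True).decode())
```

Templates that account for most of the calls or most of the time are the first candidates for rewriting.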

You should be aware that an XML file of 1 GB is extremely large and I would not expect it to be parsed or transformed smoothly anywhere.

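I have found a way to improve the performance of transforming XML files with XSLT: parse the input incrementally with iterparse and transform each ContentItem element on its own, instead of transforming the whole tree at once.

import re
from lxml import etree as ET

xml_file = "test.xml"  # assumed: the large input file from the question
xsl_file = "test.xsl"  # assumed: the stylesheet from the question

results = []
xslt = ET.parse(xsl_file)
transform = ET.XSLT(xslt)

# iterparse yields (event, element) pairs as each element finishes parsing
for event, elem in ET.iterparse(xml_file):
    if re.search("ContentItem", elem.tag):
        # transform only this subtree, not the whole document
        newElem = transform(elem)
        results.append(str(newElem))

print("".join(results))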
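A possible refinement of this approach is to free each element once it has been transformed, so that memory use stays roughly constant for very large inputs. The sketch below is only an illustration under a few assumptions: the sample XML and stylesheet are made up, transform_chunks is a hypothetical helper name, ContentItem elements are assumed not to be nested inside each other, and the stylesheet is assumed to need nothing outside the current ContentItem.

```python
import io
from lxml import etree as ET

# Hypothetical sample data standing in for the large input and the stylesheet
XML = b"<Feed>" + b"".join(
    b"<ContentItem><Title>t%d</Title></ContentItem>" % i for i in range(3)
) + b"</Feed>"
XSL = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes"/>
  <xsl:template match="ContentItem">
    <item><xsl:value-of select="Title"/></item>
  </xsl:template>
</xsl:stylesheet>"""

def transform_chunks(source, transform, tag="{*}ContentItem"):
    """Yield the transformed text of each matching element, then free it."""
    # tag="{*}ContentItem" matches ContentItem in any namespace or in none
    for _event, elem in ET.iterparse(source, events=("end",), tag=tag):
        yield str(transform(elem))
        # Drop the element's content and any already-processed siblings
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

transform = ET.XSLT(ET.XML(XSL))
chunks = list(transform_chunks(io.BytesIO(XML), transform))
print("".join(chunks))
```

For real data, writing each chunk to an output file as it is produced, rather than collecting everything in a list, would avoid holding the full result in memory.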
