
Sorting the xml file based on xml text attribute

I have an XML file whose elements appear in random order. I need to compare such files, but because the order of the elements differs between them, the comparison requires manual effort.

I am looking for a way to sort these files. Can someone give me some pointers or an approach to this problem? I read the lxml documentation (the ElementTree and Element classes), but there doesn't seem to be a method for sorting child elements by their text.

I can sort the attribute elements by Name, but within each attribute element, how can the children of the legal element be sorted?

Input:

<root>
    <attribute Name="attr2">
            <v>
              <cstat>
                <s>nObjDef2</s>
                <s>nObjDef1</s>
              </cstat>
            </v>
            <objects>
              <legal>
                <o>otype2</o>
                <o>otype1</o>
              </legal>
            </objects>
    </attribute>
    <attribute Name="attr1">
            <v>
              <cstat>
                <s>nObjDef2</s>
                <s>nObjDef1</s>
              </cstat>
            </v>
            <objects>
              <legal>
                <o>otype2</o>
                <o>otype1</o>
              </legal>
            </objects>
    </attribute>
</root>

Expected output:

<root>
    <attribute Name="attr1">
            <v>
              <cstat>
                <s>nObjDef1</s>
                <s>nObjDef2</s>
              </cstat>
            </v>
            <objects>
              <legal>
                <o>otype1</o>
                <o>otype2</o>
              </legal>
            </objects>
    </attribute>
    <attribute Name="attr2">
            <v>
              <cstat>
                <s>nObjDef1</s>
                <s>nObjDef2</s>
              </cstat>
            </v>
            <objects>
              <legal>
                <o>otype1</o>
                <o>otype2</o>
              </legal>
            </objects>
    </attribute>
</root> 

If you want to sort the children by their text, find the legal nodes and sort their children using each child's text as the key. Given the input as a string:

x = """<root>
    <attribute Name="attr2">
            <v>
              <cstat>
                <s>nObjDef2</s>
                <s>nObjDef1</s>
              </cstat>
            </v>
            <objects>
              <legal>
                <o>otype2</o>
                <o>otype1</o>
              </legal>
            </objects>
    </attribute>
    <attribute Name="attr1">
            <v>
              <cstat>
                <s>nObjDef2</s>
                <s>nObjDef1</s>
              </cstat>
            </v>
            <objects>
              <legal>
                <o>otype2</o>
                <o>otype1</o>
              </legal>
            </objects>
    </attribute>
</root>
"""

Then, to sort the children of each legal node:

from lxml import etree

xml = etree.fromstring(x)

for node in xml.xpath("//legal"):
    node[:] = sorted(node, key=lambda ch: ch.text)

That will reorder the children:

print(etree.tostring(xml, pretty_print=True).decode("utf-8"))

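Giving you the following. Only the children of legal are reordered; the cstat children and the attribute elements stay as they were. The uneven indentation appears because each element keeps its trailing whitespace when it is moved: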

<root>
    <attribute Name="attr2">
            <v>
              <cstat>
                <s>nObjDef2</s>
                <s>nObjDef1</s>
              </cstat>
            </v>
            <objects>
              <legal>
                <o>otype1</o>
              <o>otype2</o>
                </legal>
            </objects>
    </attribute>
    <attribute Name="attr1">
            <v>
              <cstat>
                <s>nObjDef2</s>
                <s>nObjDef1</s>
              </cstat>
            </v>
            <objects>
              <legal>
                <o>otype1</o>
              <o>otype2</o>
                </legal>
            </objects>
    </attribute>
</root>
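
The misaligned <o> lines in the output above come from lxml moving each element together with its tail (the whitespace after its closing tag). One possible cleanup, assuming lxml 4.5 or newer where etree.indent is available, is to re-indent the tree after sorting. The small input below is a made-up stand-in for the question's data:

```python
from lxml import etree

# Hypothetical, trimmed-down input with the same shape as the question's data.
doc = etree.fromstring(
    "<root>\n"
    "  <legal>\n"
    "    <o>otype2</o>\n"
    "    <o>otype1</o>\n"
    "  </legal>\n"
    "</root>"
)

# Same sort as above: reorder the children of every <legal> by their text.
for node in doc.xpath("//legal"):
    node[:] = sorted(node, key=lambda ch: ch.text)

# Rewrite the whitespace-only text/tail values so the children line up again.
# Assumption: lxml >= 4.5, which added etree.indent.
etree.indent(doc, space="  ")

result = etree.tostring(doc).decode("utf-8")
print(result)
```

This only touches whitespace between elements, so it should be safe for data-style XML like this, but it would alter documents where that whitespace is significant.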

Alternatively, use operator.attrgetter in place of the lambda, which is usually slightly faster because it avoids a Python-level function call per element:

from lxml import etree
from operator import attrgetter
xml = etree.fromstring(x)

for node in xml.xpath("//legal"):
    node[:] = sorted(node, key=attrgetter("text"))
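
The snippets above sort only the legal children, whereas the expected output in the question also has the cstat children and the attribute elements in order. A minimal sketch of how the same idea might be extended follows. The compact sample string and the variable names are illustrative assumptions, and the `or ""` fallback is added because an empty element has text None, which cannot be compared with strings in Python 3:

```python
from lxml import etree

# Illustrative sample with the same structure as the question's input.
sample = """<root>
  <attribute Name="attr2">
    <v><cstat><s>nObjDef2</s><s>nObjDef1</s></cstat></v>
    <objects><legal><o>otype2</o><o>otype1</o></legal></objects>
  </attribute>
  <attribute Name="attr1">
    <v><cstat><s>nObjDef2</s><s>nObjDef1</s></cstat></v>
    <objects><legal><o>otype2</o><o>otype1</o></legal></objects>
  </attribute>
</root>"""

root = etree.fromstring(sample)

# Sort the children of every <cstat> and <legal> by text.
# `or ""` guards against empty elements, whose text is None.
for node in root.xpath("//cstat | //legal"):
    node[:] = sorted(node, key=lambda ch: ch.text or "")

# Sort the top-level <attribute> elements by their Name attribute.
root[:] = sorted(root, key=lambda el: el.get("Name", ""))

names = [el.get("Name") for el in root]
first_s = [s.text for s in root[0].iter("s")]
first_o = [o.text for o in root[0].iter("o")]
print(names, first_s, first_o)
```

Once two files have been normalized this way, their serialized forms can be compared with an ordinary text diff.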

Consider XSLT, the special-purpose language designed to transform XML documents. Python's lxml can run XSLT 1.0 stylesheets. In particular, XSLT provides the <xsl:sort> element, which can be used inside templates:

import lxml.etree as et

# LOAD XML (FROM FILE) AND XSL (FROM STRING)
xml = et.parse('Input.xml')

xslstr = '''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <!-- Identity Transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>  

  <!-- Sort Children Text of Nodes -->
  <xsl:template match="cstat|legal">
    <xsl:copy>
      <xsl:apply-templates select="*">
        <xsl:sort select="." order="ascending" data-type="text"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>

</xsl:transform>'''

xslt = et.fromstring(xslstr)

# TRANSFORM SOURCE TO NEW TREE
transform = et.XSLT(xslt)
newdom = transform(xml)
print(newdom)

# OUTPUT TO FILE
tree_out = et.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)

with open('Output.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)
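
The stylesheet above sorts the children of cstat and legal but leaves the attribute elements in their original order. As a sketch, one extra template could sort them by @Name as well. The in-memory sample and the assumption that the document element is named root are for illustration only:

```python
from lxml import etree

xsl = etree.fromstring("""<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:strip-space elements="*"/>

  <!-- Identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumed addition: sort <attribute> children of <root> by @Name.
       Note that other children of <root>, if any, are not copied. -->
  <xsl:template match="root">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="attribute">
        <xsl:sort select="@Name"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>

  <!-- Sort children of <cstat> and <legal> by text -->
  <xsl:template match="cstat|legal">
    <xsl:copy>
      <xsl:apply-templates select="*">
        <xsl:sort select="."/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:transform>""")

# Illustrative input with the same structure as the question's data.
source = etree.fromstring("""<root>
  <attribute Name="attr2">
    <v><cstat><s>nObjDef2</s><s>nObjDef1</s></cstat></v>
    <objects><legal><o>otype2</o><o>otype1</o></legal></objects>
  </attribute>
  <attribute Name="attr1">
    <v><cstat><s>nObjDef2</s><s>nObjDef1</s></cstat></v>
    <objects><legal><o>otype2</o><o>otype1</o></legal></objects>
  </attribute>
</root>""")

transform = etree.XSLT(xsl)
sorted_root = transform(source).getroot()

attr_names = [a.get("Name") for a in sorted_root]
s_texts = [s.text for s in sorted_root[0].iter("s")]
print(attr_names, s_texts)
```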
