lxml (or lxml.html): print tree structure

I'd like to print the tree structure of an etree (built from an HTML document) in a way that distinguishes trees: two etrees with different structures should print differently.

By structure I mean the "shape" of the tree, that is, all the tags but no attributes and no text content.

Any idea? Is there something in lxml to do that?

If not, I guess I have to iterate through the whole tree and build a string from it. Any idea how to represent the tree compactly? (Compactness is the less important requirement.)

FYI, the output is not meant to be read by a person. It will be stored and hashed so that I can tell several HTML templates apart.

Thanks

Maybe just run some XSLT over the source XML to strip everything but the tags. It's then easy enough to use etree.tostring to get a string you can hash:

from lxml import etree as ET

def pp(e):
    # encoding="unicode" makes tostring return str instead of bytes (Python 3)
    print(ET.tostring(e, pretty_print=True, encoding="unicode"))
    print()

root = ET.XML("""\
<project id="8dce5d94-4273-47ef-8d1b-0c7882f91caa" kpf_version="4">
<livefolder id="8744bc67-1b9e-443d-ba9f-96e1d0007ba8" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8">Mooo</livefolder>
<livefolder id="8744bc67-1b9e-443d-ba9f" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8" />
<preference-set idref="8dce5d94-4273-47ef-8d1b-0c7882f91caa">
  <boolean id="import_live">0</boolean>
</preference-set>
</project>
""")
pp(root)


xslt = ET.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
""")
tr = ET.XSLT(xslt)

doc2 = tr(root)
root2 = doc2.getroot()
pp(root2)

Gives you the output:

<project id="8dce5d94-4273-47ef-8d1b-0c7882f91caa" kpf_version="4">
  <livefolder id="8744bc67-1b9e-443d-ba9f-96e1d0007ba8" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8">Mooo</livefolder>
  <livefolder id="8744bc67-1b9e-443d-ba9f" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8"/>
  <preference-set idref="8dce5d94-4273-47ef-8d1b-0c7882f91caa">
    <boolean id="import_live">0</boolean>
  </preference-set>
</project>

<project>
  <livefolder/>
  <livefolder/>
  <preference-set>
    <boolean/>
  </preference-set>
</project>
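
If you would rather skip XSLT, the "iterate through the whole tree and build a string" idea from the question can be sketched roughly as below. This is only one possible encoding, not something lxml provides: the helper names `shape` and `shape_hash` and the `tag(child,child)` notation are made up for illustration, and SHA-1 is an arbitrary choice of digest. It assumes tags without namespaces (as with typical HTML) and uses recursion, so a very deeply nested document could hit Python's recursion limit.

```python
import hashlib
from lxml import etree as ET


def shape(element):
    """Return a compact tags-only string for the subtree, e.g. 'a(b,c(d))'.

    Hypothetical helper: attributes, text and tails are ignored.
    """
    # Comments and processing instructions have a non-string .tag; skip them.
    children = [shape(child) for child in element if isinstance(child.tag, str)]
    if children:
        return "%s(%s)" % (element.tag, ",".join(children))
    return element.tag


def shape_hash(element):
    """Hash the shape string so it can be stored and compared cheaply."""
    return hashlib.sha1(shape(element).encode("utf-8")).hexdigest()


a = ET.XML('<project id="1"><livefolder>Mooo</livefolder><livefolder/>'
           '<preference-set><boolean>0</boolean></preference-set></project>')
# Same shape as `a`, different attributes and text.
b = ET.XML('<project><livefolder/><livefolder x="y"/>'
           '<preference-set><boolean/></preference-set></project>')
# One <livefolder> fewer, so a different shape.
c = ET.XML('<project><livefolder/>'
           '<preference-set><boolean/></preference-set></project>')

print(shape(a))                        # project(livefolder,livefolder,preference-set(boolean))
print(shape_hash(a) == shape_hash(b))  # True
print(shape_hash(a) == shape_hash(c))  # False
```

Because XML names cannot contain parentheses or commas, the string is unambiguous for un-namespaced tags. The same functions should work on a root obtained from lxml.html, since those elements support the same iteration and `.tag` access.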
