I'm trying to write a small function that wraps the implicit sections of an HTML document in section tags. I'm trying to do so with lxml.etree.
Let's say my input is:
<html>
<head></head>
<body>
<h1>title</h1>
<p>some text</p>
<h1>title</h1>
<p>some text</p>
</body>
</html>
I'd like to end up with:
<html>
<head></head>
<body>
<section>
<h1>title</h1>
<p>some text</p>
</section>
<section>
<h1>title</h1>
<p>some text</p>
</section>
</body>
</html>
Here is what I have at the moment:
def outline(tree):
    pattern = re.compile(r'^h(\d)')
    section = None
    for child in tree.iterchildren():
        tag = child.tag
        if tag is lxml.etree.Comment:
            continue
        match = pattern.match(tag.lower())
        # If a header tag is found
        if match:
            depth = int(match.group(1))
            if section is not None:
                child.addprevious(section)
            section = lxml.etree.Element('section')
            section.append(child)
        else:
            if section is not None:
                section.append(child)
            else:
                pass
        if child is not None:
            outline(child)
which I call like this:
outline(tree.find('body'))
But it doesn't work with subheadings at the moment, e.g.:
<section>
  <h1>ONE</h1>
  <section>
    <h3>TOO Deep</h3>
  </section>
  <section>
    <h2>Level 2</h2>
  </section>
</section>
<section>
  <h1>TWO</h1>
</section>
Thanks
When it comes to transforming XML, XSLT is the best way to go; see the lxml and XSLT docs.
This is only a pointer, as requested. Let me know if you need further help writing that XSLT.
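As a rough illustration of the XSLT route, here is a minimal sketch that handles only the flat case from the first example: each h1 and the elements that follow it, up to the next h1, are wrapped in a section. It does not deal with nested subheadings. The stylesheet, the key name `by-h1` and the helper `wrap_h1_sections` are made up for this example; `lxml.etree.XSLT` is the lxml class that compiles and applies a stylesheet.

```python
from lxml import etree

# Hypothetical XSLT 1.0 stylesheet: groups body children under the
# nearest preceding h1 (flat sections only, no nesting by heading level).
XSLT_SOURCE = """
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Index every non-h1 child of body by the id of its nearest preceding h1 -->
  <xsl:key name="by-h1" match="body/*[not(self::h1)]"
           use="generate-id(preceding-sibling::h1[1])"/>

  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <xsl:template match="body">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- Elements that appear before the first h1 stay unwrapped -->
      <xsl:apply-templates select="*[not(self::h1)][not(preceding-sibling::h1)]"/>
      <xsl:for-each select="h1">
        <section>
          <xsl:copy-of select="."/>
          <xsl:copy-of select="key('by-h1', generate-id())"/>
        </section>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""


def wrap_h1_sections(doc):
    """Apply the sketch stylesheet above and return the result tree."""
    transform = etree.XSLT(etree.XML(XSLT_SOURCE))
    return transform(doc)


doc = etree.fromstring(
    "<html><head/><body>"
    "<h1>title</h1><p>some text</p>"
    "<h1>title</h1><p>some text</p>"
    "</body></html>"
)
result = wrap_h1_sections(doc)
print(etree.tostring(result.getroot(), pretty_print=True).decode())
```

Extending this to nested h2/h3 levels would need one grouping step per heading level, which is where the XSLT gets noticeably longer.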
Here is the code I ended up with, for the record:
import re
import lxml.etree

def outline(tree, level=0):
    pattern = re.compile(r'^h(\d)')
    last_depth = None
    sections = []  # list of [header, <section />] pairs
    for child in tree.iterchildren():
        tag = child.tag
        if tag is lxml.etree.Comment:
            continue
        match = pattern.match(tag.lower())
        #print("%s%s" % (level * ' ', child))
        if match:
            depth = int(match.group(1))
            # Check for None first: comparing an int with None fails on Python 3
            if last_depth is None or depth <= last_depth:
                #print("%ssection %d" % (level * ' ', depth))
                last_depth = depth
                sections.append([child, lxml.etree.Element('section')])
                continue
        # Anything else, including deeper headers, goes into the current section
        if sections:
            sections[-1][1].append(child)
    for section in sections:
        # Recurse first so that nested headers get their own sections
        outline(section[1], level=((level + 1) * 4))
        # Put the section where the header was, then move the header into it
        section[0].addprevious(section[1])
        section[1].insert(0, section[0])
Works pretty well for me
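To check the nested case from the question, a small self-contained run might look like the sketch below. It repeats the function (without the debug prints and the `level` argument they used) so that the snippet runs on its own. The sample markup and the name `SAMPLE` are made up for illustration, and the order of the depth test assumes Python 3.

```python
import re
import lxml.etree

HEADER = re.compile(r'^h(\d)')


def outline(tree):
    """Wrap each header and the content that follows it in a <section>."""
    last_depth = None
    sections = []  # list of [header, <section />] pairs
    for child in tree.iterchildren():
        tag = child.tag
        if tag is lxml.etree.Comment:
            continue
        match = HEADER.match(tag.lower())
        if match:
            depth = int(match.group(1))
            if last_depth is None or depth <= last_depth:
                last_depth = depth
                sections.append([child, lxml.etree.Element('section')])
                continue
        if sections:
            # Moves the child out of `tree` and into the current section
            sections[-1][1].append(child)
    for header, section in sections:
        outline(section)             # nest deeper headers first
        header.addprevious(section)  # section takes the header's place
        section.insert(0, header)    # header becomes the first child


# Hypothetical sample input mirroring the headings in the question
SAMPLE = (
    "<html><head/><body>"
    "<h1>ONE</h1><h3>TOO Deep</h3><h2>Level 2</h2>"
    "<h1>TWO</h1>"
    "</body></html>"
)

root = lxml.etree.fromstring(SAMPLE)
body = root.find('body')
outline(body)

print([child.tag for child in body])     # ['section', 'section']
print([child.tag for child in body[0]])  # ['h1', 'section', 'section']
print(lxml.etree.tostring(root, pretty_print=True).decode())
```

Note that the h3 and the h2 end up as sibling sections under the first h1, because a new section is opened whenever a header is at the same level as or shallower than the previous one in the same container.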