
Wrap implicit sections of an HTML document in section tags using lxml.etree

I'm trying to write a small function that wraps the implicit sections of an HTML document in section tags, using lxml.etree.

Let's say my input is:

<html>
    <head></head>
    <body>
        <h1>title</h1>
        <p>some text</p>
        <h1>title</h1>
        <p>some text</p>
    </body>
</html>

I'd like to end up with:

<html>
    <head></head>
    <body>
        <section>
            <h1>title</h1>
            <p>some text</p>
        </section>
        <section>
            <h1>title</h1>
            <p>some text</p>
        </section>
    </body>
</html>

Here is what I have at the moment:

import re

import lxml.etree


def outline(tree):
    pattern = re.compile(r'^h(\d)')
    section = None

    for child in tree.iterchildren():
        tag = child.tag

        if tag is lxml.etree.Comment:
            continue

        match = pattern.match(tag.lower())

        # If a header tag is found
        if match:
            depth = int(match.group(1))

            if section is not None:
                child.addprevious(section)

            section = lxml.etree.Element('section')
            section.append(child)

        else:
            if section is not None:
                section.append(child)
            else:
                pass

        if child is not None:
            outline(child)

which I call like this:

outline(tree.find('body'))

But it doesn't work with subheadings yet, e.g. it can't produce nested sections like these:

<section>
    <h1>ONE</h1>
    <section>
        <h3>TOO Deep</h3>
    </section>
    <section>
        <h2>Level 2</h2>
    </section>
</section>
<section>
    <h1>TWO</h1>
</section>

Thanks

When it comes to transforming XML, XSLT is the best way to go; see the lxml and XSLT docs.

This is only a direction, as requested. Let me know if you need further help writing that XSLT.

Here is the code I ended up with, for the record:

import re

import lxml.etree


def outline(tree, level=0):
    pattern = re.compile(r'^h(\d)')
    last_depth = None
    sections = []  # list of [heading element, new <section> element] pairs

    for child in tree.iterchildren():
        tag = child.tag

        if tag is lxml.etree.Comment:
            continue

        match = pattern.match(tag.lower())
        #print("%s%s" % (level * ' ', child))

        if match:
            depth = int(match.group(1))

            # Check for None first: comparing an int with None raises TypeError on Python 3
            if last_depth is None or depth <= last_depth:
                #print("%ssection %d" % (level * ' ', depth))
                last_depth = depth

                sections.append([child, lxml.etree.Element('section')])
                continue

        if sections:
            sections[-1][1].append(child)

    for section in sections:
        outline(section[1], level=((level + 1) * 4))
        section[0].addprevious(section[1])
        section[1].insert(0, section[0])

It works pretty well for me.
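
For comparison, the same nesting can probably be built in a single pass, without recursion, by keeping a stack of the sections that are still open. The sketch below is not the code above but a different technique: the name wrap_sections and the (depth, container) stack layout are assumptions made for illustration, and it only uses calls shared by lxml.etree and the standard library's xml.etree.ElementTree, so it falls back to the latter when lxml is not installed.

```python
import re

try:
    from lxml import etree
except ImportError:  # fallback: the standard library offers the same subset of the API
    import xml.etree.ElementTree as etree

HEADING = re.compile(r"^h([1-6])$", re.IGNORECASE)


def wrap_sections(parent):
    """Regroup the direct children of `parent` into nested <section> elements."""
    children = list(parent)
    # Detach everything first so each child can be re-attached exactly once.
    for child in children:
        parent.remove(child)

    # Stack of (heading depth, container); depth 0 stands for `parent` itself.
    stack = [(0, parent)]
    for child in children:
        tag = child.tag
        # Comments and processing instructions have a non-string tag in lxml.
        match = HEADING.match(tag) if isinstance(tag, str) else None
        if match:
            depth = int(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= depth:
                stack.pop()
            section = etree.Element("section")
            stack[-1][1].append(section)
            stack.append((depth, section))
        # The heading (or any other node) goes into the innermost open container.
        stack[-1][1].append(child)


body = etree.fromstring(
    "<body><h1>ONE</h1><h3>TOO Deep</h3><h2>Level 2</h2><h1>TWO</h1><p>text</p></body>"
)
wrap_sections(body)
print(etree.tostring(body, encoding="unicode"))
```

On a real document it would be called the same way as outline, e.g. wrap_sections(tree.find('body')). Only the direct children of the element passed in are regrouped, and content that appears before the first heading stays directly under that element.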

