Postorder depth-first search (DFS) of HTML using Python, lxml, and etree

This is not a DFS algorithm question or a library-suggestion question. It is specifically about lxml.etree (v4). I use Python 3.9.

This library, lxml.etree, provides a way to iterate over the ElementTree into which the HTML is parsed. The iterator is depth-first but preorder (using the term from the Wikipedia article on DFS): elements are yielded in the order they are first visited, so a parent comes before its children. My question is: what is the easiest way to implement postorder iteration, where each element is yielded only after all of its descendants?

Here is a minimal example demonstrating that the default order of iter() is preorder. I created a dummy function, so the second test obviously fails. I need an implementation of _iter_postorder that makes the assertion hold.

import unittest
from typing import List
from xml.etree.ElementTree import ElementTree

from lxml import etree

HTML1 = """
        <div class="c1">
        <span class="c11">11</span>
        <span class="c12">12</span>
        </div>
        """


def _iter_postorder(tree: ElementTree) -> List[str]:
    # Placeholder: should return the class attributes in postorder.
    return []

class EtreeElementTests(unittest.TestCase):

    def test_dfs_preordering(self):
        """ regular iter() is dfs preordering"""
        root = etree.HTML(HTML1, etree.XMLParser())
        tree = ElementTree(root)
        result = [el.attrib['class'] for el in tree.iter()]
        self.assertListEqual(result, ["c1", "c11", "c12"])

    def test_dfs_postordering(self):
        root = etree.HTML(HTML1, etree.XMLParser())
        tree = ElementTree(root) 
        result = _iter_postorder(tree)
        self.assertListEqual(result, ["c11", "c12", "c1"])

I ended up coding it myself, using recursion and element.iterchildren(), but I am disappointed that there is nothing out of the box. Here is the solution:

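import unittest
from typing import Iterator, List, Optional
from xml.etree.ElementTree import ElementTree

from lxml import etree

HTML1 = """
        <div class="c1">
        <span class="c11">11</span>
        <span class="c12">12</span>
        </div>
        """


def iter_postorder(el: etree._Element,
                   result: Optional[List[etree._Element]] = None) -> List[etree._Element]:
    # Create a fresh list per top-level call; a mutable default ([]) would be
    # shared between calls and keep accumulating elements.
    if result is None:
        result = []
    for child in el.iterchildren():
        iter_postorder(child, result)
    result.append(el)  # the element is added after all of its descendants
    return result


def iter_postorder2(el: etree._Element) -> Iterator[etree._Element]:
    # Generator variant: lazily yields descendants first, then the element.
    for child in el.iterchildren():
        yield from iter_postorder2(child)
    yield el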


class EtreeElementTests(unittest.TestCase):

    def test_dfs_preordering(self):
        """ regular iter() is dfs preordering"""
        root = etree.HTML(HTML1, etree.XMLParser())
        tree = ElementTree(root)
        result = [el.attrib['class'] for el in tree.iter()]
        self.assertListEqual(result, ["c1", "c11", "c12"])

    def test_dfs_postorder(self):
        root = etree.HTML(HTML1, etree.XMLParser())
        result = [el.attrib['class'] for el in iter_postorder(root)]
        self.assertListEqual(result, ["c11", "c12", "c1"])

    def test_dfs_postorder_generator(self):
        root = etree.HTML(HTML1, etree.XMLParser())
        result = [el.attrib['class'] for el in iter_postorder2(root)]
        self.assertListEqual(result, ["c11", "c12", "c1"])


if __name__ == '__main__':
    unittest.main()
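
Two non-recursive alternatives may be worth considering; the following is a minimal sketch, not a definitive implementation. The first relies on lxml's etree.iterwalk(), which, as far as I understand it, emits an "end" event for an element only after all of its descendants have been processed, so listening for "end" events alone should give postorder. The second uses an explicit stack, which avoids Python's recursion limit on very deep trees. The function names postorder_iterwalk and postorder_stack are made up for this illustration, and the behaviour is worth checking against your lxml version.

```python
from typing import Iterator

from lxml import etree

HTML1 = """
<div class="c1">
  <span class="c11">11</span>
  <span class="c12">12</span>
</div>
"""


def postorder_iterwalk(root: etree._Element) -> Iterator[etree._Element]:
    """Yield elements in postorder by listening only for 'end' events."""
    # Assumption: an element's 'end' event fires after the 'end' events of
    # all its descendants, which matches the postorder visit sequence.
    for _event, el in etree.iterwalk(root, events=("end",)):
        yield el


def postorder_stack(root: etree._Element) -> Iterator[etree._Element]:
    """Yield elements in postorder using an explicit stack (no recursion)."""
    # Each stack entry pairs a node with a flag saying whether its children
    # have already been pushed onto the stack.
    stack = [(root, False)]
    while stack:
        el, expanded = stack.pop()
        if expanded:
            yield el
        else:
            stack.append((el, True))
            # Push children in reverse so the leftmost child is popped first.
            stack.extend((child, False) for child in reversed(list(el)))


root = etree.fromstring(HTML1)
print([el.get("class") for el in postorder_iterwalk(root)])
print([el.get("class") for el in postorder_stack(root)])
```

One difference to keep in mind: the stack version visits every child node it finds, so comments and processing instructions would be yielded as well, whereas iterwalk() with only "end" events reports elements.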
