
Postorder depth-first search (DFS) of HTML, using Python, lxml, etree

This is not a DFS algorithm question or a library-suggestion question. It is specifically about lxml.etree (v4). I use Python 3.9.

This library, lxml.etree, provides a way to iterate over the ElementTree into which HTML code is parsed. The iterator is DFS, but in preorder (using the term from the Wikipedia article on DFS). This means the elements are yielded in the order of first visit. My question is then: what is an easy way to implement postorder iteration?

Here is minimal code demonstrating that the default order of iter() is preorder. I created a dummy function, so the second test obviously fails. I need an implementation of _iter_postorder for the assertion to hold true.

import unittest
from typing import List
from xml.etree.ElementTree import ElementTree

from lxml import etree

HTML1 = """
        <div class="c1">
        <span class="c11">11</span>
        <span class="c12">12</span>
        </div>
        """


def _iter_postorder(tree: ElementTree) -> List[str]:
    return []

class EtreeElementTests(unittest.TestCase):

    def test_dfs_preordering(self):
        """ regular iter() is dfs preordering"""
        root = etree.HTML(HTML1, etree.XMLParser())
        tree = ElementTree(root)
        result = [el.attrib['class'] for el in tree.iter()]
        self.assertListEqual(result, ["c1", "c11", "c12"])

    def test_dfs_postordering(self):
        root = etree.HTML(HTML1, etree.XMLParser())
        tree = ElementTree(root) 
        result = _iter_postorder(tree)
        self.assertListEqual(result, ["c11", "c12", "c1"])

I have coded it myself, using recursion and element.iterchildren, but I am disappointed there is nothing out of the box. Here is the solution:

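import unittest
from typing import Iterator, List, Optional
from xml.etree.ElementTree import ElementTree

from lxml import etree

HTML1 = """
        <div class="c1">
        <span class="c11">11</span>
        <span class="c12">12</span>
        </div>
        """


def iter_postorder(
    el: etree._Element, result: Optional[List[etree._Element]] = None
) -> List[etree._Element]:
    """Return el and all its descendants as a list in postorder."""
    # A mutable default ([]) would be shared across calls and keep growing,
    # so the accumulator is created here on the outermost call.
    if result is None:
        result = []
    for child in el.iterchildren():
        iter_postorder(child, result)
    # The element itself is appended only after all of its children.
    result.append(el)
    return result


def iter_postorder2(el: etree._Element) -> Iterator[etree._Element]:
    """Lazily yield el and all its descendants in postorder."""
    for child in el.iterchildren():
        yield from iter_postorder2(child)
    yield el
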

class EtreeElementTests(unittest.TestCase):

    def test_dfs_preordering(self):
        """ regular iter() is dfs preordering"""
        root = etree.HTML(HTML1, etree.XMLParser())
        tree = ElementTree(root)
        result = [el.attrib['class'] for el in tree.iter()]
        self.assertListEqual(result, ["c1", "c11", "c12"])

    def test_dfs_postorder(self):
        root = etree.HTML(HTML1, etree.XMLParser())
        result = [el.attrib['class'] for el in iter_postorder(root)]
        self.assertListEqual(result, ["c11", "c12", "c1"])

    def test_dfs_postorder_generator(self):
        root = etree.HTML(HTML1, etree.XMLParser())
        result = [el.attrib['class'] for el in iter_postorder2(root)]
        self.assertListEqual(result, ["c11", "c12", "c1"])


if __name__ == '__main__':
    unittest.main()
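
As a possible out-of-the-box alternative, lxml's etree.iterwalk seems to fit here: it walks an already-parsed tree and reports "end" events, which fire only after all of an element's children have been handled, i.e. in postorder. The sketch below is not part of the solution above. The helper name iter_postorder_iterwalk is made up for illustration, and the snippet is parsed with etree.fromstring rather than the etree.HTML(HTML1, etree.XMLParser()) call used in the tests.

```python
from typing import Iterator

from lxml import etree

HTML1 = """
        <div class="c1">
        <span class="c11">11</span>
        <span class="c12">12</span>
        </div>
        """


def iter_postorder_iterwalk(el: etree._Element) -> Iterator[etree._Element]:
    """Yield el and its descendants in postorder (hypothetical helper)."""
    # An "end" event is emitted when an element is closed, so every child
    # is reported before its parent.
    for _event, node in etree.iterwalk(el, events=("end",)):
        yield node


# Parsed as XML so that the <div> is the root and every element has a class.
root = etree.fromstring(HTML1.strip())
classes = [node.attrib["class"] for node in iter_postorder_iterwalk(root)]
print(classes)  # ['c11', 'c12', 'c1']
```

Unlike the recursive versions, this does not grow the Python call stack with the depth of the tree, which may matter for very deeply nested documents.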
