Post-order depth-first search (DFS) of HTML using Python, lxml, etree

This is neither a question about the DFS algorithm nor a request for library recommendations. It is specifically about lxml.etree (v4). I am using Python 3.9.

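The lxml.etree library provides a way to iterate over an ElementTree into which HTML code has been parsed. The iterator is depth-first, but pre-order (using the terminology of the Wikipedia article on DFS): elements are yielded in the order in which they are first visited. My question is: what is a simple way to get post-order iteration, where each element is yielded only after all of its children?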

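Here is a minimal example showing that the default order of iter() is pre-order. _iter_postorder is a dummy function, so the second test obviously fails. I need an implementation of _iter_postorder that makes the assertion pass.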

import unittest
from typing import List
from xml.etree.ElementTree import ElementTree

from lxml import etree

HTML1 = """
        <div class="c1">
        <span class="c11">11</span>
        <span class="c12">12</span>
        </div>
        """


def _iter_postorder(tree: ElementTree) -> List[str]:
    return []

class EtreeElementTests(unittest.TestCase):

    def test_dfs_preordering(self):
        """ regular iter() is dfs preordering"""
        root = etree.HTML(HTML1, etree.XMLParser())
        tree = ElementTree(root)
        result = [el.attrib['class'] for el in tree.iter()]
        self.assertListEqual(result, ["c1", "c11", "c12"])

    def test_dfs_postordering(self):
        root = etree.HTML(HTML1, etree.XMLParser())
        tree = ElementTree(root) 
        result = _iter_postorder(tree)
        self.assertListEqual(result, ["c11", "c12", "c1"])

I wrote the code myself, using recursion and element.iterchildren, but I am disappointed that there is nothing out of the box. Here is the solution:

import unittest
from typing import List, Union, Generator
from xml.etree.ElementTree import ElementTree, Element

from lxml import etree

HTML1 = """
        <div class="c1">
        <span class="c11">11</span>
        <span class="c12">12</span>
        </div>
        """


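def iter_postorder(el: Element, result: Union[List[Element], None] = None) -> List[Element]:
    # Start with a fresh list on each top-level call; a mutable default
    # argument would be shared (and keep growing) across calls.
    if result is None:
        result = []
    for child in el.iterchildren():
        iter_postorder(child, result)
    # Append the element only after all of its children: post-order.
    result.append(el)
    return result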


def iter_postorder2(el: Element) -> Generator[Element, None, None]:
    for child in el.iterchildren():
        yield from iter_postorder2(child)
    yield el


class EtreeElementTests(unittest.TestCase):

    def test_dfs_preordering(self):
        """ regular iter() is dfs preordering"""
        root = etree.HTML(HTML1, etree.XMLParser())
        tree = ElementTree(root)
        result = [el.attrib['class'] for el in tree.iter()]
        self.assertListEqual(result, ["c1", "c11", "c12"])

    def test_dfs_postorder(self):
        root = etree.HTML(HTML1, etree.XMLParser())
        result = [el.attrib['class'] for el in iter_postorder(root)]
        self.assertListEqual(result, ["c11", "c12", "c1"])

    def test_dfs_postorder_generator(self):
        root = etree.HTML(HTML1, etree.XMLParser())
        result = [el.attrib['class'] for el in iter_postorder2(root)]
        self.assertListEqual(result, ["c11", "c12", "c1"])


if __name__ == '__main__':
    unittest.main()

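As for something closer to out of the box: lxml provides etree.iterwalk, which replays parser-style events over an already-parsed tree. An "end" event fires for an element only once all of its children have been visited, so collecting the "end" events should give a post-order traversal. The sketch below assumes the fragment is well-formed XML and parses it with etree.fromstring rather than etree.HTML to keep the demo short.

```python
from lxml import etree

HTML1 = """
<div class="c1">
<span class="c11">11</span>
<span class="c12">12</span>
</div>
"""

root = etree.fromstring(HTML1)

# iterwalk yields (event, element) pairs; with events=("end",) each element
# appears after its descendants, which is DFS post-order.
classes = [el.get("class") for _event, el in etree.iterwalk(root, events=("end",))]
print(classes)  # ['c11', 'c12', 'c1']
```

Adding "start" to the events tuple would give the pre-order visits as well, in the same pass.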