How to extract the text and the XPath of each element of an HTML page in Python

Question

I am working on a Django project where I need to extract every element that contains text, together with the XPath to that element. For example:

<html>
<head>
    <title>
        The Demo page
    </title>
</head>

<body>
    <div>
        <section>
            <h1> Hello world
            </h1>
        </section>
        <div>
            <p>
                Hope you all are doing well,
            </p>
        </div>
        <div>
            <p>
                This is the example HTML
            </p>
        </div>
    </div>
</body>
</html>

The output should be something like:

/head/title: The Demo Page
/body/div/section/h1: Hello world!
/body/div/div[1]/p: Hope you all are doing well,
/body/div/div[2]/p: This is the example HTML

Answer 1

Something like this should work:

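from lxml import etree

# Paste the HTML from the question here; etree.fromstring expects well-formed markup.
html = """[your html above]"""

root = etree.fromstring(html)
# Select the parent element of every text node that is not whitespace-only.
targets = root.xpath('//text()[normalize-space()]/..')
tree = etree.ElementTree(root)

for target in targets:
    # getpath() returns the absolute XPath of the element within the tree.
    print(tree.getpath(target), target.text.strip())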

Output:

/html/head/title The Demo page
/html/body/div/section/h1 Hello world
/html/body/div/div[1]/p Hope you all are doing well,
/html/body/div/div[2]/p This is the example HTML
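
If lxml is not available, the same idea can be sketched with the standard library alone. This is only an illustration, not part of the answer above: `xml.etree.ElementTree` has no `getpath()`, so the hypothetical helper `text_with_paths` builds each path by hand while walking the tree. It assumes the markup is well-formed XML (as `etree.fromstring` does in the answer) and it only looks at an element's own leading text, not at text that follows a child element.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The page from the question, shortened in whitespace only.
SAMPLE = """<html>
<head><title> The Demo page </title></head>
<body><div>
<section><h1> Hello world
</h1></section>
<div><p> Hope you all are doing well, </p></div>
<div><p> This is the example HTML </p></div>
</div></body>
</html>"""


def text_with_paths(markup):
    """Yield (path, text) for every element whose own text is non-blank.

    Assumption: `markup` is well-formed; ET.fromstring raises ParseError otherwise.
    """
    root = ET.fromstring(markup)

    def walk(element, path):
        text = (element.text or "").strip()
        if text:
            yield path, text
        # Add a positional index only when several siblings share a tag,
        # which mirrors the /div[1], /div[2] style shown in the output above.
        totals = Counter(child.tag for child in element)
        seen = Counter()
        for child in element:
            seen[child.tag] += 1
            step = child.tag
            if totals[child.tag] > 1:
                step += f"[{seen[child.tag]}]"
            yield from walk(child, f"{path}/{step}")

    yield from walk(root, f"/{root.tag}")


for path, text in text_with_paths(SAMPLE):
    print(f"{path}: {text}")
```

The `path: text` format follows the layout asked for in the question. Dropping the leading `/html` step, as in the question's expected output, would be a matter of starting the walk from the children of the root instead.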

Answer 1
1 Accepted 2020-12-01 12:14:20
