
How to extract text and the XPath to that element of the HTML page in Python

I am working on a Django project where I need to extract every text-containing element of an HTML page, together with the XPath to that element. For example:

<html>
<head>
    <title>
        The Demo page
    </title>
</head>

<body>
    <div>
        <section>
            <h1> Hello world
            </h1>
        </section>
        <div>
            <p>
                Hope you all are doing well,
            </p>
        </div>
        <div>
            <p>
                This is the example HTML
            </p>
        </div>
    </div>
</body>
</html>

The output should be something like:

/head/title: The Demo page
/body/div/section/h1: Hello world
/body/div/div[1]/p: Hope you all are doing well,
/body/div/div[2]/p: This is the example HTML

Something like this should work:

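from lxml import etree

html = """[your html above]"""

# etree.fromstring expects well-formed markup, which the example above is
root = etree.fromstring(html)
# Select the parent element of every text node that is not just whitespace
targets = root.xpath('//text()[normalize-space()]/..')
tree = etree.ElementTree(root)

for target in targets:
    # getpath() returns the absolute XPath of the element;
    # target.text can be None when the matched text is tail text
    print(tree.getpath(target), (target.text or '').strip())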

Output:

/html/head/title The Demo page
/html/body/div/section/h1 Hello world
/html/body/div/div[1]/p Hope you all are doing well,
/html/body/div/div[2]/p This is the example HTML
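
If you want the exact format from the question (no leading /html, and a colon between path and text), a variation along these lines might do it. This is only a sketch: text_with_xpaths and SAMPLE are names made up for illustration, and it assumes the markup is well-formed. For messier real-world pages, lxml.html.fromstring may be a better parser choice. It walks the text nodes themselves, so text that follows an inline child (tail text) is reported against the enclosing element.

```python
from lxml import etree

# Condensed copy of the example page from the question
SAMPLE = """<html>
<head><title> The Demo page </title></head>
<body>
  <div>
    <section><h1> Hello world </h1></section>
    <div><p> Hope you all are doing well, </p></div>
    <div><p> This is the example HTML </p></div>
  </div>
</body>
</html>"""


def text_with_xpaths(markup):
    """Return (xpath, text) pairs for every non-blank text node.

    Hypothetical helper, not part of lxml. Assumes well-formed markup.
    """
    root = etree.fromstring(markup)
    tree = etree.ElementTree(root)
    prefix = "/" + root.tag  # e.g. "/html", dropped to match the requested format
    pairs = []
    for node in root.xpath("//text()[normalize-space()]"):
        owner = node.getparent()
        if node.is_tail:
            # For tail text, getparent() is the preceding sibling element,
            # so step up once more to reach the enclosing element
            owner = owner.getparent()
        path = tree.getpath(owner)
        pairs.append((path[len(prefix):] or "/", node.strip()))
    return pairs


for path, text in text_with_xpaths(SAMPLE):
    print(f"{path}: {text}")
```

Text that sits directly inside the root element is reported with the path "/", since there is nothing left after dropping the root prefix.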

