Python: Using lxml xpath to get text from all HTML child elements

Question

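I am using lxml XPath in Python. If I give the full path to an HTML tag, I can extract its text. However, I cannot work out how to extract all the text from a tag and its child elements into a list. For example, given this HTML, I would like to get all the text of the "example" class: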

<div class="example">
    "Some text"
    <div>
        "Some text 2"
        <p>"Some text 3"</p>
        <p>"Some text 4"</p>
        <span>"Some text 5"</span>
    </div>
    <p>"Some text 6"</p> 
</div>

I would like to get:

["Some text", "Some text 2", "Some text 3", "Some text 4", "Some text 5", "Some text 6"]

Answer 1

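mzjn's answer is correct. After some trial and error I managed to get it working. This is what the final code looks like. You need to put //text() at the end of the XPath. It has not been refactored yet, so there will certainly be some mistakes and bad practices in it, but it works.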

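    import urllib.request

    import requests
    from bs4 import BeautifulSoup
    from lxml import html
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    # Session that retries failed connections up to 3 times with backoff
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    page = session.get("The url you are webscraping")
    content = page.content

    # Not used by the XPath query below (the code has not been refactored yet)
    htmlsite = urllib.request.urlopen("The url you are webscraping")
    soup = BeautifulSoup(htmlsite, 'lxml')
    htmlsite.close()

    # The trailing //text() returns every text node under the matched div,
    # including the text of its child elements
    tree = html.fromstring(content)
    scraped = tree.xpath('//html[contains(@class, "no-js")]/body/div[contains(@class, "container")]/div[contains(@class, "content")]/div[contains(@class, "row")]/div[contains(@class, "col-md-6")]/div[contains(@class, "clearfix")]//text()')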

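I tried this on the team page of keeleyteton.com. It returned the following list, which is correct (although it needs a lot of tidying up!), as the items were in different tags and some were child tags. Thanks for the help!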

['\r\n        ', '\r\n        ', 'Nicholas F. Galluccio', '\r\n        ', '\r\n        ', 'Managing Director and Portfolio Manager', '\r\n        ', 'Teton Small Cap Select Value', '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Scott R. Butler', '\r\n        ', '\r\n        ', 'Senior Vice President and Portfolio Manager ', '\r\n        ', 'Teton Small Cap Select Value', '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Thomas E. Browne, Jr., CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Brian P. Leonard, CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Robert M. Goldsborough', '\r\n        ', '\r\n        ', 'Research Analyst', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Brian R. Keeley, CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Edward S. Borland', '\r\n        ', '\r\n        ', 'Research Analyst', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Kevin M. Keeley', '\r\n        ', '\r\n        ', 'President', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Deanna B. Marotz', '\r\n        ', '\r\n        ', 'Chief Compliance Officer', '\r\n      ']
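
The whitespace-only entries in the output above can be filtered out after the query. Below is a minimal sketch of one way to do that, run against the HTML snippet from the question instead of a live page. The names `SNIPPET`, `extract_texts` and `extract_texts_xpath_filter` are made up for this illustration, and stripping the literal double quotes is an assumption based on the expected list in the question, which does not contain them.

```python
from lxml import html

# HTML snippet from the question, kept as a local string (no network access)
SNIPPET = """
<div class="example">
    "Some text"
    <div>
        "Some text 2"
        <p>"Some text 3"</p>
        <p>"Some text 4"</p>
        <span>"Some text 5"</span>
    </div>
    <p>"Some text 6"</p>
</div>
"""


def extract_texts(markup):
    """Collect every text node under div.example and clean it up in Python."""
    tree = html.fromstring(markup)
    # //text() at the end selects the text nodes of the div and all descendants
    raw = tree.xpath('//div[@class="example"]//text()')
    # Trim surrounding whitespace, then the literal quotes (assumed unwanted)
    cleaned = (t.strip().strip('"') for t in raw)
    # Drop entries that were whitespace only
    return [t for t in cleaned if t]


def extract_texts_xpath_filter(markup):
    """Same idea, but skip whitespace-only nodes inside the XPath itself."""
    tree = html.fromstring(markup)
    # The [normalize-space()] predicate keeps only nodes with visible text
    raw = tree.xpath('//div[@class="example"]//text()[normalize-space()]')
    return [t.strip().strip('"') for t in raw]


if __name__ == "__main__":
    print(extract_texts(SNIPPET))
    print(extract_texts_xpath_filter(SNIPPET))
```

Both functions should give the list asked for in the question. The second one does the filtering inside the query, which may be convenient when the result is passed on without further processing; the strings still need `strip()` because `normalize-space()` is only used as a test here and does not change the returned nodes.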

Solution 1
0 Accepted 2020-09-01 11:59:43
