How to extract an href URL from an HTML anchor using lxml?

Question

I am trying to extract the next-page href string using lxml.

For example, I want to extract "/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" from the following HTML:

<nav rel="nav" class="pagination-container AjaxPager">
    <a href="/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" data-page-number="next-page" class="button button--primary next-page" rel="next" data-track-link="{'target': 'Company profile', 'name': 'navigation', 'navigationType': 'next'}">
Next page
    </a>
</nav>

I have tried the following, but it returns a list rather than the string I am looking for:

import requests
import lxml.html as html

URL = "https://uk.trustpilot.com/review/bulb.co.uk"
page = requests.get(URL)

tree = html.fromstring(page.content)

href = tree.xpath('//a/@href')

Any idea what I am doing wrong?

Answer 1

Making this change to your code:

href = tree.xpath('//a[@class="button button--primary next-page"]/@href')
href[0]

gives me this output:

'/review/bulb.co.uk?b=MTYxOTk1ODMxMzAwMHw2MDhlOWEyOWY5ZjQ4NzA4ZTA4MjMxNTE'

which is close to the output in your question (the value may change dynamically). Note that xpath() returns a list of every match, which is why the result has to be indexed with [0] to get the string itself.
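For anyone who wants to try this without hitting the live site, here is a minimal sketch that runs against a trimmed copy of the HTML from the question. It assumes a few things that are not in the original code: the helper name next_page_href is made up for illustration, the anchor is matched on rel="next" instead of the full class string (an alternative to the selector above, on the assumption that the attribute is less likely to change than the class list), and the relative href is resolved with urllib.parse.urljoin. It also returns None when nothing matches, so an empty list does not raise an IndexError.

```python
from urllib.parse import urljoin

import lxml.html as html

BASE_URL = "https://uk.trustpilot.com/review/bulb.co.uk"

# Trimmed copy of the markup from the question, so no network call is needed.
SAMPLE = """
<nav rel="nav" class="pagination-container AjaxPager">
    <a href="/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk"
       data-page-number="next-page"
       class="button button--primary next-page"
       rel="next">
Next page
    </a>
</nav>
"""


def next_page_href(tree):
    """Hypothetical helper: return the first rel="next" href, or None."""
    # xpath() on an attribute returns a list of matching strings.
    matches = tree.xpath('//a[@rel="next"]/@href')
    return matches[0] if matches else None


tree = html.fromstring(SAMPLE)
href = next_page_href(tree)
print(href)

# The href is relative, so join it with the page URL before requesting it.
if href is not None:
    print(urljoin(BASE_URL, href))
```

On the real page you would build the tree from requests.get(URL).content as in the question and pass it to the same helper.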

Score 2, accepted, 2021-05-03 18:51:38
