How to extract an href URL from an HTML anchor using lxml?
I am trying to extract the next-page href string using lxml.
For example, I want to extract "/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" from the following HTML:
<nav rel="nav" class="pagination-container AjaxPager">
<a href="/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" data-page-number="next-page" class="button button--primary next-page" rel="next" data-track-link="{'target': 'Company profile', 'name': 'navigation', 'navigationType': 'next'}">
Next page
</a>
</nav>
I have tried the following, but it returns a list rather than the string I am looking for:
import requests
import lxml.html as html
URL = "https://uk.trustpilot.com/review/bulb.co.uk"
page = requests.get(URL)
tree = html.fromstring(page.content)
href = tree.xpath('//a/@href')
Any idea what I am doing wrong?
Making this change to your code
href = tree.xpath('//a[@class="button button--primary next-page"]/@href')
href[0]
gives me this output:
'/review/bulb.co.uk?b=MTYxOTk1ODMxMzAwMHw2MDhlOWEyOWY5ZjQ4NzA4ZTA4MjMxNTE'
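which is close to the output in your question (the token in the URL changes over time). xpath() returns a list of every match, so narrowing the expression to the one anchor and indexing with [0] gives you the string.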
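If it helps, here is a minimal offline sketch that runs against the markup from the question instead of the live page. It differs from the answer above in three ways, all of which are my own assumptions rather than anything the original poster asked for: it selects on rel="next" instead of the full class string, it guards against an empty result, and it resolves the relative href against a BASE_URL constant that I picked. The helper name next_page_href is hypothetical.

```python
from urllib.parse import urljoin

import lxml.html as html

# Markup copied from the question (data-track-link attribute left out for brevity).
SNIPPET = """
<nav rel="nav" class="pagination-container AjaxPager">
<a href="/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" data-page-number="next-page" class="button button--primary next-page" rel="next">
Next page
</a>
</nav>
"""

# Assumption: the site root, used only to turn the relative href into a full URL.
BASE_URL = "https://uk.trustpilot.com"


def next_page_href(tree):
    """Return the href of the first <a rel="next">, or None if no such anchor exists."""
    hrefs = tree.xpath('//a[@rel="next"]/@href')
    # Indexing an empty list would raise IndexError (e.g. on the last page),
    # so fall back to None instead.
    return hrefs[0] if hrefs else None


tree = html.fromstring(SNIPPET)
href = next_page_href(tree)
next_url = urljoin(BASE_URL, href) if href else None

print(href)
print(next_url)
```

Matching on rel="next" is less brittle than comparing the whole class attribute, which stops matching as soon as the site adds, removes or reorders a class. Whether the live page still uses rel="next" is something you would need to check yourself.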