How to extract an href URL from an HTML anchor using lxml?

I am trying to extract the next-page href string using lxml.

For example, I want to extract "/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" from the following HTML:

<nav rel="nav" class="pagination-container AjaxPager">
    <a href="/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" data-page-number="next-page" class="button button--primary next-page" rel="next" data-track-link="{'target': 'Company profile', 'name': 'navigation', 'navigationType': 'next'}">
Next page
    </a>
</nav>

I have tried the following, but it returns a list rather than the string I am looking for:

import requests
import lxml.html as html

URL = "https://uk.trustpilot.com/review/bulb.co.uk"
page = requests.get(URL)

tree = html.fromstring(page.content)

href = tree.xpath('//a/@href')

Any idea what I am doing wrong?

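Since xpath() always returns a list of matches, select the next-page anchor specifically and index the result. Making this change to your code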

href = tree.xpath('//a[@class="button button--primary next-page"]/@href')
href[0]

gives me this output:

'/review/bulb.co.uk?b=MTYxOTk1ODMxMzAwMHw2MDhlOWEyOWY5ZjQ4NzA4ZTA4MjMxNTE'

which is close to the value in your question (the token in the query string may change between requests).
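As a variation on the answer above, here is a minimal sketch that runs offline against the markup from the question. It matches on rel="next" instead of the exact class string, which is my own substitution and assumes the page keeps that attribute. The helper name next_page_href is hypothetical, and the data-track-link attribute is omitted from the sample for brevity.

```python
from typing import Optional
from urllib.parse import urljoin

import lxml.html

# Markup copied from the question (data-track-link attribute omitted for brevity).
SAMPLE = """
<nav rel="nav" class="pagination-container AjaxPager">
    <a href="/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk"
       data-page-number="next-page" class="button button--primary next-page" rel="next">
Next page
    </a>
</nav>
"""


def next_page_href(markup: str) -> Optional[str]:
    """Hypothetical helper: return the href of the first <a rel="next">, or None."""
    tree = lxml.html.fromstring(markup)
    # xpath() returns a list even when a single node matches,
    # so the first item has to be picked explicitly.
    matches = tree.xpath('//a[@rel="next"]/@href')
    return matches[0] if matches else None


href = next_page_href(SAMPLE)
print(href)

# The href is relative; urljoin turns it into an absolute URL for the next request.
if href is not None:
    print(urljoin("https://uk.trustpilot.com", href))
```

Returning None when nothing matches avoids the IndexError that href[0] would raise on the last page, where there is presumably no next-page link.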
