
Python script using lxml, xpath returning null list

I am trying to scrape href links from HTML tags using xpath with lxml. The xpath returns an empty list, even though it works when I test it on its own.

This is the code that returns the empty list:

page = self.opener.open(link).read()
doc=html.fromstring(str(page))
ref = doc.xpath('//ul[@class="s-result-list s-col-1 s-col-ws-1 s-result-list-hgrid s-height-equalized s-list-view s-text-condensed s-item-container-height-auto"]/li/div/div[@class="a-fixed-left-grid"]/div/div[@class="a-fixed-left-grid-col a-col-left"]/div/div/a')
for post in ref:
    print(post.get("href"))

I access the link through a proxy server, and that part seems to work, because the "doc" variable is populated with HTML content. I have checked the link and I am on the right page for this xpath.

Xpath returning links

This is the link I am trying to get data from: https://www.amazon.com/s/ref=lp_266162_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&bbn=266162&ie=UTF8&qid=1550120216&rnid=266162
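One way to narrow down an xpath that returns nothing is to evaluate it one step at a time and see where the match count drops to zero. The sketch below does this against a small hypothetical HTML sample, not Amazon's real markup; `SAMPLE` and `count_matches_per_step` are illustrative names, not part of lxml. It also passes the bytes straight to `html.fromstring`, since on Python 3 calling `str()` on bytes yields the `b'...'` repr rather than the decoded markup.

```python
from lxml import html

# Hypothetical stand-in for the downloaded page; the real markup differs.
SAMPLE = b"""
<html><body>
  <ul class="s-result-list s-col-1">
    <li><div class="a-fixed-left-grid"><a href="/dp/1">one</a></div></li>
    <li><div class="a-fixed-left-grid"><a href="/dp/2">two</a></div></li>
  </ul>
</body></html>
"""


def count_matches_per_step(doc, steps):
    """Return (xpath_prefix, match_count) for each cumulative prefix of steps."""
    results = []
    prefix = ""
    for step in steps:
        prefix += step
        results.append((prefix, len(doc.xpath(prefix))))
    return results


doc = html.fromstring(SAMPLE)  # bytes go in directly, no str() call
steps = [
    '//ul[@class="s-result-list s-col-1"]',
    '/li',
    '/div[@class="a-fixed-left-grid"]',
    '/div',  # SAMPLE has no nested div here, so the count drops to zero
    '/a',
]
report = count_matches_per_step(doc, steps)
for prefix, count in report:
    print(count, prefix)
```

The first prefix that reports zero matches is the step to inspect. Running the same check on the HTML the script actually received, rather than on what the browser shows, can also reveal whether the two differ.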

Your xpath selector doesn't match. Try a CSS selector, as shown below:

import requests
import lxml, lxml.html

url = 'https://www.amazon.com/s/ref=lp_266162_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&bbn=266162&ie=UTF8&qid=1550120216&rnid=266162'
r = requests.get(url)
html = lxml.html.fromstring(r.content)
links = html.cssselect('.a-fixed-left-grid-col .a-col-left a')
for link in links:
    print(link.attrib['href'])

Output

https://www.amazon.com/Top-500-Instant-Pot-Recipes/dp/1730885209
https://www.amazon.com/Monthly-Budget-Planner-Organizer-Notebook/dp/1978202865
https://www.amazon.com/Edge-Order-Daniel-Libeskind/dp/045149735X
https://www.amazon.com/Man-Glass-House-Johnson-Architect/dp/0316126438
https://www.amazon.com/Versailles-Private-Invitation-Guillaume-Picon/dp/2080203371
https://www.amazon.com/Palm-Springs-Modernist-Tim-Street-Porter/dp/0847861872
https://www.amazon.com/Building-Chicago-Architectural-John-Zukowsky/dp/0847848701
https://www.amazon.com/Taverns-American-Revolution-Adrian-Covert/dp/160887785X
https://www.amazon.com/TRAVEL-MOSAIC-Color-Number-Relaxation/dp/1717562221
https://www.amazon.com/Understanding-Cemetery-Symbols-Historic-Graveyards/dp/1547047216
https://www.amazon.com/Soviet-Bus-Stops-Christopher-Herwig/dp/099319110X
https://www.amazon.com/Famous-Movie-Scenes-Dot-Dot/dp/1977747043

pip requirements

certifi==2018.11.29
chardet==3.0.4
cssselect==1.0.3
idna==2.8
lxml==4.3.1
requests==2.21.0
urllib3==1.24.1
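A possible reason the class-based approach is more forgiving: in xpath, `@class="..."` compares the whole attribute string, so one extra or missing class name makes the predicate fail, whereas a CSS class selector such as `.s-result-list` only checks that the name is one of the space-separated tokens. The sketch below shows the difference with xpath alone, using a hypothetical HTML sample (`SAMPLE` is an assumption, not Amazon's markup) and the usual `concat`/`normalize-space` idiom for a token test.

```python
from lxml import html

# Hypothetical markup: the ul carries one more class than the xpath expects.
SAMPLE = """
<html><body>
  <ul class="s-result-list s-col-1 s-list-view">
    <li><a href="/dp/1">one</a></li>
  </ul>
</body></html>
"""
doc = html.fromstring(SAMPLE)

# Exact string comparison: fails because the attribute has a third class.
exact = doc.xpath('//ul[@class="s-result-list s-col-1"]')

# Token test: succeeds as long as "s-result-list" is one of the classes.
token = doc.xpath(
    '//ul[contains(concat(" ", normalize-space(@class), " "), " s-result-list ")]'
)

hrefs = [a.get("href") for ul in token for a in ul.xpath("./li/a")]
print(len(exact), len(token), hrefs)
```

Whether this is what happens with the original xpath depends on the HTML the server actually returns, which may not match what the browser displays.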

I suppose you are after the links under Books : Arts & Photography : Architecture : Buildings : Landmarks & Monuments. I used xpath in my script to get the links. Give it a go:

import requests
from lxml.html import fromstring

link = 'https://www.amazon.com/s/ref=lp_266162_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&bbn=266162&ie=UTF8&qid=1550120216&rnid=266162'
r = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
htmlcontent = fromstring(r.text)
itemlinks = htmlcontent.xpath('//*[@id="mainResults"]//*[contains(@class,"s-access-detail-page")]')
for link in itemlinks:
    print(link.get('href'))

If you would rather use a CSS selector, the following should work:

'#mainResults .s-access-detail-page'

