
Python script using lxml, xpath returning empty list

I am trying to scrape href links from an HTML tag using XPath with lxml. The XPath returns an empty list in my script, even though I tested it separately and it seems to work.

The code returns an empty result, whereas the XPath itself seems to work fine.

page = self.opener.open(link).read()
doc=html.fromstring(str(page))
ref = doc.xpath('//ul[@class="s-result-list s-col-1 s-col-ws-1 s-result-list-hgrid s-height-equalized s-list-view s-text-condensed s-item-container-height-auto"]/li/div/div[@class="a-fixed-left-grid"]/div/div[@class="a-fixed-left-grid-col a-col-left"]/div/div/a')
for post in ref:
    print(post.get("href"))

I'm using a proxy server to access the links, and it seems to work, since the "doc" variable is populated with the HTML content. I've checked the links, and I'm on the proper page to fetch this XPath.

[Image: XPath returning links]

This is the link from which I'm trying to fetch data: https://www.amazon.com/s/ref=lp_266162_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&bbn=266162&ie=UTF8&qid=1550120216&rnid=266162

Your XPath selector is invalid. Try a CSS selector like the one below:

import requests
import lxml.html

url = 'https://www.amazon.com/s/ref=lp_266162_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&bbn=266162&ie=UTF8&qid=1550120216&rnid=266162'
r = requests.get(url)
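# Parse the raw response bytes into an element tree
html = lxml.html.fromstring(r.content)
# cssselect() needs the separate cssselect package to be installed
links = html.cssselect('.a-fixed-left-grid-col .a-col-left a')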
for link in links:
    print(link.attrib['href'])

Output:

https://www.amazon.com/Top-500-Instant-Pot-Recipes/dp/1730885209
https://www.amazon.com/Monthly-Budget-Planner-Organizer-Notebook/dp/1978202865
https://www.amazon.com/Edge-Order-Daniel-Libeskind/dp/045149735X
https://www.amazon.com/Man-Glass-House-Johnson-Architect/dp/0316126438
https://www.amazon.com/Versailles-Private-Invitation-Guillaume-Picon/dp/2080203371
https://www.amazon.com/Palm-Springs-Modernist-Tim-Street-Porter/dp/0847861872
https://www.amazon.com/Building-Chicago-Architectural-John-Zukowsky/dp/0847848701
https://www.amazon.com/Taverns-American-Revolution-Adrian-Covert/dp/160887785X
https://www.amazon.com/TRAVEL-MOSAIC-Color-Number-Relaxation/dp/1717562221
https://www.amazon.com/Understanding-Cemetery-Symbols-Historic-Graveyards/dp/1547047216
https://www.amazon.com/Soviet-Bus-Stops-Christopher-Herwig/dp/099319110X
https://www.amazon.com/Famous-Movie-Scenes-Dot-Dot/dp/1977747043

pip requirements:

certifi==2018.11.29
chardet==3.0.4
cssselect==1.0.3
idna==2.8
lxml==4.3.1
requests==2.21.0
urllib3==1.24.1
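
One likely reason a selector like the one in the question comes back empty is that `@class="..."` in XPath is an exact string comparison: any extra class, missing class, or change in order makes it fail, whereas a CSS class selector only needs the class to be present. A minimal sketch of the difference follows. The markup is a hypothetical, heavily simplified stand-in, not the real Amazon page.

```python
from lxml import html

# Hypothetical, simplified markup (an assumption for illustration only).
SAMPLE = """
<html><body>
<ul class="s-result-list s-col-1 extra-class">
  <li><div class="a-fixed-left-grid-col a-col-left"><a href="/dp/1">one</a></div></li>
  <li><div class="a-col-left a-fixed-left-grid-col"><a href="/dp/2">two</a></div></li>
</ul>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# Exact comparison: matches only if the whole attribute equals this string.
# The <ul> above carries one extra class, so nothing is selected.
exact = doc.xpath('//ul[@class="s-result-list s-col-1"]//a/@href')

# Substring test: tolerant of extra classes and of class order.
loose = doc.xpath(
    '//ul[contains(@class, "s-result-list")]'
    '//div[contains(@class, "a-col-left")]/a/@href'
)

print(len(exact), len(loose))
```

Note that `contains()` is a plain substring test, so it would also accept a class such as `a-col-left-wide`; a CSS class selector matches whole class names only.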

I suppose you are after the links within Books : Arts & Photography : Architecture : Buildings : Landmarks & Monuments. I used XPath within the script to fetch the links. Give it a go:

import requests
from lxml.html import fromstring

link = 'https://www.amazon.com/s/ref=lp_266162_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&bbn=266162&ie=UTF8&qid=1550120216&rnid=266162'
r = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
htmlcontent = fromstring(r.text)
itemlinks = htmlcontent.xpath('//*[@id="mainResults"]//*[contains(@class,"s-access-detail-page")]')
for item in itemlinks:
    # use a separate loop name so the URL stored in `link` is not overwritten
    print(item.get('href'))
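
The script above sends a browser-like User-Agent header. It is possible (an assumption, not something confirmed here) that the site serves a different page without it, in which case the XPath simply finds nothing. The sketch below separates parsing from fetching so the XPath can be checked against any HTML string; `extract_item_links` and the sample markup are hypothetical names made up for illustration.

```python
from urllib.parse import urljoin
from lxml.html import fromstring


def extract_item_links(html_text, base_url):
    """Return absolute hrefs of result links; [] if the results block is missing."""
    doc = fromstring(html_text)
    # Same XPath as in the answer above.
    nodes = doc.xpath(
        '//*[@id="mainResults"]//*[contains(@class,"s-access-detail-page")]'
    )
    # Resolve relative hrefs against the page URL and skip nodes without one.
    return [urljoin(base_url, n.get("href")) for n in nodes if n.get("href")]


# Hypothetical sample markup standing in for the real response body.
SAMPLE = """<html><body><div id="mainResults"><ul>
<li><a class="a-link-normal s-access-detail-page" href="/Some-Book/dp/0000000001">Some Book</a></li>
<li><a class="a-link-normal" href="/not-a-result">other</a></li>
</ul></div></body></html>"""

links = extract_item_links(SAMPLE, "https://www.amazon.com/")
print(links)
```

If the function returns an empty list for a real response, printing part of `r.text` is a quick way to see whether the expected results page was actually received.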

If you want to use a CSS selector instead, the following should work:

'#mainResults .s-access-detail-page'
