
Python script using lxml, xpath and css selector also returning null list

I tried to scrape the href link for the next page from an HTML tag using XPath with lxml. But the XPath returns an empty list, even though I tested it separately and it seemed to work.

I've tried both a CSS selector and XPath; both of them return an empty list.

The code returns an empty value, whereas the XPath itself seems to work fine.

import random
import urllib.request

import requests
from lxml import html

username = 'username'  # your zproxy credentials
password = 'password'
port = 22225  # placeholder: the port of your super proxy
session_id = random.random()
super_proxy_url = ('http://%s-session-%s:%s@zproxy.lum-superproxy.io:%d'
                   % (username, session_id, password, port))
proxy_handler = urllib.request.ProxyHandler({
    'http': super_proxy_url,
    'https': super_proxy_url,
})
opener = urllib.request.build_opener(proxy_handler)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')]
print('Performing request')
print('Performing request')

page = opener.open("https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588").read()
pageR = requests.get("https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588", headers={"User-Agent": "Mozilla/5.0"})

# Parse the response body, not the repr of the Response object:
# str(pageR) is just "<Response [200]>", which matches nothing.
doc = html.fromstring(pageR.text)

# Use a name that does not shadow the imported lxml.html module;
# fromstring() accepts the raw bytes returned by opener.open().read().
tree = html.fromstring(page)
links = tree.cssselect('#pagnNextLink')
for link in links:
    print(link.attrib['href'])

linkRef = doc.xpath("//a[@id='pagnNextLink']/@href")
print(linkRef)
for post in linkRef:
    link = "https://www.amazon.com%s" % post
    print(link)

I've tried two approaches here and neither of them seems to work.

I'm using a proxy server to access the links, and it seems to work, as the doc variable is getting populated with the HTML content. I've checked the links and I'm on the proper page to fetch this XPath/CSS link.
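As a quick sanity check (a sketch reusing the pageR response from the code above), you can confirm whether the anchor is present in the raw HTML at all before suspecting the selectors:

# Diagnostic sketch: is the next-page anchor in the HTML we actually received?
if 'pagnNextLink' in pageR.text:
    print('anchor is present; the selector itself should match')
else:
    # Amazon may have served a bot-check page instead of the results.
    print('anchor missing from response; inspecting the first 500 chars:')
    print(pageR.text[:500])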

Screenshot: XPath and CSS validation

Someone more experienced may give better advice on working with your set-up, so I will simply describe what I experienced:

When I used requests I sometimes got the link and sometimes not. When I didn't, the response indicated it was checking that I was not a bot, and told me to ensure my browser allowed cookies.
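To make that concrete, here is a minimal retry sketch of what worked for me when the interstitial appeared; the 'Robot Check' marker string is an assumption based on the page I saw and may differ for you:

import time
import requests
from lxml import html

url = 'https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588'

for attempt in range(3):
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    if 'Robot Check' not in r.text:
        # The real results page was served; extract the next-page link.
        print(html.fromstring(r.text).xpath("//a[@id='pagnNextLink']/@href"))
        break
    time.sleep(2)  # back off briefly before retrying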

With Selenium I reliably got a result in my tests, though it may not be quick enough, or may not be an option for you for other reasons.

from selenium import webdriver
d = webdriver.Chrome()
url = 'https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588'
d.get(url)
link = d.find_element_by_id('pagnNextLink').get_attribute('href')
print(link)

Selenium with proxy (Firefox):

Running Selenium Webdriver with a proxy in Python
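For reference, a minimal sketch of that Firefox set-up, using the selenium 3 profile API; the host and port are placeholders for your proxy endpoint:

from selenium import webdriver

# Route Firefox through a manual HTTP/HTTPS proxy (values are placeholders).
profile = webdriver.FirefoxProfile()
profile.set_preference('network.proxy.type', 1)  # 1 = manual proxy configuration
profile.set_preference('network.proxy.http', 'proxy.example.com')
profile.set_preference('network.proxy.http_port', 8080)
profile.set_preference('network.proxy.ssl', 'proxy.example.com')
profile.set_preference('network.proxy.ssl_port', 8080)
profile.update_preferences()
d = webdriver.Firefox(firefox_profile=profile)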

Selenium with proxy (Chrome) - covered nicely here:

https://stackoverflow.com/a/11821751/6241235
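In short, that approach passes a --proxy-server argument to Chrome; a minimal sketch, with the proxy address as a placeholder:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://proxy.example.com:8080')  # placeholder
d = webdriver.Chrome(options=options)
d.get('https://www.amazon.com')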
