
Getting an empty response from scrapy shell using xpath, while it works in element inspector

I am trying to scrape this webpage (for educational purposes).

When I extract the XPath and try it in the browser's element inspector, it works. For example, to get the address, I use the XPath below:

//div[@class="address-coords"]/div[@class="address"]/p/span[@itemprop="address"]
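As a sanity check that the expression itself is well-formed, you can run it against a small static snippet that mirrors the page's structure. The markup below is a hypothetical stand-in, not the real page source; Python's standard-library ElementTree is enough for this limited XPath subset:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring the page structure (not the real source).
snippet = """
<html><body>
  <div class="address-coords">
    <div class="address">
      <p><span itemprop="address">1 Example Street</span></p>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(snippet)
matches = root.findall('.//div[@class="address-coords"]/div[@class="address"]'
                       '/p/span[@itemprop="address"]')
print([span.text for span in matches])  # ['1 Example Street']
```

If the expression matches here but returns nothing in scrapy shell, the problem is the response content, not the XPath.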

Meanwhile, in scrapy shell, it does not work:

$ scrapy shell 'https://cloud.baladovore.com/map/sNRgAcGKiY' -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'

In [5]: response.xpath('//div[@class="address-coords"]/div[@class="address"]/p/span[@itemprop="address"]').getall()

Out[5]: []

I get an empty list, although the response is 200:

In [6]: response
Out[6]: <200 https://cloud.baladovore.com/map/008jPJuORI>

I have already tried all the suggestions I found on the Internet, like changing the user agent, setting ROBOTSTXT_OBEY to False, and increasing the download delay. I would really appreciate it if someone could help me solve this problem, since I have been working on it for days.

If you use the scrapy shell to look at the response's content (with response.body), you'll see that the server responds with a small page full of scripts that are then executed.
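A quick way to confirm this from the shell is to search the raw body for the content you expect. The body below is a stand-in illustrating the pattern; the real check would use the actual response.body inside scrapy shell:

```python
# Stand-in for response.body of a JavaScript-rendered page: scripts, no content.
body = (b'<html><head><script src="/static/app.js"></script></head>'
        b'<body><div id="root"></div></body></html>')

# The address markup the browser shows is absent from the raw HTML...
print(b'itemprop="address"' in body)  # False
# ...which means it is injected client-side after the scripts run,
# so XPath against the raw response can never find it.
```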

So you either need a way to run JavaScript with Scrapy, or to directly query the server to get the results. Using the browser's Dev tools (Network tab) is a common way to inspect those queries (as described in the linked answer).

Another solution is to use Selenium to simulate a full browser.

Edit 1: You need to go further than just https://cloud.baladovore.com/parse/classes/Address.

If you inspect the request, you'll see that it not only requests that page, but also supplies additional information:

Request URL: https://cloud.baladovore.com/parse/classes/Address

Request Method: POST

Request Payload: {"where":{"objectId":"sNRgAcGKiY"},"limit":1,"_method":"GET","_ApplicationId":"cB4rsS2KbFIG5IQyjJv0XaDC8M28e0YDu58SaolX","_JavaScriptKey":"eDoqTmoIS6Ofpf0OAgNdYKGm9TBs2fVv9MR8lS5u","_ClientVersion":"js1.6.14","_InstallationId":"02f7b7dd-31c7-b235-df1d-93c323dbcd60"}

Let's try simulating that with requests:

import requests

# Payload copied from the browser's Network tab for this request.
access_data = {
    "where": {"objectId": "sNRgAcGKiY"},
    "limit": 1,
    "_method": "GET",
    "_ApplicationId": "cB4rsS2KbFIG5IQyjJv0XaDC8M28e0YDu58SaolX",
    "_JavaScriptKey": "eDoqTmoIS6Ofpf0OAgNdYKGm9TBs2fVv9MR8lS5u",
    "_ClientVersion": "js1.6.14",
    "_InstallationId": "02f7b7dd-31c7-b235-df1d-93c323dbcd60",
}
url = 'https://cloud.baladovore.com/parse/classes/Address'

# Send the payload as JSON, like the browser does.
test_req = requests.post(url, json=access_data)
print(test_req.status_code)
print(test_req.json())

This outputs the decoded JSON response that you can work with.

I do not know _JavaScriptKey's properties; you will need to look into that.

If you insist on using Scrapy, you will need to read the documentation on how to set request bodies.
