
Getting an empty response from scrapy shell using xpath, while it works in element inspector

I am trying to scrape this webpage (for educational purposes).

When I extract the XPath and try it in the browser's element inspector, it works. For example, to get the address, I use the XPath below:

//div[@class="address-coords"]/div[@class="address"]/p/span[@itemprop="address"]
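As a sanity check that the expression itself is well-formed, you can run it against a small static snippet that mirrors the page's structure. The markup below is a hypothetical stand-in, not the real page source; Python's standard-library ElementTree is enough for this limited XPath subset:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring the page structure (not the real source).
snippet = """
<html><body>
  <div class="address-coords">
    <div class="address">
      <p><span itemprop="address">1 Example Street</span></p>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(snippet)
matches = root.findall('.//div[@class="address-coords"]/div[@class="address"]'
                       '/p/span[@itemprop="address"]')
print([span.text for span in matches])  # ['1 Example Street']
```

If the expression matches here but returns nothing in scrapy shell, the problem is the response content, not the XPath.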

Meanwhile, in scrapy shell, it does not work:

$ scrapy shell 'https://cloud.baladovore.com/map/sNRgAcGKiY' -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'

In [5]: response.xpath('//div[@class="address-coords"]/div[@class="address"]/p/span[@itemprop="address"]').getall()

Out[5]: []

I get an empty list, although the response is 200:

In [6]: response
Out[6]: <200 https://cloud.baladovore.com/map/008jPJuORI>

I have already tried all the suggestions I found on the Internet, like changing the user agent, setting ROBOTSTXT_OBEY to False, and increasing the download delay. I would really appreciate it if someone could help me solve this problem, since I have been working on it for days.

If you use the scrapy shell to look at the response's content (with response.body), you'll see that the server responds with a small page full of scripts that are then executed.
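A quick way to confirm this from the shell is to search the raw body for the content you expect. The body below is a stand-in illustrating the pattern; the real check would use the actual response.body inside scrapy shell:

```python
# Stand-in for response.body of a JavaScript-rendered page: scripts, no content.
body = (b'<html><head><script src="/static/app.js"></script></head>'
        b'<body><div id="root"></div></body></html>')

# The address markup the browser shows is absent from the raw HTML...
print(b'itemprop="address"' in body)  # False
# ...which means it is injected client-side after the scripts run,
# so XPath against the raw response can never find it.
```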

So you either need a way to run JavaScript with Scrapy, or to directly query the server to get the results. Using the browser's Dev tools (Network tab) is a common way to inspect those queries (as described in the linked answer).

Another solution is to use Selenium to simulate a full browser.

Edit 1: You need to go further than just https://cloud.baladovore.com/parse/classes/Address.

If you inspect the request, you'll see that it not only requests that page, but also supplies additional information:

Request URL: https://cloud.baladovore.com/parse/classes/Address

Request Method: POST

Request Payload: {"where":{"objectId":"sNRgAcGKiY"},"limit":1,"_method":"GET","_ApplicationId":"cB4rsS2KbFIG5IQyjJv0XaDC8M28e0YDu58SaolX","_JavaScriptKey":"eDoqTmoIS6Ofpf0OAgNdYKGm9TBs2fVv9MR8lS5u","_ClientVersion":"js1.6.14","_InstallationId":"02f7b7dd-31c7-b235-df1d-93c323dbcd60"}

Let's try simulating that with requests:

import requests

# Payload copied from the browser's Network tab for this request.
access_data = {
    "where": {"objectId": "sNRgAcGKiY"},
    "limit": 1,
    "_method": "GET",
    "_ApplicationId": "cB4rsS2KbFIG5IQyjJv0XaDC8M28e0YDu58SaolX",
    "_JavaScriptKey": "eDoqTmoIS6Ofpf0OAgNdYKGm9TBs2fVv9MR8lS5u",
    "_ClientVersion": "js1.6.14",
    "_InstallationId": "02f7b7dd-31c7-b235-df1d-93c323dbcd60",
}
url = 'https://cloud.baladovore.com/parse/classes/Address'

# Send the payload as JSON, like the browser does.
test_req = requests.post(url, json=access_data)
print(test_req.status_code)
print(test_req.json())

This outputs the decoded JSON response that you can work with.

I do not know _JavaScriptKey's properties; you will need to look into that.

If you insist on using Scrapy, you will need to read the documentation on how to set request bodies.
