Getting an empty response from scrapy shell using xpath, while it works in element inspector

I am trying to scrape this webpage (for educational purposes).

When I copy the XPath and try it in the browser's element inspector, it works. For example, to get the address, I use the XPath below:

//div[@class="address-coords"]/div[@class="address"]/p/span[@itemprop="address"]

In the Scrapy shell, however, it does not work:

$ scrapy shell 'https://cloud.baladovore.com/map/sNRgAcGKiY' -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'

In [5]: response.xpath('//div[@class="address-coords"]/div[@class="address"]/p/span[@itemprop="address"]').getall()

Out[5]: []

I get an empty list, although the response status is 200:

In [6]: response
Out[6]: <200 https://cloud.baladovore.com/map/008jPJuORI>

I have already tried every suggestion I could find online, such as changing the user agent, setting ROBOTSTXT_OBEY to False, and increasing the delay. I would really appreciate help solving this problem, since I have been working on it for days.

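If you look at the response's content in the Scrapy shell (with response.body), you'll see that the server responds with a small page full of scripts, which a browser then executes to build the page. The element your XPath targets is therefore absent from the HTML that Scrapy receives.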
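One way to confirm this is to check the raw HTML for script tags and for the element the XPath targets. The sketch below uses only the standard library. The helper names and the sample page are made up for illustration and are not what the server actually returns.

```python
from html.parser import HTMLParser


class RenderCheck(HTMLParser):
    """Counts <script> tags and notes whether the target <span> is present."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.has_address = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        # Same element the XPath targets: <span itemprop="address">
        if tag == "span" and dict(attrs).get("itemprop") == "address":
            self.has_address = True


def inspect_html(html):
    """Return (number of script tags, whether the address span exists)."""
    checker = RenderCheck()
    checker.feed(html)
    return checker.script_count, checker.has_address


# Hypothetical stand-in for a script-only page (NOT the real server response).
SAMPLE_SHELL_PAGE = """<html><head><script src="/app.js"></script></head>
<body><div id="root"></div><script>boot();</script></body></html>"""

print(inspect_html(SAMPLE_SHELL_PAGE))
```

In the Scrapy shell, the same check would be inspect_html(response.text). If the second value is False while the page contains several scripts, the span is most likely added later by JavaScript.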

So you either need a way to run JavaScript with Scrapy or to query the server directly to get the results. The Network tab of the browser's dev tools is a common way to inspect those queries (as described in the linked answer).

Another solution is to use Selenium to simulate a full browser.

Edit 1: You need to go further than just requesting https://cloud.baladovore.com/parse/classes/Address.

If you inspect the request, you'll see that it not only requests that URL, but also supplies additional information:

Request URL: https://cloud.baladovore.com/parse/classes/Address

Request Method: POST

Request Payload: {"where":{"objectId":"sNRgAcGKiY"},"limit":1,"_method":"GET","_ApplicationId":"cB4rsS2KbFIG5IQyjJv0XaDC8M28e0YDu58SaolX","_JavaScriptKey":"eDoqTmoIS6Ofpf0OAgNdYKGm9TBs2fVv9MR8lS5u","_ClientVersion":"js1.6.14","_InstallationId":"02f7b7dd-31c7-b235-df1d-93c323dbcd60"}

Let's try simulating that with requests:

import requests

# Payload copied from the request shown in the browser's Network tab.
access_data = {
    "where": {"objectId": "sNRgAcGKiY"},
    "limit": 1,
    "_method": "GET",
    "_ApplicationId": "cB4rsS2KbFIG5IQyjJv0XaDC8M28e0YDu58SaolX",
    "_JavaScriptKey": "eDoqTmoIS6Ofpf0OAgNdYKGm9TBs2fVv9MR8lS5u",
    "_ClientVersion": "js1.6.14",
    "_InstallationId": "02f7b7dd-31c7-b235-df1d-93c323dbcd60",
}
url = 'https://cloud.baladovore.com/parse/classes/Address'

# json= serializes the dict and sends it as a JSON request body.
test_req = requests.post(url, json=access_data)
print(test_req.status_code)
print(test_req.json())

This outputs the status code and the decoded JSON response, which you can then work with.
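As a rough sketch of working with that decoded response: Parse-style query endpoints usually wrap matches in a "results" list and report failures with "code" and "error" keys. That layout is an assumption here, not something confirmed for this server, and first_result is a hypothetical helper name.

```python
def first_result(decoded):
    """Return the first record from a Parse-style query response, or None.

    Assumes the response looks like {"results": [...]} on success and
    {"code": ..., "error": "..."} on failure (an assumption, not verified
    against this particular server).
    """
    if "error" in decoded:
        raise RuntimeError(
            f"Server returned an error: {decoded.get('code')} {decoded['error']}"
        )
    results = decoded.get("results", [])
    return results[0] if results else None


# Made-up sample data, for illustration only.
sample = {"results": [{"objectId": "sNRgAcGKiY"}]}
print(first_result(sample))
```

With the snippet above, this would be called as first_result(test_req.json()). The field names inside the record depend on what the server actually returns.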

I do not know how _JavaScriptKey behaves (for example, whether it changes over time), so you will need to look into that.

If you insist on using Scrapy, you will need to read the documentation on how to set request bodies.
