Getting an empty response from the Scrapy shell while crawling monsterindia.com

Question

I am trying to crawl a few pages from monsterindia.com, but whenever I run an XPath in the Scrapy shell, I get an empty result. There should be some way to do this, because the view(response) command shows me the same HTML page.

I ran this command:

scrapy shell "https://www.monsterindia.com/search/computer-jobs"

in my terminal and then tried several different XPaths, such as response.xpath('//*[@class="job-tittle"]/text()').extract(), but had no luck: the result was always empty.

In the terminal:

scrapy shell "https://www.monsterindia.com/search/computer-jobs"

Then response.xpath('//div[@class="job-tittle"]/text()').extract() returned an empty result.

Then response.xpath('//*[@class="card-apply-content"]/text()').extract() also returned an empty result.

I expect it to return some results, that is, the text from the website after crawling. Please help me with this.

Answer 1

The data you're looking for isn't in the initial HTML page, but in the responses retrieved after the page loads. If you check "View Page Source" in your browser, you will see what actually came back from the first request.
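One way to confirm this from code is to test whether the class name appears anywhere in the raw HTML of the first response. The sketch below uses only the standard library. The two HTML strings are made-up stand-ins, not real monsterindia.com markup, and class_in_html is a hypothetical helper name.

```python
from html.parser import HTMLParser


class _ClassCollector(HTMLParser):
    """Collects every CSS class name seen in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def class_in_html(html, class_name):
    """Return True if any element in the raw HTML carries class_name."""
    collector = _ClassCollector()
    collector.feed(html)
    return class_name in collector.classes


# Made-up examples: a server-rendered page versus an empty shell that
# JavaScript fills in later (the case where an XPath finds nothing).
rendered = '<div class="job-tittle"><h3>Engineer</h3></div>'
shell = '<div id="app"></div><script src="bundle.js"></script>'

print(class_in_html(rendered, "job-tittle"))
print(class_in_html(shell, "job-tittle"))
```

In the Scrapy shell, the same check would be class_in_html(response.text, "job-tittle"). If it returns False, no XPath on that class can match, and the data has to come from a later request.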

And by inspecting the Network tab in dev tools, you will see the further requests, like the one to this URL: https://www.monsterindia.com/middleware/jobsearch?query=computer&sort=1&limit=25
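The query string of that URL can be assembled in code rather than typed by hand. This is a minimal sketch: build_search_url is a hypothetical helper, and the meaning of query, sort and limit is only inferred from the URL above, not from any documented API.

```python
from urllib.parse import urlencode

# Endpoint path taken from the URL seen in the Network tab.
BASE_URL = "https://www.monsterindia.com/middleware/jobsearch"


def build_search_url(query, sort=1, limit=25):
    """Build the search URL; parameter meanings are assumptions."""
    params = {"query": query, "sort": sort, "limit": limit}
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("computer"))
```

Passing the result to scrapy shell, or to a Request in a spider, fetches the JSON response instead of the empty HTML shell.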

Answer 2

I think what Thiago was getting at is that the page updates via XHR requests, which include a query string parameter for the number of results. The endpoint returns JSON that you can parse, so change your URL to that one and handle the JSON accordingly.

Using requests to demonstrate:

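import requests
from bs4 import BeautifulSoup as bs
import json

# Request the XHR endpoint directly; limit controls how many results come back
r = requests.get('https://www.monsterindia.com/middleware/jobsearch?query=computer&sort=1&limit=100')
# The lxml parser wraps the raw JSON body in a <p> tag, so read the text back out
soup = bs(r.content, 'lxml')
data = json.loads(soup.select_one('p').text)['jobSearchResponse']['data']

# Each item is a dict describing one job listing
for item in data:
    print(item)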

JSON of first item:

https://jsoneditoronline.org/?id=fe49c53efe10423a8d49f9b5bdf4eb36

With Scrapy:

jsonres = json.loads(response.body_as_unicode())  # newer Scrapy versions: response.text
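Once the body is decoded, the listings can be pulled out with the same key path used in the requests example. The following is a sketch only: extract_jobs is a hypothetical helper, the sample payload is invented for illustration, and the "title" field is an assumption about what each item contains.

```python
import json


def extract_jobs(payload_text):
    """Return the list under jobSearchResponse -> data, or [] if absent."""
    payload = json.loads(payload_text)
    return payload.get("jobSearchResponse", {}).get("data", [])


# Invented sample that mimics the shape of the endpoint's response.
sample = json.dumps({
    "jobSearchResponse": {
        "data": [
            {"title": "Data Engineer"},   # "title" is a hypothetical field
            {"title": "Support Analyst"},
        ]
    }
})

for job in extract_jobs(sample):
    print(job["title"])
```

Inside a Scrapy callback this would become a loop over extract_jobs(response.text) that yields each item. Using .get with defaults means an unexpected payload gives an empty list rather than a KeyError.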

Answer 1: score 2, posted 2019-04-12 16:56:44
Answer 2: score 1, accepted, posted 2019-04-12 17:44:13
