scrapy.Request() prevents me from stepping into my function
Hello everyone~ I am new to Scrapy and I have run into a very strange problem. Briefly speaking, I find that scrapy.Request() prevents me from stepping into my function.
Here is my code:
    # -*- coding: utf-8 -*-
    import scrapy
    from tutor_job_spy.items import TutorJobSpyItem


    class Spyspider(scrapy.Spider):
        name = 'spy'
        # for privacy reasons I delete the url information :)
        allowed_domains = ['']
        url_0 = ''
        start_urls = [url_0, ]
        base_url = ''
        list_previous = []
        list_present = []

        def parse(self, response):
            numbers = response.xpath('//tr[@bgcolor="#d7ecff" or @bgcolor="#eef7ff"]/td[@width="8%" and @height="40"]/span/text()').extract()
            self.list_previous = numbers
            self.list_present = numbers
            yield scrapy.Request(self.url_0, self.keep_spying)

        def keep_spying(self, response):
            numbers = response.xpath('//tr[@bgcolor="#d7ecff" or @bgcolor="#eef7ff"]/td[@width="8%" and @height="40"]/span/text()').extract()
            self.list_previous = self.list_present
            self.list_present = numbers
            # judge if anything new
            if (self.list_present != self.list_previous):
                self.goto_new_demand(response)
            # time.sleep(60)  # from cache
            yield scrapy.Request(self.url_0, self.keep_spying, dont_filter=True)

        def goto_new_demand(self, response):
            new_demand_links = []
            detail_links = response.xpath('//div[@class="ShowDetail"]/a/@href').extract()
            for i in range(len(self.list_present)):
                if (self.list_present[i] not in self.list_previous):
                    new_demand_links.append(self.base_url + detail_links[i])
            if (new_demand_links != []):
                for new_demand_link in new_demand_links:
                    yield scrapy.Request(new_demand_link, self.get_new_demand)

        def get_new_demand(self, response):
            new_demand = TutorJobSpyItem()
            new_demand['url'] = response.url
            requirments = response.xpath('//tr[@#bgcolor="#eef7ff"]/td[@colspan="2"]/div/text()').extract()[0]
            new_demand['gender'] = self.get_gender(requirments)
            new_demand['region'] = response.xpath('//tr[@bgcolor="#d7ecff"]/td[@align="left"]/text()').extract()[5]
            new_demand['grade'] = response.xpath('//tr[@bgcolor="#d7ecff"]/td[@align="left"]/text()').extract()[7]
            new_demand['subject'] = response.xpath('//tr[@bgcolor="#eef7ff"]/td[@align="left"]/text()').extract()[2]
            return new_demand

        def get_gender(self, requirments):
            if ('女老师' in requirments):
                return 'F'
            elif ('男老师' in requirments):
                return 'M'
            else:
                return 'Both okay'
The problem is that when I debug, I cannot step into goto_new_demand:
    if (self.list_present != self.list_previous):
        self.goto_new_demand(response)
Every time I run or debug the script, it just skips goto_new_demand. But after I comment out
    yield scrapy.Request(new_demand_link, self.get_new_demand)
in goto_new_demand, I can step into it. I have tried many times and found that I can step into goto_new_demand only when that yield statement is not in it. Why does that happen?
Thanks in advance to anyone who can give advice :)
PS:
Scrapy : 1.5.0
lxml : 4.1.1.0
libxml2 : 2.9.5
cssselect : 1.0.3
parsel : 1.3.1
w3lib : 1.18.0
Twisted : 17.9.0
Python : 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)]
pyOpenSSL : 17.5.0 (OpenSSL 1.1.0g 2 Nov 2017)
cryptography : 2.1.4
Platform : Windows-7-6.1.7601-SP1
Problem solved!
I changed goto_new_demand from a generator into an ordinary function that returns a list. So the problem resulted entirely from my poor understanding of yield and generators.
Here is the modified code:
            if (self.list_present != self.list_previous):
                # yield self.goto_new_demand(response)
                new_demand_links = self.goto_new_demand(response)
                if (new_demand_links != []):
                    for new_demand_link in new_demand_links:
                        yield scrapy.Request(new_demand_link, self.get_new_demand)

        def goto_new_demand(self, response):
            new_demand_links = []
            detail_links = response.xpath('//div[@class="ShowDetail"]/a/@href').extract()
            for i in range(len(self.list_present)):
                if (self.list_present[i] not in self.list_previous):
                    new_demand_links.append(self.base_url + detail_links[i])
            return new_demand_links
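The link-diffing step in goto_new_demand can also be tried outside of Scrapy. The following is only a minimal sketch: the function name find_new_links, the sample numbers and the example.com URLs are all made up for illustration, and it pairs entries with zip instead of indexing, so a short detail_links list is silently truncated rather than raising IndexError as the original loop would.

```python
def find_new_links(previous, present, detail_links, base_url):
    """Return full URLs for entries of `present` that do not occur in `previous`.

    Hypothetical helper: assumes detail_links[i] belongs to present[i],
    as the spider code above does.
    """
    seen = set(previous)  # set lookup instead of repeated list scans
    return [
        base_url + link
        for number, link in zip(present, detail_links)
        if number not in seen
    ]


# Made-up sample data standing in for the scraped listing page.
previous = ["101", "102"]
present = ["103", "101", "102"]
detail_links = ["/detail/103", "/detail/101", "/detail/102"]

new_links = find_new_links(previous, present, detail_links, "http://example.com")
print(new_links)
```

Because the helper returns a plain list, the calling callback stays the only place that yields requests, which is what makes the fixed version easy to step through in a debugger.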
The reason lies in the answer from Ballack.
The correct way to debug Scrapy spiders is described in the documentation. An especially useful technique is using Scrapy Shell to inspect the responses.
I think you may need to change this statement
    if (self.list_present != self.list_previous):
        self.goto_new_demand(response)
to:
    if (self.list_present != self.list_previous):
        yield from self.goto_new_demand(response)
because self.goto_new_demand() is a generator function (it has a yield statement in its body), so simply calling self.goto_new_demand(response) only creates a generator object and runs none of its code. Note that a plain yield self.goto_new_demand(response) would hand Scrapy the generator object itself rather than the requests it produces, so delegate with yield from (available since Python 3.3) or loop over the generator and yield each request.
A simple example may make this clearer:
    def a():
        print("hello")

    # calling a() prints hello
    a()
but for a generator function, calling it just returns a generator object:
    def a():
        yield
        print("hello")

    # calling a() does not print hello; it returns a generator object instead
    a()
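To see exactly when a generator body runs, here is a small self-contained sketch. The names greet and log are made up for illustration; the list records side effects so the timing can be checked without relying on printed output.

```python
import inspect

log = []  # records when the function body actually executes


def greet():
    log.append("body started")
    yield "hello"
    log.append("body finished")


gen = greet()                 # only creates a generator object
ran_on_call = list(log)       # snapshot: nothing has run yet
first = next(gen)             # runs the body up to the first yield
ran_after_next = list(log)
rest = list(gen)              # resumes after the yield and runs to the end

print(inspect.isgeneratorfunction(greet), ran_on_call, first, ran_after_next, log)
```

This is also why a debugger cannot step into such a call: until something iterates over the generator, there is no executing body to step into.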
So, in Scrapy, you should use yield from self.goto_new_demand(response) so that goto_new_demand(response) actually runs and the requests it yields are passed on from the callback.
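The difference between the possible ways of calling a generator helper from a callback can be checked in plain Python. This is only a sketch under stated assumptions: strings stand in for scrapy.Request objects, list() stands in for whatever consumes the callback's output, and the three callback_* names are hypothetical.

```python
import inspect


def goto_new_demand(links):
    # Stand-in for the spider's generator method; strings replace Request objects.
    for link in links:
        yield "Request(%s)" % link


def callback_bare_call(links):
    goto_new_demand(links)               # generator is created and thrown away
    yield "Request(listing)"


def callback_yield_generator(links):
    yield goto_new_demand(links)         # yields the generator object itself
    yield "Request(listing)"


def callback_yield_from(links):
    yield from goto_new_demand(links)    # yields each inner request in turn
    yield "Request(listing)"


links = ["a", "b"]
bare = list(callback_bare_call(links))
nested = list(callback_yield_generator(links))
delegated = list(callback_yield_from(links))
nested_is_generator = inspect.isgenerator(nested[0])

print(bare)
print(nested_is_generator)
print(delegated)
```

Only the yield from variant delivers the inner requests one by one; the bare call drops them, and yielding the generator delivers a single object that is neither a request nor an item.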