Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback
Basically the code below scrapes the first 5 items of a table. One of the fields is another href, and clicking on that href provides more info which I want to collect and add to the original item. So parse is supposed to pass the semi-populated item to parse_next_page, which then scrapes the next bit and should return the completed item back to parse.

Running the code below only returns the info collected in parse. If I change the return items to return request I get a completed item with all 3 "things", but I only get 1 of the rows, not all 5. I'm sure it's something simple, I just can't see it.
class ThingSpider(BaseSpider):
    name = "thing"
    allowed_domains = ["somepage.com"]
    start_urls = [
        "http://www.somepage.com"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for x in range(1, 6):
            item = ScrapyItem()
            str_selector = '//tr[@name="row{0}"]'.format(x)
            item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
            item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
            print 'hello'
            request = Request("www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
            print 'hello2'
            request.meta['item'] = item
            items.append(item)
        return items

    def parse_next_page(self, response):
        print 'stuff'
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
        return item
Install pyOpenSSL; sometimes Fiddler also creates problems for "https://*" requests. Close Fiddler if it is running and run the spider again.

Another problem in your code is that you are using a generator in the parse method but not using 'yield' to return the request to the Scrapy scheduler. You should do it like this:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for x in range(1, 6):
        item = ScrapyItem()
        str_selector = '//tr[@name="row{0}"]'.format(x)
        item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
        item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
        print 'hello'
        request = Request("www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
        if request:
            yield request
        else:
            yield item
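The distinction matters because parse must hand each Request back to the engine one at a time. Here is a plain-Python sketch (no Scrapy; parse_like and its tuple "requests" and "items" are hypothetical stand-ins) of how a generator delivers a mix of requests and items to its consumer:

```python
def parse_like(rows):
    # Hypothetical stand-in for a Scrapy parse method: yield a
    # "request" for rows that carry a detail link, and a finished
    # "item" for rows that do not. The consumer (Scrapy's engine)
    # receives each object as the loop produces it.
    for row in rows:
        if 'href' in row:
            yield ('request', row['href'])
        else:
            yield ('item', row)

out = list(parse_like([{'href': '/detail/1'}, {'name': 'b'}]))
# every object produced inside the loop reaches the consumer
```

Scrapy consumes parse in exactly this fashion: each yielded Request is scheduled, each yielded item goes to the pipelines.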
Sorry about the SSL and Fiddler things... they were not meant for you. I mixed up two answers here. :p Now, coming to your code, you said:

Running the code below only returns the info collected in parse

That's right, because you are returning a list of 5 items populated with 'thing1' and 'thing2'. Returning the items here will not cause the Scrapy engine to send the request to the callback 'parse_next_page', as shown below.
for x in range(1, 6):
    item = ScrapyItem()
    str_selector = '//tr[@name="row{0}"]'.format(x)
    item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
    item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
    print 'hello'
    request = Request("www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
    print 'hello2'
    request.meta['item'] = item
    items.append(item)
return items
Then you said:

If I change the return items to return request I get a completed item with all 3 "things" but I only get 1 of the rows, not all 5.

That's also true, because you are using 'return request' outside the loop, which executes only the last request created in the loop and not the first 4. So either make a list of requests and return it outside the loop, or use 'yield request' inside the loop. This should definitely work, as I have tested the same case myself. Returning items inside parse will not retrieve 'thing3'.

Simply apply either solution and your spider should run like a missile.
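The "only 1 of the rows" symptom is plain Python scoping, not Scrapy: a return outside the loop sees only whatever the final iteration left in the variable. A minimal sketch (the function names are hypothetical, just to contrast the two shapes):

```python
def parse_with_return_outside():
    # each iteration overwrites `request`; the return statement runs
    # once, after the loop, so only the last request survives
    for x in range(1, 6):
        request = 'request-for-row-%d' % x
    return request

def parse_with_yield_inside():
    # yielding inside the loop hands back all five, one per iteration
    for x in range(1, 6):
        yield 'request-for-row-%d' % x
```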
Oh... yarr... change the code to this:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for x in range(1, 6):
        item = ScrapyItem()
        str_selector = '//tr[@name="row{0}"]'.format(x)
        item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
        item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
        print 'hello'
        request = Request("www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
        print 'hello2'
        yield request
        # do not return or yield the item here.. only yield the request;
        # return the item in the callback

def parse_next_page(self, response):
    print 'stuff'
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
    return item
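The key mechanism here is meta: the half-built item rides along on the Request and comes back in response.meta inside the callback, where the final field is filled in. A toy simulation of that round trip (plain Python; FakeResponse and the field values are hypothetical stand-ins for Scrapy's Response and real scraped data):

```python
class FakeResponse(object):
    # stand-in for Scrapy's Response: carries the request's meta dict
    # through to the callback unchanged
    def __init__(self, meta):
        self.meta = meta

def parse_next_page(response):
    item = response.meta['item']     # retrieve the half-built item
    item['thing3'] = 'detail value'  # add the data from the second page
    return item

item = {'thing1': 'a', 'thing2': '/detail'}
completed = parse_next_page(FakeResponse({'item': item}))
```

Because meta holds a reference to the same dict built in parse, the callback completes the original item rather than a copy.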
I think it's pretty clear now...