
Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback

Basically the code below scrapes the first 5 items of a table. One of the fields is another href, and clicking on that href provides more info which I want to collect and add to the original item. So parse is supposed to pass the semi-populated item to parse_next_page, which then scrapes the next bit and should return the completed item back to parse.

Running the code below only returns the info collected in parse. If I change the return items to return request, I get a completed item with all 3 "things", but I only get 1 of the rows, not all 5. I'm sure it's something simple; I just can't see it.

class ThingSpider(BaseSpider):
    name = "thing"
    allowed_domains = ["somepage.com"]
    start_urls = [
        "http://www.somepage.com"
    ]

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []

    for x in range(1, 6):
        item = ScrapyItem()
        str_selector = '//tr[@name="row{0}"]'.format(x)
        item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
        item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
        print 'hello'
        request = Request("http://www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
        print 'hello2'
        request.meta['item'] = item
        items.append(item)

    return items


def parse_next_page(self, response):
    print 'stuff'
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
    return item

Install pyOpenSSL; sometimes Fiddler also creates problems for "https://*" requests. Close Fiddler if it is running and run the spider again. Another problem in your code is that you are using a generator in the parse method but not using 'yield' to return the request to the Scrapy scheduler. You should do it like this....

def parse(self, response):
    hxs = HtmlXPathSelector(response)

    for x in range(1, 6):
        item = ScrapyItem()
        str_selector = '//tr[@name="row{0}"]'.format(x)
        item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
        item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
        print 'hello'
        request = Request("http://www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
        if request:  # Request() always returns an object, so this branch always runs
            yield request
        else:
            yield item

Sorry about the SSL and Fiddler things.. they were not meant for you; I mixed two answers here. :p Now, coming to your code, you said:

Running the code below only returns the info collected in parse

That's right, because you are returning a list of 5 items populated with only 'thing1' and 'thing2'. Returning the items here will not cause the Scrapy engine to send the requests to the callback 'parse_next_page', as shown below.

for x in range(1, 6):
    item = ScrapyItem()
    str_selector = '//tr[@name="row{0}"]'.format(x)
    item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
    item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
    print 'hello'
    request = Request("http://www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
    print 'hello2'
    request.meta['item'] = item
    items.append(item)

return items

Then you said...

 If I change the return items to return request, I get a completed item with all 3 "things", but I only get 1 of the rows, not all 5.

That's also true, because you are using 'return request' outside the loop, which executes only the last request created in the loop and not the first 4. So either build a 'list of requests' and return it outside the loop (a sketch follows below), or use 'yield request' inside the loop. This should definitely work, as I have tested the same case myself. Returning the items inside parse will not retrieve 'thing3'.
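For reference, here is a minimal sketch of the first option (building a list of requests and returning it outside the loop). It reuses the same placeholder names as the rest of this question (ScrapyItem, HtmlXPathSelector, and the stand-in URL www.nextpage.com), so adjust them to your project:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    requests = []

    for x in range(1, 6):
        item = ScrapyItem()
        str_selector = '//tr[@name="row{0}"]'.format(x)
        item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
        item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
        # carry the half-populated item along with its follow-up request
        requests.append(Request("http://www.nextpage.com",
                                callback=self.parse_next_page,
                                meta={'item': item}))

    # returning the list hands all 5 requests to the scheduler;
    # each callback then completes and returns its own item
    return requests

Either way the effect is the same: all 5 requests reach the scheduler, and each completed item comes back from parse_next_page.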

Simply apply either solution and your spider should run like a missile....

Oh.. yarr.. change the code into this:

def parse(self, response):
    hxs = HtmlXPathSelector(response)

    for x in range(1, 6):
        item = ScrapyItem()
        str_selector = '//tr[@name="row{0}"]'.format(x)
        item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
        item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
        print 'hello'
        request = Request("http://www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
        print 'hello2'
        yield request
        # do not return or yield the item here; only yield the request.
        # the completed item is returned from the callback parse_next_page.


def parse_next_page(self, response):
    print 'stuff'
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
    return item

I think it's pretty clear now...
