I am trying to scrape multiple pages into one item:
A
|-- a
|-- b
|-- c
B
|-- a
...
By scraping page A and its subpages (a, b, c) I'll get one item. My code is large, but here is a shortened version:
import scrapy

class MySpider(scrapy.Spider):
    def parse(self, response):
        for li in response.xpath('//li'):
            item = MyItem()
            ...
            # carry the partially-filled item through the chain of requests
            meta = {
                'item': item,
                'href': href,
            }
            url = response.urljoin(href + '?a')
            yield scrapy.Request(url, callback=self.parse_a, meta=meta)

    def parse_a(self, response):
        ...
        url = response.urljoin(href + '?b')
        yield scrapy.Request(url, callback=self.parse_b, meta=meta)

    def parse_b(self, response):
        ...
        url = response.urljoin(href + '?c')
        yield scrapy.Request(url, callback=self.parse_c, meta=meta)

    def parse_c(self, response):
        ...
        yield item
The script works fine, but here is the problem: the crawler scrapes pages in the following order: A, B, C, Aa, Ba, Ca, Ab, Bb, ... Since there are too many pages, nothing is saved until all of them are scraped. And when I change yield to return in the parse method, it scrapes in the order I want (A, Aa, Ab, Ac), but it doesn't scrape B, C, ...
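One commonly suggested approach (not from the original post) is to tune Scrapy's scheduler rather than changing yield to return. Scrapy pops requests LIFO by default, which is roughly depth-first, but concurrency interleaves pages; limiting concurrency makes the A, Aa, Ab, Ac order far more likely. The snippet below is a hedged settings.py sketch using documented Scrapy settings:

```python
# settings.py -- a sketch, assuming a default Scrapy project layout.
# Scrapy's scheduler is LIFO by default (roughly depth-first), but
# concurrent requests interleave branches; serializing requests makes
# the crawl follow one branch (A, Aa, Ab, Ac) before starting the next:
CONCURRENT_REQUESTS = 1

# Conversely, for an explicit breadth-first crawl, the Scrapy docs
# describe switching to FIFO queues:
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
# SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```

Note that CONCURRENT_REQUESTS = 1 trades away throughput for ordering, so it only makes sense if the output order genuinely matters.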
If you want to force this kind of order, the only way I can think of right now is to specify the order in the Item Pipeline, so that you get the items back as Ac, Bc, Cc, ...
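As a rough illustration of that idea, the hypothetical pipeline below (the class name, the 'href' sort key, and returning the buffer from close_spider are all my assumptions, not from the answer) buffers every item and emits them in a fixed order only once the spider finishes:

```python
# A minimal sketch of an ordering pipeline, assuming each item
# carries an 'href' field that defines the desired output order.
class OrderedExportPipeline:
    def __init__(self):
        self.buffer = []  # items arrive here in crawl order

    def process_item(self, item, spider):
        # collect items instead of exporting them immediately
        self.buffer.append(item)
        return item

    def close_spider(self, spider):
        # once the crawl ends, sort and hand back the items in order;
        # a real pipeline would write them to a feed or file here
        self.buffer.sort(key=lambda i: i['href'])
        return self.buffer
```

The trade-off is the same one described in the question: nothing is written until the whole crawl finishes, so this fixes the order but not the "nothing is saved until the end" behavior.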