Decoupling a single spider into different spiders in Scrapy
I'd like to decouple the parsing into different spiders.
Currently I have:
class CategoriesSpider(scrapy.Spider):
    name = 'categories'
    allowed_domains = ['example.org']
    start_urls = ['https://example.org/categories']

    def parse(self, response):
        for link in response.xpath("//div[@class='something']/@href"):
            yield scrapy.Request(response.urljoin(link.root), callback=self.parse_actual_item_i_want)

    def parse_actual_item_i_want(self, response):
        yield self.find_the_item(response)
And I now want to split it into:
class OneThingSpider(scrapy.Spider):
    name = 'one_thing'
    allowed_domains = ['example.org']
    start_urls = ['https://example.org/']

    def __init__(self, url: str):
        if url is None or url == "":
            raise ValueError("Invalid url given")
        # Exact URL and its format is not known
        self.start_urls = [url]

    def parse(self, response):
        yield self.find_the_item(response)
So that if only one thing is updated, I can use just OneThingSpider.
So where and how do I call OneThingSpider inside CategoriesSpider or in a pipeline?
Attempt #1, in CategoriesSpider:

def parse(self, response):
    for link in response.xpath("//div[@class='something']/@href"):
        yield self.crawler.crawl("one_thing", response.urljoin(link.root))
Attempt #2, in CategoriesSpider:

def parse(self, response):
    for link in response.xpath("//div[@class='something']/@href"):
        yield CategoryUrl({"url": response.urljoin(link.root)})
In the pipeline:

class MyPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, CategoryUrl):
            spider.crawler.crawl("one_thing", item["url"])
            return
        # this part works
        if isinstance(item, ItemIWant):
            with open(...)
Attempt #3: the same change in CategoriesSpider as in attempt #2. Here we are simply trying to push the URLs to OneThingSpider after CategoriesSpider has closed.
In the pipeline:

class MyPipeline(object):
    urls = []

    def process_item(self, item, spider):
        if isinstance(item, CategoryUrl):
            self.urls.append(item["url"])
            spider.crawler.crawl("one_thing", item["url"])
            return
        # this part works
        if isinstance(item, ItemIWant):
            with open(...)

    def close_spider(self, spider):
        if spider.name == "categories":
            for url in self.urls:
                spider.crawler.crawl("one_thing", url)
The error I get is that the crawler is already crawling. So what is the correct way to decouple a spider into smaller spiders?
The goal is that I can run:

scrapy crawl categories

and also:

scrapy crawl one_thing -a url="https://example.org/something/xyz.html"
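For reference, calling spider.crawler.crawl(...) from a running spider or pipeline is what raises the "already crawling" error: spider.crawler is the Crawler that is currently executing the spider, and a Crawler can only run one crawl at a time. The documented way to run several spiders from one script is scrapy.crawler.CrawlerProcess. A minimal sketch, assuming it is run from the project root so the spider names can be resolved from the project settings:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# CrawlerProcess accepts spider names when built from project settings
process = CrawlerProcess(get_project_settings())
process.crawl("categories")
process.crawl("one_thing", url="https://example.org/something/xyz.html")
process.start()  # blocks until all scheduled crawls finish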
Usually you'd use class inheritance to reuse spider code across multiple spiders:
from scrapy import Spider, Request

class BaseSpider(Spider):
    start_urls = NotImplemented
    start_body = ""

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, method='POST', body=self.start_body)

    def parse(self, response):
        raise NotImplementedError

# then children spiders
class CakeSpider(BaseSpider):
    name = 'cakes'
    start_urls = ['http://example1.com']
    start_body = '{"category": "cakes"}'

    def parse(self, response):
        ...  # custom parser for first spider here

class VegetableSpider(BaseSpider):
    name = 'vegetables'
    start_urls = ['http://example2.com']
    start_body = '{"category": "vegetables"}'

    def parse(self, response):
        ...  # custom parser for second spider here
As a result you'd run scrapy crawl cakes to start crawling the cake category, and scrapy crawl vegetables to crawl the vegetable category.
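Applied to the spiders from the question, the same pattern could look roughly like the sketch below. This is only an illustration, not the answerer's code; it assumes find_the_item is the asker's existing (unshown) extraction helper, moved into a shared base class:

import scrapy

class BaseItemSpider(scrapy.Spider):
    allowed_domains = ['example.org']

    def parse_item(self, response):
        # shared extraction logic; find_the_item is the asker's helper
        yield self.find_the_item(response)

class CategoriesSpider(BaseItemSpider):
    name = 'categories'
    start_urls = ['https://example.org/categories']

    def parse(self, response):
        # follow each category link and hand it to the shared parser
        for link in response.xpath("//div[@class='something']/@href"):
            yield scrapy.Request(response.urljoin(link.root),
                                 callback=self.parse_item)

class OneThingSpider(BaseItemSpider):
    name = 'one_thing'

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not url:
            raise ValueError("Invalid url given")
        self.start_urls = [url]

    def parse(self, response):
        # a single page is parsed directly with the shared logic
        return self.parse_item(response)

With this, scrapy crawl categories crawls everything, while scrapy crawl one_thing -a url="https://example.org/something/xyz.html" re-parses just one page, and neither spider ever needs to start the other.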