
Running multiple spiders in scrapy - spider not found

As the title suggests, I'm trying to use multiple spiders in scrapy. One spider, news_spider, works using the command

scrapy crawl news_spider -o news.json

It produces the exact result I expect.

However, when I try to use the spider quotes_spider with the following command

scrapy crawl quotes_spider -o quotes.json

I receive the following message: "Spider not found: quotes_spider".

And just for some history: I created quotes_spider first and it was working. I then duplicated it as news_spider and edited it, at which time I moved quotes_spider out of the spiders directory. Now that I have news_spider working, I moved quotes_spider back into the spiders directory and got the above ERROR message.

The directory tree looks like this:

tutorial
├── news.json
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-37.pyc
    │   ├── items.cpython-37.pyc
    │   └── settings.cpython-37.pyc
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── quotes.jl
    ├── quotes.json
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-37.pyc
        │   ├── news_spider.cpython-37.pyc
        │   └── quotes_spider.cpython-37.pyc
        ├── news_spider.py
        └── quotes_spider.py

News Spider:

import scrapy
from scrapy.exporters import JsonLinesItemExporter
from tutorial.items import TutorialItem

# Scrapy Spider
class FinNewsSpider(scrapy.Spider):
    # Initializing log file
    # logfile("news_spider.log", maxBytes=1e6, backupCount=3)
    name = "news_spider"
    allowed_domains = ['benzinga.com/']
    start_urls = [
        'https://www.benzinga.com/top-stories/20/09/17554548/stock-wars-ford-vs-general-motors-vs-tesla'
    ]

# MY SCRAPY STUFF
# response.xpath('//div[@class="article-content-body-only"]/p/text()').extract()
    def parse(self, response):
        paragraphs = response.xpath('//div[@class="article-content-body-only"]/p/text()').extract()
        print(paragraphs)
        for p in paragraphs:
            yield TutorialItem(content=p)

Quotes Spider:

import scrapy
from scrapy.exporters import JsonLinesItemExporter

class QuotesSpider(scrapy.Spider):
    name = "quotes"

#### Actually don't have to use the start_requests function since it's built in. Can just use start_urls
    # def start_requests(self):
    #     urls = [
    #         'http://quotes.toscrape.com/page/1/',
    #         'http://quotes.toscrape.com/page/2/'
    #     ]
    #     for url in urls:
    #         yield scrapy.Request(url=url, callback=self.parse)
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/'
    ]

#### Original parse to just get the entire page
    # def parse(self, response):
    #     page = response.url.split("/")[-2]
    #     filename = 'quotes-%s.html' % page
    #     with open(filename, 'wb') as f:
    #         f.write(response.body)
    #     self.log('Saved file %s' % filename)

#### Parse to actually gather targeted info
    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").get(),
                'author': quote.css("small.author::text").get(),
                'tags': quote.css("div.tags a.tag::text").getall()
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

I have searched SO and the answers I found regarding multiple spiders all seem related to running multiple spiders concurrently, which is not what I'm trying to do, so I have not found an answer to why one of these works and one does not. Can anyone see an error in my code that I might be overlooking?

The problem is how you are executing it. The name of your quotes spider is "quotes", not "quotes_spider":

class QuotesSpider(scrapy.Spider):
    name = "quotes"

Therefore the command to run it is:

scrapy crawl quotes -o quotes.json

Just like the name of your news spider is "news_spider":

class FinNewsSpider(scrapy.Spider):
    # Initializing log file
    # logfile("news_spider.log", maxBytes=1e6, backupCount=3)
    name = "news_spider"

And you execute it with:

scrapy crawl news_spider -o news.json
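
A quick way to double-check which names Scrapy has actually registered (a general Scrapy CLI tip, not part of the original answer) is the built-in scrapy list command, run from the project root. It prints the name attribute of every spider it finds, so a mismatch between a file name like quotes_spider.py and the spider's name attribute shows up immediately:

scrapy list
# expected output for the two spiders above:
# news_spider
# quotes

Renaming the spider to name = "quotes_spider" would also work; scrapy crawl only has to match whatever string is assigned to name, not the module or class name.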
