
Run scrapy spider from script

I would like to run my scrapy spider from a Python script. I can call my spider with the following code:

subprocess.check_output(['scrapy crawl mySpider'])

Up to there, all is well. But before that, I instantiate my spider class, initializing start_urls, and then the call to scrapy crawl doesn't work, since it doesn't find the variable start_urls.

from flask import Flask, jsonify, request
import scrapy
import subprocess

class ClassSpider(scrapy.Spider):
    name        = 'mySpider'
    #start_urls = []
    #pages      = 0
    news        = []

    def __init__(self, url, nbrPage):
        self.pages      = nbrPage
        self.start_urls = url

    def parse(self):
        ...

    def run(self):
        subprocess.check_output(['scrapy crawl mySpider'])
        return self.news

app = Flask(__name__)
data = []

@app.route('/', methods=['POST'])
def getNews():
    mySpiderClass = ClassSpider(request.json['url'], 2)

    data.append(mySpiderClass.run())
    return jsonify({'data': data})

if __name__ == "__main__":
    app.run(debug=True)

The error I get is: TypeError: __init__ missing 1 required positional argument: 'start_url' and 'pages'

Any help, please?

Another way to start your spider from a script (and provide arguments):

from scrapy.crawler import CrawlerProcess
from path.to.your.spider import ClassSpider
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(
    ClassSpider,
    url=start_urls,           # you need to define it somewhere
    nbrPage=number_of_pages,  # you need to define it somewhere
)
process.start()
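
Note that process.crawl() forwards any extra keyword arguments to the spider's __init__, and process.start() blocks until crawling has finished. A minimal sketch of values that would satisfy ClassSpider.__init__ (the URL is a placeholder, not from the original post):

start_urls = ['http://example.com']  # placeholder; Scrapy expects start_urls to be iterable
number_of_pages = 2                  # matches the spider's nbrPage parameter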

The reason you are getting this error message is that you start the crawling process with the command scrapy crawl mySpider, which creates a new instance of ClassSpider. It does so without passing url and nbrPage.
It could work if you replaced subprocess.check_output(['scrapy crawl mySpider']) with subprocess.check_output(['scrapy', 'crawl', 'mySpider', '-a', f'url={self.start_urls}', '-a', f'nbrPage={self.pages}']), passing each spider argument through its own -a flag. Also, you should make sure that start_urls is a list.
However, you would then still create two separate instances of the same spider, so I would suggest implementing run as a function that takes url and nbrPage as arguments; see the sketch below.
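
A rough sketch of that suggestion (names follow the code above; the int() cast and the default parameter values are assumptions, since -a values arrive at __init__ as strings):

import subprocess

import scrapy

class ClassSpider(scrapy.Spider):
    name = 'mySpider'

    def __init__(self, url=None, nbrPage=1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.pages = int(nbrPage)   # -a values are passed in as strings
        self.start_urls = [url]     # start_urls must be a list

    def parse(self, response):
        ...

def run(url, nbrPage):
    # the spider is instantiated only inside the scrapy subprocess,
    # and each -a key=value pair reaches its __init__ as a keyword argument
    subprocess.check_output(
        ['scrapy', 'crawl', 'mySpider',
         '-a', f'url={url}',
         '-a', f'nbrPage={nbrPage}'])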
There are also other methods of using Scrapy and Flask in the same script. For that purpose, check this question.
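
One such method, commonly given for that kind of question, is to drive Scrapy's CrawlerRunner from Flask through the crochet library, which runs Twisted's reactor in a background thread. A rough sketch, assuming crochet is installed and that ClassSpider accepts url and nbrPage keyword arguments:

from crochet import setup, wait_for
setup()  # starts Twisted's reactor in a background thread

from flask import Flask, jsonify, request
from scrapy.crawler import CrawlerRunner
from path.to.your.spider import ClassSpider  # same placeholder path as above

app = Flask(__name__)
runner = CrawlerRunner()

@wait_for(timeout=60.0)
def scrape(url, nbrPage):
    # runner.crawl() returns a Deferred; @wait_for blocks until the crawl completes
    return runner.crawl(ClassSpider, url=url, nbrPage=nbrPage)

@app.route('/', methods=['POST'])
def getNews():
    scrape(request.json['url'], 2)
    return jsonify({'data': ClassSpider.news})  # assumes news is collected on the class

if __name__ == '__main__':
    app.run(debug=True)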
