
Run scrapy spider from script

I would like to run my scrapy spider from a Python script. I can call my spider with the following code:

subprocess.check_output(['scrapy crawl mySpider'])

Up to there, all is well. But before that, I instantiate my spider class, initializing start_urls, and then the call to scrapy crawl doesn't work, since it doesn't find the variable start_urls.

from flask import Flask, jsonify, request
import scrapy
import subprocess

class ClassSpider(scrapy.Spider):
    name        = 'mySpider'
    #start_urls = []
    #pages      = 0
    news        = []

    def __init__(self, url, nbrPage):
        self.pages      = nbrPage
        self.start_urls = url

    def parse(self):
        ...

    def run(self):
        subprocess.check_output(['scrapy crawl mySpider'])
        return self.news

app = Flask(__name__)
data = []

@app.route('/', methods=['POST'])
def getNews():
    mySpiderClass = ClassSpider(request.json['url'], 2)

    data.append(mySpiderClass.run())
    return jsonify({'data': data})

if __name__ == "__main__":
    app.run(debug=True)

The error I get is: TypeError: __init__ missing 1 required positional argument: 'start_url' and 'pages'

Any help, please?

Another way to start your spider from a script (and provide arguments):

from scrapy.crawler import CrawlerProcess
from path.to.your.spider import ClassSpider
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(
    ClassSpider,
    url=start_urls,           # you need to define it somewhere
    nbrPage=number_of_pages,  # you need to define it somewhere
)
process.start()
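
Note that process.crawl() forwards any extra keyword arguments to the spider's __init__, and process.start() blocks until crawling has finished. A minimal sketch of values that would satisfy ClassSpider.__init__ (the URL is a placeholder, not from the original post):

start_urls = ['http://example.com']  # placeholder; Scrapy expects start_urls to be iterable
number_of_pages = 2                  # matches the spider's nbrPage parameter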

The reason you are getting this error message is that you start the crawling process with the command scrapy crawl mySpider, which creates a new instance of ClassSpider. It does so without passing url and nbrPage.
It could work if you replaced subprocess.check_output(['scrapy crawl mySpider']) with subprocess.check_output(['scrapy', 'crawl', 'mySpider', '-a', f'url={self.start_urls}', '-a', f'nbrPage={self.pages}']), passing each spider argument through its own -a flag. Also, you should make sure that start_urls is a list.
However, you would then still create two separate instances of the same spider, so I would suggest implementing run as a function that takes url and nbrPage as arguments; see the sketch below.
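
A rough sketch of that suggestion (names follow the code above; the int() cast and the default parameter values are assumptions, since -a values arrive at __init__ as strings):

import subprocess

import scrapy

class ClassSpider(scrapy.Spider):
    name = 'mySpider'

    def __init__(self, url=None, nbrPage=1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.pages = int(nbrPage)   # -a values are passed in as strings
        self.start_urls = [url]     # start_urls must be a list

    def parse(self, response):
        ...

def run(url, nbrPage):
    # the spider is instantiated only inside the scrapy subprocess,
    # and each -a key=value pair reaches its __init__ as a keyword argument
    subprocess.check_output(
        ['scrapy', 'crawl', 'mySpider',
         '-a', f'url={url}',
         '-a', f'nbrPage={nbrPage}'])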
There are also other methods of using Scrapy and Flask in the same script. For that purpose, check this question.
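
One such method, commonly given for that kind of question, is to drive Scrapy's CrawlerRunner from Flask through the crochet library, which runs Twisted's reactor in a background thread. A rough sketch, assuming crochet is installed and that ClassSpider accepts url and nbrPage keyword arguments:

from crochet import setup, wait_for
setup()  # starts Twisted's reactor in a background thread

from flask import Flask, jsonify, request
from scrapy.crawler import CrawlerRunner
from path.to.your.spider import ClassSpider  # same placeholder path as above

app = Flask(__name__)
runner = CrawlerRunner()

@wait_for(timeout=60.0)
def scrape(url, nbrPage):
    # runner.crawl() returns a Deferred; @wait_for blocks until the crawl completes
    return runner.crawl(ClassSpider, url=url, nbrPage=nbrPage)

@app.route('/', methods=['POST'])
def getNews():
    scrape(request.json['url'], 2)
    return jsonify({'data': ClassSpider.news})  # assumes news is collected on the class

if __name__ == '__main__':
    app.run(debug=True)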
