
Building a RESTful Flask API for Scrapy

The API should allow arbitrary HTTP GET requests containing URLs the user wants scraped, and then Flask should return the results of the scrape.

The following code works for the first HTTP request, but after the Twisted reactor stops, it won't restart. I may not even be going about this the right way, but I just want to put a RESTful Scrapy API up on Heroku, and what I have so far is all I can think of.

Is there a better way to architect this solution? Or how can I allow scrape_it to return without stopping the Twisted reactor (which can't be started again)?

from flask import Flask
import os
import sys
import json

from n_grams.spiders.n_gram_spider import NGramsSpider

# scrapy api
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

app = Flask(__name__)


def scrape_it(url):
    items = []
    def add_item(item):
        items.append(item)

    runner = CrawlerRunner()

    d = runner.crawl(NGramsSpider, [url])
    d.addBoth(lambda _: reactor.stop()) # <<< TROUBLES HERE ???

    dispatcher.connect(add_item, signal=signals.item_passed)

    reactor.run(installSignalHandlers=0) # the script will block here until the crawling is finished


    return items

@app.route('/scrape/<path:url>')
def scrape(url):

    ret = scrape_it(url)

    return json.dumps(ret, ensure_ascii=False, encoding='utf8')


if __name__ == '__main__':
    PORT = os.environ['PORT'] if 'PORT' in os.environ else 8080

    app.run(debug=True, host='0.0.0.0', port=int(PORT))

I think there is no good way to create a Flask-based API for Scrapy. Flask is not the right tool for that because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started/stopped more than once in a single thread.

Let's assume there were no problem with the Twisted reactor and you could start and stop it. It wouldn't make things much better, because your scrape_it function may block for an extended period of time, so you would need many threads/processes.

I think the way to go is to create the API using an async framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API will be able to serve requests while Scrapy is running a spider.

Scrapy is based on Twisted, so using twisted.web or https://github.com/twisted/klein can be more straightforward. But Tornado is also not hard, because you can make it use the Twisted event loop.
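For illustration, here is a minimal sketch of what a Klein-based endpoint might look like, reusing NGramsSpider from the question. The route name and the item-collection helper are my own assumptions; the key point is that the handler returns a Deferred, so the reactor is started once and never stopped between requests.

import json

from klein import Klein
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from n_grams.spiders.n_gram_spider import NGramsSpider

app = Klein()

@app.route('/scrape/<path:url>')
def scrape(request, url):
    items = []

    runner = CrawlerRunner()
    crawler = runner.create_crawler(NGramsSpider)
    # Collect items through Scrapy's signal manager instead of pydispatch.
    crawler.signals.connect(lambda item: items.append(dict(item)),
                            signal=signals.item_scraped)

    d = runner.crawl(crawler, [url])
    # Returning the Deferred lets Klein send the response when the crawl
    # finishes; the shared reactor keeps running for the next request.
    d.addCallback(lambda _: json.dumps(items, ensure_ascii=False))
    return d

if __name__ == '__main__':
    app.run('0.0.0.0', 8080)  # starts the Twisted reactor once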

There is a project called ScrapyRT which does something very similar to what you want to implement - it is an HTTP API for Scrapy. ScrapyRT is based on Twisted.
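If I remember correctly, querying a running ScrapyRT service looks roughly like this (it listens on port 9080 by default and exposes a /crawl.json endpoint; the spider name below is hypothetical):

import requests

# Ask ScrapyRT to run the named spider against a URL and return the items.
resp = requests.get('http://localhost:9080/crawl.json',
                    params={'spider_name': 'n_gram_spider',
                            'url': 'http://example.com'})
data = resp.json()
print(data.get('items'))  # scraped items are returned under the 'items' key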

As an example of Scrapy-Tornado integration, check Arachnado - here is an example of how to integrate Scrapy's CrawlerProcess with Tornado's Application.

If you really want a Flask-based API, then it could make sense to start crawls in separate processes and/or use a queue solution like Celery. This way you lose most of Scrapy's efficiency; if you go this way you can use requests + BeautifulSoup as well.
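As a rough sketch of the "separate processes" variant (everything except NGramsSpider is an assumption): spawn a child process per request, so the Twisted reactor lives and dies inside that child rather than inside the Flask worker.

import json
from multiprocessing import Process, Queue

from flask import Flask
from scrapy import signals
from scrapy.crawler import CrawlerProcess

from n_grams.spiders.n_gram_spider import NGramsSpider

app = Flask(__name__)

def run_spider(url, queue):
    # Runs inside a child process, so the Twisted reactor is fresh every time.
    items = []
    process = CrawlerProcess()
    crawler = process.create_crawler(NGramsSpider)
    crawler.signals.connect(lambda item: items.append(dict(item)),
                            signal=signals.item_scraped)
    process.crawl(crawler, [url])
    process.start()  # blocks this child process until the crawl is done
    queue.put(items)

@app.route('/scrape/<path:url>')
def scrape(url):
    queue = Queue()
    worker = Process(target=run_spider, args=(url, queue))
    worker.start()
    items = queue.get()  # blocks the Flask worker until the child finishes
    worker.join()
    return json.dumps(items, ensure_ascii=False)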

I was working on a similar project last week - an SEO service API - and my workflow was like this (a rough sketch follows the list):

  • The client sends a request to the Flask-based server with a URL to scrape and a callback URL to notify the client when scraping is done (the client here is another web app).
  • Run Scrapy in the background using Celery. The spider will save the data to the database.
  • The background service will notify the client by calling the callback URL when the spider is done.
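Something along these lines, with all task and endpoint names hypothetical and the database pipeline omitted. Note that the Celery worker should run with --max-tasks-per-child=1 so each crawl gets a fresh worker process and therefore a fresh Twisted reactor.

import requests
from celery import Celery
from flask import Flask, jsonify, request
from scrapy.crawler import CrawlerProcess

from n_grams.spiders.n_gram_spider import NGramsSpider

flask_app = Flask(__name__)
celery_app = Celery('scrape_tasks', broker='redis://localhost:6379/0')

@celery_app.task
def run_crawl(url, callback_url):
    # Each crawl needs its own process/reactor, hence --max-tasks-per-child=1.
    process = CrawlerProcess()
    process.crawl(NGramsSpider, [url])
    process.start()  # items are persisted by the spider/pipelines while this blocks
    # Tell the client that scraping is finished.
    requests.post(callback_url, json={'url': url, 'status': 'done'})

@flask_app.route('/scrape', methods=['POST'])
def scrape():
    payload = request.get_json()
    run_crawl.delay(payload['url'], payload['callback_url'])
    return jsonify({'status': 'queued'}), 202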
