Building a RESTful Flask API for Scrapy
The API should allow arbitrary HTTP GET requests containing URLs the user wants scraped, and then Flask should return the results of the scrape.
The following code works for the first HTTP request, but after the Twisted reactor stops, it won't restart. I may not even be going about this the right way, but I just want to put a RESTful Scrapy API up on Heroku, and what I have so far is all I can think of.
Is there a better way to architect this solution? Or how can I allow scrape_it to return without stopping the Twisted reactor (which can't be started again)?
from flask import Flask
import os
import sys
import json

from n_grams.spiders.n_gram_spider import NGramsSpider

# scrapy api
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

app = Flask(__name__)

def scrape_it(url):
    items = []

    def add_item(item):
        items.append(item)

    runner = CrawlerRunner()

    d = runner.crawl(NGramsSpider, [url])
    d.addBoth(lambda _: reactor.stop())  # <<< TROUBLES HERE ???
    dispatcher.connect(add_item, signal=signals.item_passed)

    reactor.run(installSignalHandlers=0)  # the script will block here until the crawling is finished

    return items

@app.route('/scrape/<path:url>')
def scrape(url):
    ret = scrape_it(url)
    return json.dumps(ret, ensure_ascii=False, encoding='utf8')

if __name__ == '__main__':
    PORT = os.environ['PORT'] if 'PORT' in os.environ else 8080
    app.run(debug=True, host='0.0.0.0', port=int(PORT))
I think there is no good way to create a Flask-based API for Scrapy. Flask is not the right tool for that because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started/stopped more than once in a single thread.
Let's assume there were no problem with the Twisted reactor and you could start and stop it. It wouldn't make things much better, because your scrape_it function may block for an extended period of time, and so you would need many threads/processes.
I think the way to go is to create the API using an async framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API will be able to serve requests while Scrapy is running a spider.
Scrapy is based on Twisted, so using twisted.web or https://github.com/twisted/klein can be more straightforward. But Tornado is also not hard, because you can make it use the Twisted event loop.
There is a project called ScrapyRT which does something very similar to what you want to implement - it is an HTTP API for Scrapy. ScrapyRT is based on Twisted.
As an example of Scrapy-Tornado integration, check Arachnado - here is an example of how to integrate Scrapy's CrawlerProcess with Tornado's Application.
If you really want a Flask-based API, then it could make sense to start crawls in separate processes and/or use a queue solution like Celery. This way you lose most of Scrapy's efficiency; if you go this way, you can just as well use requests + BeautifulSoup.
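If you do stay with Flask, the separate-process approach can be sketched with only the standard library. `run_spider` below is a hypothetical stub; in the real app it would run `CrawlerProcess` and put the scraped items on the queue. Because every request gets a fresh interpreter, the reactor-restart problem disappears.

```python
import json
from multiprocessing import Process, Queue

def run_spider(url, queue):
    # Hypothetical stub: the real version would run
    # CrawlerProcess(...).crawl(NGramsSpider, [url]) here and put the
    # collected items on the queue. The Twisted reactor starts and stops
    # inside this child process, so it never needs to restart.
    queue.put([{"url": url}])

def scrape_in_subprocess(url, timeout=60):
    # Launch the crawl in its own process and block until it reports back.
    queue = Queue()
    proc = Process(target=run_spider, args=(url, queue))
    proc.start()
    try:
        return queue.get(timeout=timeout)
    finally:
        proc.join()

if __name__ == '__main__':
    print(json.dumps(scrape_in_subprocess("http://example.com")))
```

The Flask view would then just call `scrape_in_subprocess(url)` and `json.dumps` the result; keep in mind that spawning a process per request adds noticeable latency, which is part of why the answer above recommends an async framework instead.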
I have been working on a similar project last week; it's an SEO service API. My workflow was like this: