
Using Task Queues to schedule the fetching/parsing of a number of feeds in App Engine (Python)

Say I had over 10,000 feeds that I wanted to periodically fetch/parse. If the period were, say, 1 hour, that would be 24 × 10,000 = 240,000 fetches per day.

The current 10k limit of the labs Task Queue API would preclude one from setting up one task per fetch. How then would one do this?

Update: RE: fetching n URLs per task - given the 30-second timeout per request, this would hit a ceiling at some point. Is there any way to parallelize it, so that each task initiates a bunch of async parallel fetches, each of which takes less than 30 seconds to finish, even though the lot together may take longer than that?

Here's the asynchronous urlfetch API:

http://code.google.com/appengine/docs/python/urlfetch/asynchronousrequests.html

Set off a bunch of requests with a reasonable deadline (give yourself some headroom under your timeout, so that if one request times out you still have time to process the others). Then wait on each one in turn and process them as they complete.

I haven't used this technique myself in GAE, so you're on your own finding any non-obvious gotchas. Sadly there doesn't seem to be a select()-style call in the API to wait for the first of several requests to complete.
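
A minimal sketch of that pattern using the documented create_rpc/make_fetch_call/get_result calls; the fetch_feeds helper and the 20-second deadline are illustrative assumptions, not part of the original answer:

```python
from google.appengine.api import urlfetch

def fetch_feeds(urls):
    """Start all fetches at once, then collect results as they complete."""
    rpcs = []
    for url in urls:
        # Leave headroom under the 30s request limit: each fetch gets
        # at most 20s, so one slow feed can't eat the whole budget.
        rpc = urlfetch.create_rpc(deadline=20)
        urlfetch.make_fetch_call(rpc, url)
        rpcs.append((url, rpc))

    results = {}
    # There is no select()-style wait, so block on each RPC in turn;
    # the other fetches keep running in parallel in the meantime.
    for url, rpc in rpcs:
        try:
            results[url] = rpc.get_result().content
        except urlfetch.Error:
            results[url] = None  # timed out or unreachable; skip it
    return results
```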

2 fetches per task? 3?

Batch the fetches: rather than queuing one fetch per task, queue a unit of work that does 10 fetches.
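
A sketch of that batching, assuming a cron-triggered handler fans the feed URLs out into tasks (the /fetch_feeds worker path and the batch size of 10 are illustrative assumptions):

```python
from google.appengine.api import taskqueue

BATCH_SIZE = 10  # 10,000 feeds -> 1,000 tasks per run instead of 10,000

def enqueue_feed_batches(feed_urls):
    """Queue one task per batch of feeds instead of one task per feed."""
    for i in range(0, len(feed_urls), BATCH_SIZE):
        batch = feed_urls[i:i + BATCH_SIZE]
        taskqueue.add(
            url='/fetch_feeds',  # hypothetical worker handler
            params={'urls': '\n'.join(batch)})
```

The worker behind /fetch_feeds could then split the urls parameter and run its batch through something like the asynchronous fetch_feeds() sketch above, keeping each task under the 30-second limit.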
