Can I use cron jobs for my application (needs to be extremely scalable)?

I'm about to undertake a large project, where I'll need scheduled tasks (cron jobs) to run a script that will loop through my entire database of entities and make calls to multiple APIs such as Facebook, Twitter & Foursquare every 10 minutes. I need this application to be scalable.

I can already foresee a few potential pitfalls...

  1. Fetching data from APIs is slow.
  2. With thousands of records (constantly increasing) in my database, it's going to take too much time to process every record within 10 minutes.
  3. Some shared servers kill scripts that run for more than 30 seconds.
  4. Server strain from intensive scripts running constantly.

My question is: how should I structure my application?

  1. Could I create multiple cron jobs to handle small segments of my database (this will have to be automated)?
  2. This could require potentially thousands of cron jobs. Is that sustainable?
  3. How can I bypass the 30-second limit on some servers?
  4. Is there a better way to go about this?

Thanks!


Your best option is to design the application to make use of a distributed database, and deploy it on multiple servers.

You can design it to work in two "ranks" of servers, not unlike the map-reduce approach: lightweight servers that only perform queries and "pre-digest" some data ("map"), and servers that aggregate the data ("reduce").
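A minimal sketch of how the two ranks might divide the work, assuming PHP workers; every name here (map_worker, reduce_worker, the API URL and response fields) is illustrative, not a real API:

    <?php
    // Rank 1 ("map"): a lightweight worker that queries one external API
    // and emits a small, pre-digested record. The URL and fields are
    // placeholders for the real Facebook/Twitter/Foursquare calls.
    function map_worker(array $user): array
    {
        $raw  = file_get_contents('https://api.example.com/users/' . urlencode($user['id']));
        $data = json_decode($raw, true);

        // "Pre-digest": keep only what the aggregator actually needs.
        return [
            'user_id'  => $user['id'],
            'mentions' => count($data['mentions'] ?? []),
        ];
    }

    // Rank 2 ("reduce"): aggregates the digests produced by many map workers.
    function reduce_worker(array $digests): array
    {
        $totals = [];
        foreach ($digests as $d) {
            $totals[$d['user_id']] = ($totals[$d['user_id']] ?? 0) + $d['mentions'];
        }
        return $totals;
    }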

Once you do that, you can establish a performance baseline and calculate that, say, if one server can generate 2000 queries per minute and handle as many responses, then you need a new server for every 20,000 users. In that "generate 2000 queries per minute" you need to factor in (a worked example follows the list):

  • data retrieval from the database
  • traffic bandwidth from and to the control servers
  • traffic bandwidth to Facebook, Foursquare, Twitter etc.
  • necessity to log locally (and maybe distill and upload log digests to Command and Control)
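To make the arithmetic concrete, a back-of-the-envelope sketch using the figures above (the user count is made up):

    <?php
    // Capacity planning with the example baseline: each user must be
    // polled once per 10-minute cycle, so one server covers
    // 2000 queries/min * 10 min = 20,000 users per cycle.
    $queriesPerMinutePerServer = 2000;  // measured baseline of one server
    $cycleMinutes              = 10;    // polling interval per user
    $usersPerServer            = $queriesPerMinutePerServer * $cycleMinutes;

    $totalUsers    = 150000;            // hypothetical current user count
    $serversNeeded = (int) ceil($totalUsers / $usersPerServer);

    printf("%d users => %d server(s)\n", $totalUsers, $serversNeeded); // 8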

An advantage of this architecture is that you can start small - a testbed can be built with a single machine running Connector, Mapper, Reducer, Command and Control, and Persistence all together. When you grow, you just outsource different services to different servers.

On several distributed computing platforms, this also allows you to run queries faster by judiciously allocating Mappers geographically or connectivity-wise, and to reduce the traffic costs between your various platforms by playing with, e.g., Amazon "zones" (Amazon also has a message service that you might find valuable for communicating between the tasks).
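Assuming the message service meant here is Amazon SQS, handing a unit of work from one rank to another could look roughly like this with the AWS SDK for PHP (the queue URL and payload are hypothetical):

    <?php
    require 'vendor/autoload.php';

    use Aws\Sqs\SqsClient;

    // Credentials are picked up from the environment / instance profile.
    $sqs = new SqsClient(['region' => 'us-east-1', 'version' => '2012-11-05']);

    // Enqueue one task for a map worker to pick up.
    $sqs->sendMessage([
        'QueueUrl'    => 'https://sqs.us-east-1.amazonaws.com/123456789012/mapper-tasks',
        'MessageBody' => json_encode(['user_id' => 42, 'service' => 'twitter']),
    ]);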

One note: I'm not sure that PHP is the right tool for this whole thing. I'd rather think of Python.

At the 20,000 users-per-instance traffic level, though, I think that you'd better take this up with the guys at Facebook, Foursquare etc. At a minimum you might glean some strategies, such as running the connector scripts as independent tasks, each connector sorting its queue based on that service's user IDs to leverage what little data locality there might be, and taking advantage of pipelining to squeeze more bandwidth out of less server load. At the most, they might point you to bulk APIs or different protocols, or buy you for one trillion bucks :-)
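The queue-sorting idea, for instance, is just this (the task fields are assumptions):

    <?php
    // Hypothetical connector queue: sort pending tasks by the external
    // service's user ID so requests for nearby IDs run together and can
    // share a pipelined, keep-alive connection.
    $queue = [
        ['service' => 'twitter', 'service_user_id' => 9123],
        ['service' => 'twitter', 'service_user_id' => 17],
        ['service' => 'twitter', 'service_user_id' => 4501],
    ];

    usort($queue, function ($a, $b) {
        return $a['service_user_id'] - $b['service_user_id'];
    });

    foreach ($queue as $task) {
        // Dispatch the (now ID-ordered) request for $task here.
    }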

See http://php.net/manual/en/function.set-time-limit.php to bypass the 30-second limit.
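For example, at the top of the long-running script:

    <?php
    // Remove PHP's execution time limit for this script (0 = no limit).
    // Note: this has no effect in safe mode, and some shared hosts
    // enforce their own process limits on top of PHP's.
    set_time_limit(0);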

For scheduling jobs in PHP look at:

  1. http://www.phpjobscheduler.co.uk/
  2. http://www.zend.com/en/products/server/zend-server-job-queue

I personally would look at a more robust framework that handles job scheduling (see Grails with Quartz) instead of reinventing the wheel and writing your own job scheduler. Don't forget that you are probably going to need to check on the status of tasks from time to time, so you will need a logging solution around the tasks.
