Scheduled job tasks

Question

Subject:

I'm trying to implement basic job scheduling in Java to handle recurring, persisted scheduled tasks (for a personal learning project). I don't want to use any ready-to-use libraries such as Quartz, Obsidian, or Cron4J.

Objective:

Jobs have to be persistent (to survive a server shutdown)
A single job execution can take up to ~2-5 minutes
Manage a large number of jobs
Multithreaded
Light and fast ;)

All my jobs are stored in a MySQL database.

JOB_TABLE (id, name, nextExecution, lastExecution, status (IDLE, PENDING, RUNNING))

Step by step:

1. Retrieve each job from “JOB_TABLE” where “nextExecution > now” AND “status = IDLE”. This step is executed every 10 minutes by a single thread.
2. For each job retrieved, I put a new thread in a ThreadPoolExecutor, then I update the job status to “PENDING” in my “JOB_TABLE”.
3. When the job thread is running, I update the job status to “RUNNING”.
4. When the job is finished, I update lastExecution with the current time, set a new nextExecution time, and change the job status to “IDLE”.
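The four steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested design: `JobRow` and `JobPoller` are hypothetical names, a `ConcurrentHashMap` stands in for JOB_TABLE, the pool size of 4 is arbitrary, and the poller picks jobs whose nextExecution is at or before "now", on the assumption that this is what the query in step 1 is meant to select. The status is set to PENDING before the job is handed to the pool so that a fast worker cannot have its RUNNING update overwritten.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

enum JobStatus { IDLE, PENDING, RUNNING }

// Hypothetical in-memory mirror of one JOB_TABLE row.
final class JobRow {
    final long id;
    final String name;
    volatile Instant nextExecution;
    volatile Instant lastExecution; // null until the job has run once
    volatile JobStatus status = JobStatus.IDLE;

    JobRow(long id, String name, Instant nextExecution) {
        this.id = id;
        this.name = name;
        this.nextExecution = nextExecution;
    }
}

final class JobPoller {
    // Stand-in for the MySQL table; a real version would issue SQL here.
    private final Map<Long, JobRow> table = new ConcurrentHashMap<>();
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    private final Duration period;

    JobPoller(Duration period) {
        this.period = period;
    }

    void add(JobRow job) {
        table.put(job.id, job);
    }

    JobRow get(long id) {
        return table.get(id);
    }

    // Steps 1 and 2: find due IDLE jobs, mark them PENDING, submit them.
    // Meant to be called from a single polling thread.
    int pollOnce(Instant now) {
        int dispatched = 0;
        for (JobRow job : table.values()) {
            if (job.status == JobStatus.IDLE && !job.nextExecution.isAfter(now)) {
                job.status = JobStatus.PENDING;
                workers.execute(() -> run(job));
                dispatched++;
            }
        }
        return dispatched;
    }

    // Steps 3 and 4: mark RUNNING, do the work, then reschedule and go IDLE.
    private void run(JobRow job) {
        job.status = JobStatus.RUNNING;
        // ... the actual job body would go here ...
        Instant finished = Instant.now();
        job.lastExecution = finished;
        job.nextExecution = finished.plus(period);
        job.status = JobStatus.IDLE;
    }

    // Stops accepting work; returns true if all submitted jobs finished in time.
    boolean shutdownAndWait(long seconds) {
        workers.shutdown();
        try {
            return workers.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The "every 10 minutes by a single thread" part could be driven by a single-threaded `ScheduledExecutorService`, for example `Executors.newSingleThreadScheduledExecutor().scheduleWithFixedDelay(() -> poller.pollOnce(Instant.now()), 0, 10, TimeUnit.MINUTES)`.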

When the server starts, I put each PENDING/RUNNING job in the ThreadPoolExecutor.

Question/Observation:

Step 2: Will the ThreadPoolExecutor handle a large number of threads (~20000)?
Should I use a NoSQL solution instead of MySQL?
Is this the best solution for such a use case?

This is a draft; there is no code behind it. I'm open to suggestions, comments, and criticism!

Answer 1

I have done something similar to your task on a real project, but in .NET. Here is what I can recall regarding your questions:

Step 2: Will the ThreadPoolExecutor handle a large number of threads (~20000)?

We discovered that .NET's built-in thread pool was the worst approach, as the project was a web application. Reason: the web application relies on the built-in thread pool (which is static and thus shared by everything within the running process) to run each request in a separate thread while maintaining effective recycling of threads. Using the same thread pool for our internal processing would have exhausted it and left no free threads for user requests, or degraded their performance, which was unacceptable.

Since you seem to be running quite a lot of jobs (20k is a lot for a single machine), you should definitely look for a custom thread pool. There is no need to write your own, though: I bet there are ready-made solutions, and writing one is far beyond what your study project would require (see the comments), if I understand correctly that you are doing a school or university project.
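In Java, the analogous move would be to give the scheduler its own `ThreadPoolExecutor` instead of sharing one with request handling. The following is only a sketch: `JobPools`, the thread-name prefix, and the sizes used are hypothetical choices, not recommendations. The point it tries to show is that 20,000 jobs do not require 20,000 threads; a small fixed pool with a bounded queue can work through them, and `CallerRunsPolicy` slows the submitter down when the queue is full instead of dropping work.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class JobPools {
    private JobPools() {}

    // Builds a pool reserved for scheduled jobs, separate from any other pool
    // in the process. Both parameters are illustrative knobs.
    static ThreadPoolExecutor newJobPool(int threads, int queueCapacity) {
        AtomicInteger counter = new AtomicInteger();
        // Named threads make the job workers easy to spot in thread dumps.
        ThreadFactory namedThreads =
                r -> new Thread(r, "job-worker-" + counter.incrementAndGet());
        return new ThreadPoolExecutor(
                threads, threads,                           // fixed size: core == max
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),    // bounded backlog
                namedThreads,
                new ThreadPoolExecutor.CallerRunsPolicy()); // full queue: submitter runs the job
    }

    // Stops accepting work; returns true if the backlog drained in time.
    static boolean shutdownAndAwait(ThreadPoolExecutor pool, long seconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With `JobPools.newJobPool(4, 100)`, submitting 20,000 short tasks never creates more than four worker threads; the rest wait in the queue or run on the submitting thread.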

Should I use a NoSQL solution instead of MySQL?

It depends. You obviously need to update the job status concurrently, so you will have simultaneous access to a single table from multiple threads. Databases can scale pretty well to that, assuming you do things right. Here is what I mean by doing it right:

Design your code so that each job affects only its own subset of rows in the database (this includes other tables). If you are able to do so, you will not need any explicit locks at the database level (in the form of transaction isolation levels). You can even use a permissive isolation level that allows dirty or phantom reads, which will perform faster. But beware: you must carefully ensure that no two jobs contend for the same rows. This is hard to achieve in real-life projects, so you should probably look for alternative approaches to db locking.
Use an appropriate transaction isolation level. The isolation level defines the locking behavior at the database level. You can set it to lock the entire table, only the rows you affect, or nothing at all. Use it wisely, as any misuse could affect the data consistency, integrity, and stability of the entire application or db server.
I am not familiar with NoSQL databases, so I can only advise you to research their concurrency capabilities and map them to your scenario. You could end up with a really suitable solution, but you have to check it against your needs. From your description, you will have to support simultaneous data operations over the same type of objects (whatever the analog of a table is).
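For the first point above (each job touching only its own row), one common technique is to "claim" a job with a conditional UPDATE, so that only one thread can move a given row from IDLE to PENDING. The sketch below uses plain JDBC. `JobClaimer` and `tryClaim` are hypothetical names, the table and column names are taken from the JOB_TABLE layout in the question, and the `Connection` is assumed to be supplied by the caller with auto-commit left on.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class JobClaimer {
    // The WHERE clause makes the state transition conditional: if another
    // thread already claimed the job, this statement updates zero rows.
    static final String CLAIM_SQL =
            "UPDATE JOB_TABLE SET status = 'PENDING' WHERE id = ? AND status = 'IDLE'";

    private JobClaimer() {}

    // Returns true only for the caller whose UPDATE actually changed the row.
    static boolean tryClaim(Connection connection, long jobId) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(CLAIM_SQL)) {
            statement.setLong(1, jobId);
            return statement.executeUpdate() == 1;
        }
    }
}
```

A caller would submit the job to the pool only when `tryClaim` returns true. If a specific isolation level is wanted, JDBC exposes it through `Connection.setTransactionIsolation`, for example with `Connection.TRANSACTION_READ_COMMITTED`; which level is appropriate depends on the rest of the workload.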

Is this the best solution for such a use case?

Yes and no.

Yes, as you will encounter one of the difficult tasks developers face in the real world. I have worked with colleagues who had more than three times my experience and who were more reluctant than I was to take on multi-threading tasks; they really hated it. If this area interests you, play with it, learn, and improve as much as you need to.
No, because if you are working on a real-life project, you need something reliable. If you have this many questions, you will obviously need time to mature before you can produce a stable solution for such a task. Multi-threading is a difficult topic for many reasons:
- It is hard to debug.
- It introduces many points of failure, and you need to be aware of all of them.
- It can be a pain for other developers to assist with or work on your code, unless you stick to commonly accepted rules.
- Error handling can be tricky.
- Behavior is unpredictable / nondeterministic.
There are existing solutions with a high level of maturity and reliability that are the preferred approach for real projects. The drawback is that you will have to learn them and examine how customizable they are for your needs.

Anyway, if you need to do it your way and then port the result to a real project, or a project of your own, I can advise you to do this in a pluggable way. Use abstraction, programming to interfaces, and other practices to decouple your specific implementation from the logic that sets up the scheduled jobs. That way, you can adapt your API to an existing solution if this becomes a problem.
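As an illustration of the pluggable approach, here is a small sketch. All names (`RecurringScheduler`, `ExecutorBackedScheduler`) are hypothetical, and the implementation shown is just one possible home-grown backend built on `ScheduledExecutorService`. Code that registers jobs would depend only on the interface, so a library-backed implementation could replace this one later without touching the callers.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical abstraction: job-registration code depends on this interface only.
interface RecurringScheduler {
    void scheduleRecurring(String jobName, Duration interval, Runnable task);

    void stop();
}

// One possible home-grown implementation; the pool size of 2 is arbitrary.
final class ExecutorBackedScheduler implements RecurringScheduler {
    private final ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);

    @Override
    public void scheduleRecurring(String jobName, Duration interval, Runnable task) {
        Runnable guarded = () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs of this task,
                // so it is reported here and the schedule is kept alive.
                System.err.println("Job " + jobName + " failed: " + e);
            }
        };
        executor.scheduleWithFixedDelay(guarded, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public void stop() {
        executor.shutdownNow();
    }
}
```

Callers would hold a `RecurringScheduler` reference and call, for example, `scheduler.scheduleRecurring("cleanup", Duration.ofMinutes(10), task)`, without knowing which implementation sits behind it.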

And last, but not least, I did not see any provision for error handling on your side. Think about and research what to do if a job fails. At least add a 'FAILED' status or something similar to persist in such a case. Error handling is tricky when it comes to threads, so be thorough in your research and practice.
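A minimal sketch of the 'FAILED' idea: wrap each job body so that an exception thrown inside a worker thread is recorded rather than silently lost. The names `JobOutcome` and `FailureTrackingRunner` are hypothetical, and the two maps stand in for columns that would be persisted in the job table.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

enum JobOutcome { IDLE, RUNNING, FAILED }

final class FailureTrackingRunner {
    // Stand-ins for persisted columns (status and last error message).
    private final Map<Long, JobOutcome> statusById = new ConcurrentHashMap<>();
    private final Map<Long, String> lastErrorById = new ConcurrentHashMap<>();

    // Wraps a job body so a failure ends in FAILED instead of a stuck RUNNING row.
    Runnable guarded(long jobId, Runnable body) {
        return () -> {
            statusById.put(jobId, JobOutcome.RUNNING);
            try {
                body.run();
                statusById.put(jobId, JobOutcome.IDLE);
            } catch (RuntimeException e) {
                statusById.put(jobId, JobOutcome.FAILED);
                lastErrorById.put(jobId, String.valueOf(e.getMessage()));
            }
        };
    }

    JobOutcome statusOf(long jobId) {
        return statusById.get(jobId);
    }

    String errorOf(long jobId) {
        return lastErrorById.get(jobId);
    }
}
```

The wrapped `Runnable` is what would be submitted to the thread pool. Whether a FAILED job is retried, skipped, or left for manual inspection is a separate policy decision that this sketch does not cover.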

Good luck

Answer 2

You can declare the maximum pool size with ThreadPoolExecutor#setMaximumPoolSize(int). Since Integer.MAX_VALUE is larger than 20000, technically yes, it can.

The other question is whether your machine would support running so many threads. You will have to provide enough RAM, since each thread allocates its own stack.

There should be no problem handling ~20,000 threads on a modern desktop or laptop, but on a mobile device it could be an issue.

From the docs:

Core and maximum pool sizes

A ThreadPoolExecutor will automatically adjust the pool size (see getPoolSize()) according to the bounds set by corePoolSize (see getCorePoolSize()) and maximumPoolSize (see getMaximumPoolSize()). When a new task is submitted in method execute(java.lang.Runnable), and fewer than corePoolSize threads are running, a new thread is created to handle the request, even if other worker threads are idle. If there are more than corePoolSize but less than maximumPoolSize threads running, a new thread will be created only if the queue is full. By setting corePoolSize and maximumPoolSize the same, you create a fixed-size thread pool. By setting maximumPoolSize to an essentially unbounded value such as Integer.MAX_VALUE, you allow the pool to accommodate an arbitrary number of concurrent tasks. Most typically, core and maximum pool sizes are set only upon construction, but they may also be changed dynamically using setCorePoolSize(int) and setMaximumPoolSize(int).
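The growth rule quoted above can be observed directly. The sketch below (with the hypothetical helper `PoolGrowthDemo`) submits tasks that block until released and records `getPoolSize()` after each submission. It assumes the number of tasks does not exceed maximumPoolSize plus the queue capacity; beyond that, the default handler rejects the task with a `RejectedExecutionException`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolGrowthDemo {
    private PoolGrowthDemo() {}

    // Submits `tasks` blocking tasks to a pool with the given bounds and returns
    // the pool size observed right after each submission.
    static int[] poolSizesWhileSubmitting(int core, int max, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 30, TimeUnit.SECONDS, new ArrayBlockingQueue<>(queueCapacity));
        CountDownLatch release = new CountDownLatch(1);
        int[] sizes = new int[tasks];
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the worker busy until released
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                sizes[i] = pool.getPoolSize();
            }
        } finally {
            release.countDown(); // let all blocked tasks finish
            pool.shutdown();
        }
        return sizes;
    }
}
```

Calling `PoolGrowthDemo.poolSizesWhileSubmitting(2, 4, 2, 6)` returns 1, 2, 2, 2, 3, 4: the first two tasks start core threads, the next two wait in the queue, and only once the queue is full do the last two tasks create threads beyond corePoolSize. A consequence worth noting is that with an unbounded queue the pool never grows past corePoolSize, whatever maximumPoolSize is set to.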

More

About the DB: create a solution that does not depend on the DB structure. Then you can set up two environments and measure them. Start with the technology that you know, but stay open to other solutions. At the beginning, a relational DB should keep up with the performance, and if you manage it properly it should not be an issue later. NoSQL databases are used to work with really big data. But the best option for you is to create both and run some performance tests.

Answer 1 (score 2, accepted): 2014-02-25 11:13:13

Answer 2 (score 1): 2014-02-25 10:50:24
