
Monitoring threads

Original 2018-12-04 09:59:13 7 1 c#/ multithreading

I have an app that spawns multiple threads. Each thread has quite complex logic. Sometimes there is a deadlock or some other problem, and I would like to be informed when a thread is no longer in a "healthy" state.

Options that come to my mind:

Log and monitor the logs
Update a shared in-memory object from each thread (for example, a dictionary with the thread ID as the key and a status structure as the value: OK plus the time of the last update), then observe that dictionary from a separate thread; if a thread hasn't updated its status for X minutes, something went wrong

But I have a feeling that this isn't the best solution. Is there any other good practice for monitoring threads? Maybe the thread itself can update its global state with a timestamp, and the process has access to that information?

1 Answer

A couple of ideas come to mind here, which I'll outline.

1 Make Your Code More Robust

Obviously, the easiest way to keep your threads healthy is to account for all of these unhealthy states. If you're running into deadlocks, use locks and semaphores more appropriately, debug why the deadlocks happen, and find a way to account for these bad scenarios.
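One way to make a suspected deadlock visible instead of letting a thread hang silently is to acquire locks with a timeout. The following is only a minimal sketch: `TimedLock` and `TryRun` are hypothetical names chosen for illustration, and what the caller does when the lock is not acquired (log, report unhealthy, retry) is left open.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the original post.
public static class TimedLock
{
    // Runs `action` while holding `gate`. Returns false instead of blocking
    // forever if the lock cannot be acquired within `timeout`.
    public static bool TryRun(object gate, TimeSpan timeout, Action action)
    {
        bool taken = false;
        try
        {
            Monitor.TryEnter(gate, timeout, ref taken);
            if (!taken)
            {
                return false; // caller can log, report itself unhealthy, or retry
            }

            action();
            return true;
        }
        finally
        {
            // Only release the lock if this call actually acquired it.
            if (taken)
            {
                Monitor.Exit(gate);
            }
        }
    }
}
```

A `false` result does not prove a deadlock; it only says the lock was held longer than the timeout, which is usually a good place to start debugging.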

2 Agents & Error Handlers

This is related to the above. If you can get to the point where you can correctly identify these issues but cannot fix them, you can put a system in place where a thread knows it is in a dangerous state and messages the master thread to say so. The master thread can then shut down the affected thread, spin up a new one from a known "safe state", and try to continue from there. Multi-process based languages like Elixir are very fond of this style of protection.
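A rough sketch of that restart idea in C# might look like the following. It assumes the worker signals a bad state by throwing an exception and that each restart simply calls the worker delegate again from its initial state; `Supervisor`, `Run`, and `maxRestarts` are hypothetical names, not an established API.

```csharp
using System;
using System.Threading;

// Hypothetical supervisor, for illustration only.
public sealed class Supervisor
{
    private readonly Action<CancellationToken> _work; // worker body, started fresh each time
    private readonly int _maxRestarts;

    public int Restarts { get; private set; }

    public Supervisor(Action<CancellationToken> work, int maxRestarts)
    {
        _work = work;
        _maxRestarts = maxRestarts;
    }

    // Runs the worker on its own thread and restarts it if it ends with an
    // exception. Returns true if the worker eventually finished normally.
    public bool Run(CancellationToken stop)
    {
        while (true)
        {
            Exception failure = null;
            var thread = new Thread(() =>
            {
                try { _work(stop); }
                catch (Exception ex) { failure = ex; } // the "message" to the master
            });
            thread.Start();
            thread.Join(); // Join makes the worker's write to `failure` visible here

            if (failure == null) return true;
            if (Restarts >= _maxRestarts || stop.IsCancellationRequested) return false;
            Restarts++; // spin up a new worker from the safe starting state
        }
    }
}
```

Note that this only covers a worker that can still report its own failure. A thread that is truly deadlocked cannot be stopped cooperatively through a `CancellationToken`, so that case needs an external check such as a heartbeat.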

3 Logging/Polling

If there really is no way for you to tell what is going on or how (there always is; it's just sometimes too difficult), then having the thread update a shared resource on a fixed interval is fairly simple to implement. Every minute, have your thread update a float with exactly how long it has been since it last updated it, and have your main thread check this float every few minutes (leave plenty of slack in case your thread is too busy to update it on time). If, say, 5 minutes go by without an update, you can be fairly sure your thread is deadlocked. If you want to be able to see this after the fact, you can also write these update messages to a log file, together with the stack trace at that time, to see where it's getting stuck.
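The polling approach can be sketched roughly as follows. This is one possible shape under stated assumptions rather than a definitive implementation: `HeartbeatMonitor`, `Beat`, and `FindStalled` are hypothetical names, workers are identified by a string name, and a monotonic clock is used instead of a float so that system clock changes don't cause false alarms.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Hypothetical heartbeat registry, for illustration only.
public sealed class HeartbeatMonitor
{
    private readonly ConcurrentDictionary<string, TimeSpan> _lastBeat =
        new ConcurrentDictionary<string, TimeSpan>();
    private readonly TimeSpan _timeout;
    private readonly Func<TimeSpan> _now;

    // `now` is injectable for testing; by default a monotonic clock is used.
    public HeartbeatMonitor(TimeSpan timeout, Func<TimeSpan> now = null)
    {
        _timeout = timeout;
        _now = now ?? (() => TimeSpan.FromSeconds(
            (double)Stopwatch.GetTimestamp() / Stopwatch.Frequency));
    }

    // Called by each worker thread at a regular point in its loop.
    public void Beat(string workerName) => _lastBeat[workerName] = _now();

    // Called periodically by the monitoring thread; returns the workers
    // whose last heartbeat is older than the timeout.
    public IReadOnlyList<string> FindStalled()
    {
        TimeSpan now = _now();
        return _lastBeat
            .Where(kv => now - kv.Value > _timeout)
            .Select(kv => kv.Key)
            .ToList();
    }
}
```

The monitoring thread would call `FindStalled()` on a timer and log or alert on any names it returns. As the answer suggests, the timeout should be several times the heartbeat interval so that a busy but healthy thread is not flagged.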

Conclusion

To conclude: if you can fix an error or bad state, or at least reliably tell when one is occurring, you can write code to account for it. If you can't do that, your next goal should be getting to the point where you can.