
Exception handling best practice in a Windows service?

I am currently writing a Windows service that runs entirely in the background and does some work every day. The service should be very stable: if something goes wrong, it should not stop, but log the exception and try again the next day. Can you suggest any best practices for making truly stable Windows services?

I have read Scott Hanselman's article on exception handling best practices, where he writes that there are only a few cases in which you should swallow an exception. I suspect that a Windows service is one of those few cases, but I would be happy to get some confirmation of that.

'Swallowing' an exception is different from 'abandoning a specific task without stopping the entire process'. In our Windows service, we catch exceptions, log their details, then gracefully degrade that task and wait for the next task. We can then use the log to troubleshoot the error while the server is still running.
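The catch, log, and move-on pattern described in this answer can be sketched roughly as follows. This is a minimal illustration in Python rather than the answerer's actual code. The names `run_tasks`, `daily_service`, and the sample tasks are hypothetical.

```python
import logging

# Hypothetical logger name; a real service would also configure handlers.
logger = logging.getLogger("daily_service")


def run_tasks(tasks):
    """Run each (name, callable) pair in isolation.

    A failing task is logged and abandoned; the loop carries on with the
    next task instead of letting the exception end the whole process.
    Returns the names of the tasks that failed.
    """
    failed = []
    for name, task in tasks:
        try:
            task()
        except Exception:
            # logger.exception records the message plus the full traceback.
            logger.exception("Task %r failed; continuing with the next task", name)
            failed.append(name)
    return failed


def _ok():
    pass


def _broken():
    raise ValueError("bad input file")


failed_names = run_tasks([("cleanup", _ok), ("import", _broken), ("report", _ok)])
```

The exception is not silently discarded here: its details go to the log, and the caller gets back the list of failed task names so it can decide what to do next.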

The question you should be asking is whether your Windows service should be fault tolerant. Remember that any unhandled exception will bring the service down, which makes it immediately unavailable. How do you think your service should behave? Should it try to continue servicing whatever it needs to? Should it be terminated?

In my opinion, you should establish a strong distinction between unrecoverable and recoverable exceptions, i.e., exceptions that prevent the continuation of your service (for example, if your "static" data structures are corrupted) and exceptions that only cause the current operation to fail. To make the distinction clear, you may want to have separate exception class hierarchies.

This distinction should go along with a strong separation between the "supervisor" part of the service (the one that schedules the periodic action) and the part of the service that actually performs the periodic action. In case of a recoverable exception, you could abort the running operation and completely reset this latter part, logging all the details of the exception to the system event log. On the other hand, if you get an unrecoverable error (the supervisor's structures in an inconsistent state and, of course, SEH exceptions), you should just log the error and exit, since continuing to run in an inconsistent state is much more dangerous than not running at all.
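One way to picture the two hierarchies and the supervisor/worker split is the sketch below. It is an illustration under assumptions, not a prescribed design: all class and function names are hypothetical, and Python's `logging` module stands in for the system event log.

```python
import logging

logger = logging.getLogger("daily_service")  # hypothetical logger name


class RecoverableError(Exception):
    """Only the current operation failed; the service itself is intact."""


class UnrecoverableError(Exception):
    """The service's own state can no longer be trusted."""


class InputFileMissing(RecoverableError):
    pass


class SchedulerStateCorrupted(UnrecoverableError):
    pass


def supervise(operations):
    """Run operations in order; return (exit_code, completed_count).

    Exceptions outside both hierarchies are deliberately not caught here,
    so a truly unexpected error still propagates to the caller.
    """
    completed = 0
    for operation in operations:
        try:
            operation()
            completed += 1
        except RecoverableError:
            # Abort just this operation; a real worker would be reset here.
            logger.exception("Operation aborted; moving on to the next one")
        except UnrecoverableError:
            # Log and stop: running on in a bad state is worse than exiting.
            logger.critical("Inconsistent state; exiting", exc_info=True)
            return 1, completed
    return 0, completed


def good():
    pass


def missing():
    raise InputFileMissing("orders.csv")


def corrupted():
    raise SchedulerStateCorrupted("job table")


survived = supervise([good, missing, good])
stopped = supervise([good, corrupted, good])
```

In this sketch `survived` reports that both healthy operations ran despite the recoverable failure between them, while `stopped` shows the supervisor giving up as soon as the unrecoverable error appears.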

Actually, if you have an unexpected exception that is passed all the way to the top level of your service, you should not continue processing; log it and propagate it. If you truly need a "reliable" service, then you'll need a "watchdog" that restarts the original service when it exits.

Note that modern operating systems act as a watchdog, so you don't need a watchdog service in most cases (check out the "Recovery" tab under your service's properties). Historically, critical services would have a second "watchdog" service whose sole purpose is to restart the real service if it fails.

It sounds like your design may be able to make use of the scheduler: just let Windows take care of the "once a day" part and have your service do the task a single time. If it fails, fine; Windows is responsible for starting it again the next day.

One final note: this level of reliability in a service is rarely needed. In commercial code, I've only seen it used in a couple of antivirus programs and a network filtering program (that had to be running or else all network communication would fail). I've done a couple of "watchdog" programs myself, but these were for customers like auto companies who would lose tons of money when their assembly line systems went down. In addition to the software watchdog, these systems also had redundant power supplies, RAIDed hot-swappable hard drives, and a complete duplicate of the entire system for use as an automatic failover.

Just saying: you may want to reconsider how much you really need to increase reliability (keeping in mind that 100% reliability is impossible; it can only be approached, at exponential cost).

A service should never stop. There are two classes of errors: errors in the service itself, and errors in data provided to the service. Data errors should be reported, not ignored. These two goals can be accomplished by having the service log errors, by providing a way to transmit error information to the user, and by having the service retry the failed operation after the user (or the programmer, in the case of an error in the service) has corrected what caused the failure (obviously, the service will have to be stopped, re-installed, and re-started if a program error is corrected).

Like so many things in software development, rarely does "one size fit all". If you deem it appropriate to swallow the exception with the intention of retrying at a later date, then that's perfectly reasonable. What really matters is that you clean up after yourself, log, and determine a reasonable retry policy before notifying someone.

The Exception Handling Application Block of the Enterprise Library may prove useful, as you can modify your exception policy within config without changing the code.

Swallowing exceptions is rarely a good idea and, as Scott says in his article, there really are only a few valid cases where it might be the best option.

My advice would be, firstly, to know what exceptions you're catching and catch them specifically. It'll be more useful to you in the future if you know what you're catching rather than using the generic (Exception e).

Once you've caught the exception then, as you stated above, write it to a logging service, perhaps email the details to the maintainer of the code, or even fire off another event that sets up a retry of the code, with a limit on the number of attempts before a new message is issued to the code maintainer.
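A bounded retry with a notification at the end might look something like the sketch below. It is only an illustration: `run_with_retries`, the `notify` callback (standing in for an email or similar alert), and the flaky sample operation are all hypothetical. A real service would normally also wait between attempts, which is left out here.

```python
import logging

logger = logging.getLogger("daily_service")  # hypothetical logger name


def run_with_retries(operation, notify, max_attempts=3):
    """Call operation() up to max_attempts times.

    Every failure is logged. If all attempts fail, notify() is called once
    with a summary message and None is returned.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            logger.exception("Attempt %d of %d failed", attempt, max_attempts)
            last_error = exc
    # Retry budget exhausted: tell the maintainer once, not once per attempt.
    notify(f"Gave up after {max_attempts} attempts: {last_error!r}")
    return None


calls = {"count": 0}


def flaky():
    """Fails twice, then succeeds (simulates a transient outage)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("database unavailable")
    return "done"


messages = []
result = run_with_retries(flaky, messages.append)
```

Capping the attempts keeps a permanently broken dependency from turning into an endless retry loop, and the maintainer hears about it once the budget runs out.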

By catching specific exceptions you can do specific things about them. You can also catch the general exception to ensure that exceptions you really didn't expect don't cause a complete system failure.
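As a rough illustration of specific handlers backed by a general catch-all, consider the sketch below. The function name, the file format (one integer per line), and the status strings are assumptions made up for this example.

```python
import logging

logger = logging.getLogger("daily_service")  # hypothetical logger name


def process_daily_file(path):
    """Sum the integers in a text file, one per line.

    Returns a (status, total) pair. Each expected failure gets its own
    handler; the final handler catches whatever was not anticipated.
    """
    try:
        with open(path, encoding="utf-8") as handle:
            return "ok", sum(int(line) for line in handle if line.strip())
    except FileNotFoundError:
        # Expected on days with no input: not really an error.
        logger.warning("No input at %s today; nothing to do", path)
        return "skipped", None
    except ValueError:
        # Expected when the file holds something that is not a number.
        logger.error("Malformed number in %s", path, exc_info=True)
        return "bad-data", None
    except Exception:
        # Unexpected: log in full so a specific handler can be added later.
        logger.exception("Unexpected failure while processing %s", path)
        return "unexpected", None


status = process_daily_file("no-such-file-for-this-example.txt")
```

The order matters: the specific handlers must come before the general one, otherwise the catch-all would intercept everything first.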

Once you know about exceptions you weren't aware of before, better handling for them can be built into the next release.
