Looking for ideas on debugging a tricky Windows service startup gremlin

Question

In the last few months I've received a few reports from QA about one of our services hanging. Every time I examined a hang dump in WinDbg I found the same thing: the loader lock critical section is locked, but the owning thread is nowhere to be found. Since the thread is gone and the only trace I can see is the global critical section it left behind, I can't tell what code ran on that thread, or even which DLL it came from. It may not even be one of ours (i.e. it could belong to a third-party vendor).

The issue is very sporadic: I've seen it occur naturally in the wild maybe 3-4 times over the last 6 months. The rest of the time the service runs perfectly, which makes me believe it's some kind of timing issue or race condition.

Recently I decided to take it upon myself to figure this one out. I set up a machine with a WinTask script that constantly starts and stops the service. The good news is that I can reproduce the problem within 5-6 hours.

Now for the next part: how do I isolate it?

This is what I've tried so far:

Used the "Debugger" field in the gflags image settings to automatically run the service under cdb whenever it starts. This has been running for two days without a hang, so I think the debugger introduced just enough of a timing change to hide the issue.
Downloaded Application Verifier and configured the process to run under it. That turned up a completely unrelated bug: we create a temporary CComBSTR, assign it to a VARIANT, and pass the VARIANT into a function call even though the CComBSTR has long since freed the allocated string by that point. I don't believe this bug is related, because the string is read-only and the thread it runs on isn't the one that's dying.

I'm posting this in case you can think of something I'm not considering.

I thought there was a Windows utility that artificially puts load on the CPU and does other things to make race conditions surface, and I thought Application Verifier did that, but apparently it doesn't. Does anyone know what I'm talking about, or did I just dream that up?

Unless something happens over the weekend, my next step is to disable all debuggers, go back to the stock configuration, and hack one of the DllMains to record THREAD_ATTACH/THREAD_DETACH events. At least I'll be able to intercept the dying thread when it gets created. That might shed some light.
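For the DllMain instrumentation idea, a minimal sketch could look like the following. It assumes you can rebuild one of your own DLLs and that debug output is being captured somehow (a debugger or a debug-output viewer). The `FormatThreadEvent` helper and the `[gremlin]` tag are made up for illustration. One thing worth knowing: a thread killed with TerminateThread() never delivers DLL_THREAD_DETACH, so a thread ID that logged an attach, never logged a detach, and is no longer in the thread list is a candidate for the one that vanished. DllMain runs under the loader lock, so the handler does as little as possible.

```cpp
// Sketch only: log thread attach/detach notifications from one DLL.
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: formats one log line into a caller-supplied buffer
// (no heap allocation, to keep work under the loader lock minimal).
// Returns whatever snprintf returns.
int FormatThreadEvent(char* buf, std::size_t size,
                      const char* event, unsigned long tid)
{
    assert(buf != nullptr && size > 0);
    return std::snprintf(buf, size, "[gremlin] %s tid=%08lX\n", event, tid);
}

#ifdef _WIN32
BOOL WINAPI DllMain(HINSTANCE, DWORD reason, LPVOID)
{
    char line[64];
    switch (reason) {
    case DLL_THREAD_ATTACH:
        // Called on the new thread itself, so this is its own ID.
        FormatThreadEvent(line, sizeof(line), "THREAD_ATTACH",
                          GetCurrentThreadId());
        OutputDebugStringA(line);
        break;
    case DLL_THREAD_DETACH:
        // Not delivered for threads killed with TerminateThread().
        FormatThreadEvent(line, sizeof(line), "THREAD_DETACH",
                          GetCurrentThreadId());
        OutputDebugStringA(line);
        break;
    default:
        break;
    }
    return TRUE;
}
#endif
```

Thread notifications only arrive for threads created after the DLL is loaded, and only if the DLL has not called DisableThreadLibraryCalls().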

Answer 1

Something I might try is attaching a kernel debugger and then running the process under Application Verifier. AV has checks for unloading a DLL while it holds a critical section and for terminating threads that still hold a critical section. Those breaks should trigger in the kernel debugger, and then hopefully you can catch it in the act. Running under KD hopefully won't slow the process down the way the user-mode debugger does.

Answer 2

So it turns out I was closer to the solution than I realized. The service was already running under cdb, which altered the timing, and with Application Verifier, which altered it even more (enabling page heap makes allocations slower). The secret ingredient I was missing was prime95.exe. Running prime95.exe at above-normal priority really screwed up whatever timing I was trying not to change, but it made the problem show up in under 15 minutes.
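If prime95 isn't handy, a crude stand-in for the same trick is a few busy-looping threads at raised priority. This is only a sketch of the idea of starving the service of CPU, not what prime95 actually does; `BurnCpu` and its parameters are hypothetical names.

```cpp
// Sketch only: saturate the CPU for a fixed time to perturb thread timing.
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#endif

// Spins `workers` busy threads for `duration` and returns the total number
// of loop iterations they completed.
std::uint64_t BurnCpu(unsigned workers, std::chrono::milliseconds duration)
{
    assert(workers > 0);
#ifdef _WIN32
    // Mirrors "above normal priority"; a failure here is simply ignored.
    SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS);
#endif
    std::atomic<bool> stop{false};
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&stop, &total] {
            std::uint64_t local = 0;
            // The atomic load keeps the loop from being optimized away.
            while (!stop.load(std::memory_order_relaxed)) {
                ++local;
            }
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    std::this_thread::sleep_for(duration);
    stop.store(true);
    for (auto& t : pool) {
        t.join();
    }
    return total.load();
}
```

Calling it with `std::thread::hardware_concurrency()` workers and a long duration while the start/stop script runs would approximate the setup described above.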

The cause:

A third-party SDK for acquiring data from hardware boards. When our service starts up, we query the different capture components for their capabilities. After the query is done, we release the component instance. Apparently this one DLL started a separate thread, which acquired the loader lock and then proceeded to do a bunch of initialization. If our capability query finished during that time and we released the component, their code would call TerminateThread() on this other thread, leaving the loader lock permanently locked. Prime95 slowed everything down just enough for me to catch this race condition and get the following verifier stop message:

=======================================
VERIFIER STOP 00000200: pid 0x1A8C: Thread cannot own a critical section. 

0000091C : Thread ID.
77E17340 : Critical section address.
00000000 : Critical section debug information address.
00000000 : Critical section initialization stack trace.

The funny part is that this thread was "disappearing" without an exception of any kind, so the debugger wouldn't even catch a first-chance anything. Who uses TerminateThread?
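For contrast, here is a rough sketch of the cooperative alternative to TerminateThread(): signal the worker to stop and join it, so that any lock it holds is released by normal unwinding. The SDK's internals are unknown, so `InitWorker` is a hypothetical stand-in, and a plain `std::mutex` plays the role of the loader lock.

```cpp
// Sketch only: stop a worker cooperatively so it never dies holding a lock.
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

class InitWorker {
public:
    explicit InitWorker(std::mutex& shared_lock)
        : lock_(shared_lock), thread_([this] { Run(); }) {}

    ~InitWorker() { Stop(); }

    // Asks the worker to finish and waits for it; safe to call twice.
    void Stop()
    {
        stop_.store(true);
        if (thread_.joinable()) {
            thread_.join();
        }
    }

    int steps_done() const { return steps_.load(); }

private:
    void Run()
    {
        while (!stop_.load()) {
            // The lock is held only inside this scope and is always
            // released before the stop flag is checked again.
            std::lock_guard<std::mutex> hold(lock_);
            steps_.fetch_add(1);
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    std::mutex& lock_;
    std::atomic<bool> stop_{false};
    std::atomic<int> steps_{0};
    std::thread thread_;  // declared last so the flags exist before it starts
};
```

The trade-off is that releasing the component has to wait for the current initialization step to finish instead of returning immediately.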

Thank you, everyone, for the suggestions and support. I was actually starting to look forward to driving to RadioShack during lunch to buy a serial cable and then spending a few days playing with KD. Looks like that will have to wait till next time :)

Answer 3

Some random ideas: if attaching a debugger doesn't help, then instrumentation (your last point) is the next step. But how can a thread just die without bringing down the whole process? Are you catching exceptions somewhere? You might want to log there as well. You can also set WinDbg to break on all first-chance exceptions, if that helps. The WinDbg output window shows first-chance exceptions anyway, even if you don't break.

Answer 4

I would try a non-invasive debugger and see how that goes. While you won't be able to stop the process, you should be able to see any debugging messages as well as any threads that start and stop, and it should have minimal impact on process performance. I usually use WinDbg for my debugging, but I think cdb has similar options. This will most likely let you see what's happening in the process and at least help you start narrowing it down. One thing you might want to do is redirect the output to a log file (.logopen in WinDbg) so that nothing scrolls out of the buffer.

4 answers

Answer 1: 2 (accepted) 2012-01-15 16:05:39
Answer 2: 1 2012-01-17 02:43:43
Answer 3: 0 2012-01-15 17:13:26
Answer 4: 0 2012-01-16 21:04:39
