
Debug High CPU Usage

I have a managed application that uses the UCMA (Unified Communications Managed API) 4.0 SDK. I am trying to debug an issue where the application uses 100% of the CPU and the system hangs. I have used the SOS extensions to try to find the root cause, but I am currently stuck. I have managed to find the thread IDs that are taking up CPU time, but they are mostly unmanaged threads. I really need help with this.

Threads 15, 18, 16, 17, 19, 20 are all unmanaged threads and have the same call stack. Threads 9, 10, 11, 12, 13, 14 are all unmanaged threads and share a second call stack. Another question: threads 21 and 22 appear to be waiting on an event, so why are they listed as runaway threads consuming CPU time?

Does anyone know what ZwRemoveIoCompletionEx is doing? Is it dormant in the same way as NtWaitForMultipleObjects, or could it be chewing up the CPU time? In the case of this application once it spikes to 100% it never goes back down until the application is restarted.

0:000> !loadby sos clr
0:009> .time
Debug session time: Wed May 27 15:47:52.000 2015 (UTC - 4:00)
System Uptime: 31 days 1:05:59.329
Process Uptime: 31 days 1:01:27.000
  Kernel time: 0 days 21:44:58.000
  User time: 1 days 16:51:40.000
0:000> !runaway
 User Mode Time
  Thread       Time
  15:113c      0 days 3:46:30.510
  18:1418      0 days 3:18:07.135
  16:1404      0 days 3:08:01.009
  17:140c      0 days 3:07:19.310
  19:1428      0 days 3:04:56.943
  20:1434      0 days 2:52:51.664
  22:1450      0 days 0:47:50.153
   9:11dc      0 days 0:45:02.904
  21:1440      0 days 0:43:34.623
  12:13cc      0 days 0:33:35.298
  11:1250      0 days 0:32:50.386
  14:fbc       0 days 0:31:57.018
  10:1178      0 days 0:29:12.920
  13:13c4      0 days 0:28:42.048
   2:fa8       0 days 0:03:11.678
   4:1164      0 days 0:02:45.080

0:015> kb
RetAddr           : Args to Child                                                           : Call Site
000007fe`fd36546f : 00000000`272946f0 000007fe`e5394b29 00000000`27295b18 00000000`27295b18 : ntdll!ZwRemoveIoCompletionEx+0xa
00000000`7700c089 : 00000000`1c4981e0 00000000`00000001 00000000`00000001 00000000`00000000 : KERNELBASE!GetQueuedCompletionStatusEx+0xdf
000007fe`e51b634b : 00000000`000009b0 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!GetQueuedCompletionStatusExStub+0x19
000007fe`e538fc0b : 00000000`1c4981e0 00000000`1c4981e0 000007fe`e5905340 00000000`00000000 : rtmpal!RtcPalTaskQueueDequeue+0x17
000007fe`e538f960 : 00000000`1f55fcf0 00000000`00000000 00000000`1db59eb0 00000000`1f55fcf0 : Microsoft_Rtc_Internal_Media!CStreamingEngineImpl::EngineWorkerThread+0x267
000007fe`e51b33c8 : 00000000`00000000 00000000`1c40a6a0 00000000`1c4a4f80 00000000`00000000 : Microsoft_Rtc_Internal_Media!CStreamingEngineImpl::EngineWorkerThreadProc+0xf0
000007fe`f22a3d67 : 00000000`00000000 00000000`1c40a6a0 00000000`00000000 00000000`00000000 : rtmpal!RtcPalSetSchedulerPolicy+0x194
000007fe`f22a3f0e : 000007fe`f233cdb0 00000000`00000000 00000000`00000000 00000000`00000000 : msvcr110!beginthreadex+0x107
00000000`76fd652d : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : msvcr110!endthreadex+0x192
00000000`7720c541 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d

0:022> !clrstack
OS Thread Id: 0x1450 (22)
        Child SP               IP Call Site
000000001fdcda68 000000007723186a [HelperMethodFrame_1OBJ: 000000001fdcda68] System.Threading.WaitHandle.WaitMultiple(System.Threading.WaitHandle[], Int32, Boolean, Boolean)
000000001fdcdba0 000007fee968c64c System.Threading.WaitHandle.WaitAny(System.Threading.WaitHandle[], Int32, Boolean)
000000001fdcdc00 000007fe8e097a70 Microsoft.Rtc.Internal.Media.RtpEventHandlerThread.EventHandlerThreadProc()
000000001fdce8d0 000007fee973d0b5 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000001fdcea30 000007fee973ce19 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000001fdcea60 000007fee973cdd7 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
000000001fdceab0 000007fee96b0301 System.Threading.ThreadHelper.ThreadStart()
000000001fdcedc8 000007feed44ffe3 [GCFrame: 000000001fdcedc8] 
000000001fdcf0f8 000007feed44ffe3 [DebuggerU2MCatchHandlerFrame: 000000001fdcf0f8]

0:021> kb
RetAddr           : Args to Child                                                           : Call Site
000007fe`fd331430 : 00000000`00190398 00000000`771f3a92 00000000`c0000008 00000000`00000110 : ntdll!NtWaitForMultipleObjects+0xa
00000000`76fd1220 : 00000000`1edefc18 00000000`1edefc00 00000000`00000000 00000000`00da7a64 : KERNELBASE!WaitForMultipleObjectsEx+0xe8
000007fe`e53bc322 : 00000000`0000cae8 00816179`f67cb320 00000000`1c497eb0 00000000`1edefce0 : kernel32!WaitForMultipleObjects+0xb0
000007fe`e51b33c8 : 00000000`00000000 00000000`00000000 00000000`1dad4630 00000000`1c4a5160 : Microsoft_Rtc_Internal_Media!CStreamingEngineImpl::TimerThreadProc+0x37e
000007fe`f22a3d67 : 00000000`00000000 00000000`1dad4630 00000000`00000000 00000000`00000000 : rtmpal!RtcPalSetSchedulerPolicy+0x194
000007fe`f22a3f0e : 000007fe`f233cdb0 00000000`00000000 00000000`00000000 00000000`00000000 : msvcr110!beginthreadex+0x107
00000000`76fd652d : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : msvcr110!endthreadex+0x192
00000000`7720c541 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d

0:013> kb
RetAddr           : Args to Child                                                           : Call Site
000007fe`fd36546f : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!ZwRemoveIoCompletionEx+0xa
00000000`7700c089 : 00000000`00000000 00000000`000000b7 00000000`00000001 00000000`1c4a4a40 : KERNELBASE!GetQueuedCompletionStatusEx+0xdf
000007fe`e51c0fef : 000007fe`e5905340 000007fe`e53eb764 00000000`00000000 00000000`1dac2ab0 : kernel32!GetQueuedCompletionStatusExStub+0x19
000007fe`e53eaf4b : 000007fe`e5905340 00000000`35bdd608 00000000`00000001 00000000`1f17fc20 : rtmpal!RtcPalIOCP::GetQueuedCompletionStatus+0x18f
000007fe`e53eac6d : 00000000`00000510 00000000`0000dddd 00000000`1dad9fe0 00000000`1f17fc80 : Microsoft_Rtc_Internal_Media!CTransportManagerImpl::TransportWorkerThread+0xe7
000007fe`e51b33c8 : 00000000`00000000 00000000`1c409460 00000000`1c4a4e40 00000000`00000000 : Microsoft_Rtc_Internal_Media!CTransportManagerImpl::TransportWorkerThreadProc+0x13d
000007fe`f22a3d67 : 00000000`00000000 00000000`1c409460 00000000`00000000 00000000`00000000 : rtmpal!RtcPalSetSchedulerPolicy+0x194
000007fe`f22a3f0e : 000007fe`f233cdb0 00000000`00000000 00000000`00000000 00000000`00000000 : msvcr110!beginthreadex+0x107
00000000`76fd652d : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : msvcr110!endthreadex+0x192
00000000`7720c541 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d

Debugging performance issues with WinDbg requires several dumps. One dump is only a snapshot in time and doesn't give you a complete picture.

Right now (at the time the dump was taken), the threads may all be doing nothing. Are you sure it used 100% CPU when you took the dump? Or did it recover from 100% just a millisecond before the dump was taken?

The values displayed by !runaway are accumulated values over the whole lifetime of the program. This just tells you that the thread has worked a lot in the past. It doesn't tell you what it is doing now or will be doing in the future.
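Because the !runaway values are cumulative, one way to see which threads are burning CPU right now is to compare the output from two dumps taken a few seconds apart. The following is only a rough sketch of that idea: the helper names are made up for illustration, the regular expression assumes the exact line layout shown in the question, and the second snapshot is invented sample data.

```python
import re
from datetime import timedelta

# Matches !runaway lines such as "  15:113c      0 days 3:46:30.510".
RUNAWAY_LINE = re.compile(
    r"^\s*(\d+):([0-9a-fA-F]+)\s+(\d+) days (\d+):(\d{2}):(\d{2})\.(\d{3})\s*$"
)


def parse_runaway(text):
    """Return {OS thread id (hex string): accumulated user-mode time}."""
    times = {}
    for line in text.splitlines():
        match = RUNAWAY_LINE.match(line)
        if not match:
            continue  # skip headers and anything that is not a thread line
        _number, os_id, days, hours, minutes, seconds, millis = match.groups()
        times[os_id.lower()] = timedelta(
            days=int(days),
            hours=int(hours),
            minutes=int(minutes),
            seconds=int(seconds),
            milliseconds=int(millis),
        )
    return times


def busiest_between(first, second):
    """Sort threads by CPU time used between two snapshots, largest first."""
    before = parse_runaway(first)
    after = parse_runaway(second)
    deltas = {tid: after[tid] - before.get(tid, timedelta()) for tid in after}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Two lines taken from the question's !runaway output.
FIRST = """
  15:113c      0 days 3:46:30.510
  21:1440      0 days 0:43:34.623
"""

# Hypothetical second snapshot (invented values, not from a real dump).
SECOND = """
  15:113c      0 days 3:46:40.510
  21:1440      0 days 0:43:34.623
"""

for thread_id, used in busiest_between(FIRST, SECOND):
    print(thread_id, used)
```

Threads are keyed by OS thread ID rather than by debugger thread number, since the debugger may number threads differently from one dump to the next. A thread whose delta is zero did no user-mode work between the two snapshots, however large its accumulated total is.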

Mark Russinovich and a few other experts have diagnosed problems this way, but it is not something for beginners.

Since you need many dumps to get a complete picture, use other tools to analyze performance issues. The typical tools are profilers, e.g. Redgate's ANTS profiler or JetBrains' dotTrace.

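If you really want to go this hard way, at least use ProcDump (SysInternals) with the -ma (full memory dump), -c (CPU usage threshold), -s (consecutive seconds above the threshold) and -n (number of dumps to write) options, so that the dumps are collected while the CPU is actually high.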
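As a small illustration of how those ProcDump options fit together, here is a sketch that builds the command line and launches it. The helper names, the threshold of 90%, the 5 seconds, the 3 dumps and the process ID are all placeholder assumptions, and it presumes procdump.exe is on the PATH.

```python
import subprocess


def procdump_args(pid, cpu_percent=90, seconds=5, count=3, exe="procdump.exe"):
    """Build a ProcDump command line for high-CPU dumps (values are examples)."""
    return [
        exe,
        "-ma",                   # full dump including all process memory
        "-c", str(cpu_percent),  # CPU threshold that triggers a dump
        "-s", str(seconds),      # consecutive seconds above the threshold
        "-n", str(count),        # number of dumps to write before exiting
        str(pid),
    ]


def collect_dumps(pid):
    """Run ProcDump and wait until it has written its dumps or was stopped."""
    return subprocess.run(procdump_args(pid), check=False)


# 1234 is a placeholder process ID.
print(" ".join(procdump_args(1234)))
```

Several dumps written a few seconds apart during the spike can then be compared thread by thread, which is the part that a single dump cannot give you.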

Try Process Explorer and ProcMon from SysInternals. Although they may not give you the answer right away, they provide a lot of context around the process and may help shed some light on what the application is doing. In ProcMon, just set a filter on the ProcessName you're interested in. In Process Explorer, find the process, then right click -> Properties. You'll see things like threads, TCP connections and a lot more.

Everything went just fine. There's an issue, you used WinDbg to get a stack trace, there were some suggestions, and then, wham, the world changed when you uttered the magical words "production environment".

Regarding the rest of your post, I notice this:

In the case of this application once it spikes to 100% it never goes back down until the application is restarted.

Basically this means it's broken, and it's broken badly.

I also observe that you have a multithreaded application, which makes it even harder to figure out where things go wrong.

Things you shouldn't do in a production environment

Well, basically the list is:

  • Debugging
  • Profiling
  • Running unit tests
  • ... the list goes on and on.

If you have a 100% CPU spike that never goes down, it might be worth looking at the rest of the software development process as well. Do you have automated (functional) tests? Do you use code coverage to check what you test? In short: do you believe you have a stable environment?

For now, this doesn't help you a bit. That said, after fixing the bug, I think it's important to think about these questions for the long run. I'm not sure if this applies to your case - but in my experience the fact that you have a bug like this usually implies that you still have some hard work to do.

It's already broken, so it's impossible to break it

First things first, let's fix it. Let's face the brutal facts. It's already broken, so if we temporarily break it, that's just fine.

The problem is 100% CPU. What you need to know is where in the code that CPU usage is. A profiler is the tool best suited for that.

Get yourself a professional performance profiler like Red Gate ANTS. Install it and start it. You also need to put the PDB files and the source code (in the same folder structure as on your dev machine) on your production server, and I would probably put the debug DLLs there as well. It's all just temporary: after we've found our bug, these things should all disappear again. As I said, you don't want this on a production environment.

Don't be gentle on your production application; just use 'line level' or 'method level' profiling on your source code only. It won't break your application, it'll just make it slower. You can also 'pause' the profiler, which basically means no more sampling data is collected. It is a good idea to keep it paused until the bug arises.

When the bug emerges, resume the profiler and capture some data.

Debuggers

In my experience, debugging and 'stack tracing' applications sometimes breaks them. I'm not entirely sure when this happens, but in said production environment you should restart the application you were debugging once you have finished.

One possible stack tracing application is Process Explorer from SysInternals. Run it as administrator, double click on the process, go to the 'Threads' tab and click 'Stack' (or double click) for each thread until you find something funny.

If you spot a Wait or Sleep, it's probably just fine.
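The same rule of thumb can be applied to the kb output in the question: a thread whose innermost frame is a kernel wait was blocked at the moment of the snapshot. The sketch below is one rough way to check that. The function names and the list of wait calls are assumptions for illustration, the list is not exhaustive, and the "busy" stack is invented. ZwRemoveIoCompletionEx is on the list because it is the system call behind GetQueuedCompletionStatusEx, which normally blocks until a completion packet arrives or the timeout expires.

```python
# System calls that usually mean "blocked, not running" (not exhaustive).
WAIT_CALLS = {
    "NtWaitForMultipleObjects",
    "NtWaitForSingleObject",
    "ZwRemoveIoCompletionEx",
    "NtRemoveIoCompletion",
    "NtDelayExecution",
}


def top_frame(kb_output):
    """Return the innermost 'module!function' symbol from kb output, or None."""
    for line in kb_output.splitlines():
        # Frame lines look like "RetAddr : args : module!function+0xNN".
        if line.startswith("RetAddr") or " : " not in line or "!" not in line:
            continue
        call_site = line.rsplit(" : ", 1)[1].strip()
        return call_site.split("+")[0]
    return None


def looks_idle(kb_output):
    """True if the innermost frame is one of the known wait calls."""
    frame = top_frame(kb_output)
    if frame is None:
        return False
    return frame.split("!", 1)[1] in WAIT_CALLS


# First two frames of thread 15 from the question.
WAITING = """RetAddr           : Args to Child : Call Site
000007fe`fd36546f : 00000000`272946f0 000007fe`e5394b29 00000000`27295b18 00000000`27295b18 : ntdll!ZwRemoveIoCompletionEx+0xa
00000000`7700c089 : 00000000`1c4981e0 00000000`00000001 00000000`00000001 00000000`00000000 : KERNELBASE!GetQueuedCompletionStatusEx+0xdf
"""

# Hypothetical busy thread (module and function names are invented).
BUSY = """RetAddr           : Args to Child : Call Site
00000000`7700c089 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : mymodule!SpinLoop+0x10
"""

print(top_frame(WAITING), looks_idle(WAITING))
print(top_frame(BUSY), looks_idle(BUSY))
```

A wait at the top of the stack only says the thread was idle in that one snapshot. A thread that is woken constantly, or that polls with a zero timeout, can show the same frame and still use a lot of CPU, which is why several snapshots are needed.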

Good luck!
