
IIS app pool crashing on Azure load-balanced VMs

We have a new ASP.NET website running on a pair of load-balanced Azure VMs. The website is fairly simple and uses Kentico CMS. Twice in the 24 hours since going live, the application pool on both web servers has suddenly stopped (within 5-10 minutes of each other), causing 503: Service unavailable errors.

Looking at Windows system logs I see the error which caused the problem:

Application pool '[[NAME]]' is being automatically disabled due to a series of failures in the process(es) serving that application pool.

Leading up to this is a series of warnings:

A process serving application pool '[[NAME]]' suffered a fatal communication error with the Windows Process Activation Service. The process id was '[[PROCESS ID]]'. The data field contains the error number.

Evidently this is IIS's rapid-fail protection kicking in. What's not clear is how to find the cause of this "fatal communication error".
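As a rough illustration of what rapid-fail protection does, here is a simplified model: the pool is disabled once enough worker-process failures land inside one time window. The values of 5 failures and 5 minutes are the usual IIS defaults as far as I know; the function name and the exact boundary behaviour are assumptions made for this sketch, not IIS's actual implementation.

```python
from datetime import datetime, timedelta

# Assumed defaults: 5 failures within a 5-minute interval.
MAX_FAILURES = 5
INTERVAL = timedelta(minutes=5)


def pool_disabled(crash_times, max_failures=MAX_FAILURES, interval=INTERVAL):
    """Return True if any window of length `interval` holds `max_failures` crashes.

    Hypothetical helper: a simplified model of rapid-fail protection,
    not how IIS is actually implemented.
    """
    times = sorted(crash_times)
    start = 0
    for end, t in enumerate(times):
        # Move the window start forward until it is within `interval` of t.
        while t - times[start] > interval:
            start += 1
        if end - start + 1 >= max_failures:
            return True
    return False
```

Feeding it the timestamps of the "fatal communication error" warnings from the System log shows whether they were dense enough to trip the threshold.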

After some web searching I installed the Debug Diagnostics Tool, which helped me identify that in every case the relevant process was the IIS worker process (w3wp.exe). This tool is new to me, and unfortunately the only time the problem has occurred since I installed it, no dumps were generated. However, its logs contain a lot of messages like this:

First chance exception - 0xe0434352 caused by thread with System ID: [[ID]]

The frustrating thing is that I don't know what steps to take to replicate the error conditions. It never occurred in UAT in a very similar environment, even under load test. Here are some facts about my setup:

  • ASP.NET version = 4.5.2
  • Application pool running with identity set to a domain account with modify permission on the website directory
  • Application set with max one worker process

Any advice much appreciated.

* UPDATE 1 *

I now have a DebugDiag dump generated by the "fatal communication error" warning event. The dump summary reads:

Dump Summary
------------
Process Name:   w3wp.exe : C:\Windows\SysWOW64\inetsrv\w3wp.exe
Process Architecture:   x86
Exception Code: 0xC00000FD
Exception Information:  The thread used up its stack.
Heap Information:   Present

In the end I tracked this down to a bug in my code. In a rare edge case the CMS was returning an empty Guid instead of an actual ID, which was causing a stack overflow in a recursive method.
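The original method isn't shown, so the following is only a hypothetical sketch of the kind of guard that stops this class of bug: a recursive walk up a parent chain that rejects an empty ID and refuses to visit the same ID twice. The names `path_to_root`, `parents` and `EMPTY_GUID` are made up for illustration, and Python's `uuid.UUID(int=0)` stands in for .NET's `Guid.Empty`.

```python
import uuid

# Stand-in for .NET's Guid.Empty (all zeros).
EMPTY_GUID = uuid.UUID(int=0)


def path_to_root(node_id, parents, _seen=None):
    """Return the chain of IDs from node_id up to the root.

    `parents` maps a child ID to its parent ID; a root maps to None.
    An ID missing from `parents` is also treated as a root.
    Hypothetical example, not the code from the original site.
    """
    seen = set() if _seen is None else _seen
    if node_id is None:
        return []
    if node_id == EMPTY_GUID:
        # Fail fast instead of recursing on a placeholder ID.
        raise ValueError("empty ID encountered while walking the tree")
    if node_id in seen:
        # A repeated ID means the chain loops back on itself.
        raise ValueError(f"cycle detected at {node_id}")
    seen.add(node_id)
    return [node_id] + path_to_root(parents.get(node_id), parents, seen)
```

With a guard like this the bad data surfaces as an ordinary exception that can be caught and logged, rather than a stack overflow that takes down w3wp.exe.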

The 0xC00000FD exception code I posted above is the code for a stack overflow, so once I knew that and downloaded the Debug Diagnostics dump file I was able to replicate the crash scenario locally. That tool, by the way, is incredibly powerful and was able to show the exact conditions of the crash.
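For anyone reading similar logs, here is a small lookup sketch for the codes that appear in this thread. The table is deliberately tiny and the helper name `describe` is made up; 0xE0434352 is the code the .NET 4+ runtime uses when a managed exception is thrown, which is why it shows up so often as a first chance exception without necessarily being the cause of the crash.

```python
# Minimal, non-exhaustive table of Windows exception codes.
KNOWN_CODES = {
    0xC00000FD: "STATUS_STACK_OVERFLOW: the thread used up its stack",
    0xC0000005: "STATUS_ACCESS_VIOLATION: invalid memory access",
    0xE0434352: "CLR exception: a managed (.NET 4+) exception was thrown",
}


def describe(code):
    """Describe an exception code given as an int or a hex string."""
    if isinstance(code, str):
        code = int(code, 16)  # accepts an optional "0x" prefix
    return KNOWN_CODES.get(code, f"unknown code 0x{code:08X}")


print(describe("0xC00000FD"))
print(describe(0xE0434352))
```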

All I can say to people who arrive here with a similar issue is: firstly, don't assume the issue is not with your code! And secondly, use Debug Diagnostics.

First of all, what are your app pool's regular recycle time interval and overlapped recycle settings in IIS? If these incidents occur when recycling is scheduled and overlapping is disabled, this behavior is to be expected. Even when overlapping is enabled, I'd guess it is somehow connected to automatic recycling of the app pool, since both instances are affected at roughly the same time and it happens twice a day. Recycling can also cause the warning you mentioned to be logged. (Here you might find how to disable logging this warning in case it is caused by automatic recycling.)

If that leads nowhere, you can find more details about the warning event here: IIS Application Pool Availability

And about the Debug Diagnostics tool here: How to use the Debug Diagnostics tool to troubleshoot an IIS process that stops unexpectedly

