Random/Intermittent Service Unavailable - IIS 7.5

Original post: 2014-02-13 21:07:30 | Tags: asp.net, iis-7.5

We recently deployed a new ASP.NET site to our web servers to replace our old Classic ASP site (both servers run Windows 2008 R2 with IIS 7.5). They are hosted behind a load balancer.

This one .NET WebForms application is used by approximately 30 clients, each with their own URL (client1.mysite.biz, client2.mysite.biz, etc.).

Our original plan was to deploy the new application into 3 "WebSites", each with its own app pool, and bind each client to the relevant WebSite.

When binding, we bound both HTTP and HTTPS for each URL (we have certificates for each of the sites).

INITIAL PROBLEM: After we had bound more than half the sites and tested, we were suddenly greeted with "Service Unavailable. Service is Temporarily Unavailable" (NO NUMBER, just the words) every time. We unbound everything and tried again, meticulously testing each time we bound a site. Each time, after binding a certain number of sites, the same thing happened.

We ran out of downtime and went to Plan B: we put the whole thing in the "Default Website" as a virtual directory with no bindings (this is how the Classic ASP site was set up).

OUR PROBLEM NOW: Occasionally we get the same dreaded white screen with "Service Unavailable. Service is Temporarily Unavailable" (NO NUMBER, just the words). It seems to happen randomly (not load or time dependent, as far as we can tell). With AJAX requests it is simply caught in the "Error" portion of the AJAX code, but I believe it is the same problem. When the error does happen, it appears INSTANTLY. If the user repeats the action that caused the problem, everything is fine (they are not logged out and they proceed on their way).

However, this is happening MULTIPLE times a day, and across ALL of our sites (not just the new one).

One more item of great importance: this appears to be happening to ALL of our sites (virtual directories and custom WebSites) on BOTH of our web servers. That seems to rule out a "bad" server (did I mention that both are in the cloud?), and it also "seems" to rule out App Pool settings, but what do I know?

About our IIS servers: We have multiple application pools running multiple different websites (different code). Some are testing sites. Some use Classic ASP and others use ASP.NET.

What we've tried: We scoured the web for answers and edited our machine.config file to increase all manner of settings (threads, max connections, etc.). We also edited our App Pool settings, increasing the Queue Length and turning on ALL the logs.

Has anyone seen anything like this before? My theory is that it has something to do with the bindings, and that the frequency of the error increases with each binding I add, but that is difficult to test when it only happens on my production servers.

3 Answers

We have finally solved this problem. As mentioned previously, we noticed that the IIS logs contained an sc-win32-status 64 error whenever we experienced the Service Unavailable problem in the browser, and that this happened when (and only when) our site was using the Load Balancer.

To look into this further, we took a network capture of the traffic on the Load Balancer while testing. We reproduced the random Service Unavailable problem, saw the associated sc-win32-status 64 error in the IIS logs, and identified the specific packet in the network capture for this event.

Using Wireshark, we followed the TCP stream and noticed that the TCP connection was reset by the Load Balancer immediately after this packet. We reproduced the problem three times, and every time there was a TCP reset immediately afterwards.

Walking backwards through the TCP stream, we noticed in all three instances a packet for HTTP/1.1 200 (application/octet-stream) and, prior to that, a request to download a document (e.g. .pdf, .xlsx or .docx) from one of our sites. The server that contains all our documents is not a web server and does not have the IIS role active. The document server has no way to define the content/media type for the document being downloaded, hence the generic (application/octet-stream) packet in the network capture. The Load Balancer treated the request for a document as potentially malicious and decided to reset the TCP connection if another request was made. To fix the problem, we added a content type library function to our application using this post as a guide. Sorted!
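The answer does not show the content type function itself. As a rough sketch of the idea only, written in Python rather than the ASP.NET the site actually uses, a lookup from file extension to media type might look like the following. The names `CONTENT_TYPES`, `content_type_for` and `download_headers` are hypothetical and are not taken from the original application; the table covers only the three document types mentioned above.

```python
import os

# Hypothetical extension-to-media-type table (an assumption for illustration);
# extend it with whatever file types the document server actually holds.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# Generic fallback, used only when the extension is unknown.
DEFAULT_CONTENT_TYPE = "application/octet-stream"


def content_type_for(filename: str) -> str:
    """Return a media type for a file name, based on its extension."""
    extension = os.path.splitext(filename)[1].lower()
    return CONTENT_TYPES.get(extension, DEFAULT_CONTENT_TYPE)


def download_headers(filename: str) -> dict:
    """Build the response headers a download handler could send."""
    name = os.path.basename(filename)
    return {
        "Content-Type": content_type_for(name),
        "Content-Disposition": f'attachment; filename="{name}"',
    }


if __name__ == "__main__":
    print(download_headers("reports/Q4.xlsx"))
```

The point of the sketch is that the web application, not the document server, decides the `Content-Type` header, so known document types no longer go out as `application/octet-stream`.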

In Summary:

1.) A document was requested from our document server via our web application.
2.) The document was sent back to the user with a generic content type (application/octet-stream).
3.) The Load Balancer flagged this activity as potentially malicious.
4.) Another request was made within the same TCP connection.
5.) The Load Balancer reset the TCP connection.
6.) The user saw Service Unavailable.

Lesson Learned:

Always define your content/media types if you are serving content from a non-web server, or from a web server running an IIS version older than 7 (heaven forbid).

A UC Certificate was originally meant for Microsoft Exchange, but it can also be used to cover multiple domains. We use one, and it covers 60+ domains (actually 4 or 5 domains with lots of subdomains). We also apply the certificate to a load balancer and two web servers, and we have multiple sites. As far as I can tell, the certificate operates as expected: you can view it from any of the 60+ domains. One odd thing about our setup is that the IIS UI does not let you bind the same certificate to more than one site, so we had to use the appcmd command line interface to bind multiple sites to the same certificate.

After looking more closely at our IIS logs, it appears that there is indeed something that coincides with this behavior. The affected requests are logged as 200 0 64, where 64 is the sc-win32-status: "the specified network name is no longer available".
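For anyone who wants to pull these entries out of their own logs, here is a minimal Python sketch. It assumes the logs are in the W3C extended format, where a `#Fields:` directive names the space-separated columns, and that `sc-win32-status` is among the logged fields. The function names and the sample lines are made up for illustration and are not taken from our logs.

```python
from typing import Dict, Iterable, Iterator, List


def iter_w3c_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per log entry, keyed by the names in the #Fields: directive."""
    fields: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The field list can change mid-file, so re-read it each time it appears.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#") or not fields:
            continue  # other directives, or data before any #Fields: line
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed or truncated rows
        yield dict(zip(fields, values))


def find_win32_status(lines: Iterable[str], status: str = "64") -> List[Dict[str, str]]:
    """Return the entries whose sc-win32-status equals the given code."""
    return [r for r in iter_w3c_records(lines) if r.get("sc-win32-status") == status]


# Made-up sample lines, for illustration only.
SAMPLE_LOG = """\
#Software: Microsoft Internet Information Services 7.5
#Fields: date time cs-method cs-uri-stem sc-status sc-substatus sc-win32-status
2000-01-01 10:00:00 GET /docs/report.pdf 200 0 64
2000-01-01 10:00:01 GET /default.aspx 200 0 0
"""

hits = find_win32_status(SAMPLE_LOG.splitlines())
for r in hits:
    print(r["date"], r["time"], r["cs-uri-stem"],
          r["sc-status"], r["sc-substatus"], r["sc-win32-status"])
```

To run it against a real file, pass an open file object to `find_win32_status` in place of the sample lines; the timestamps it prints can then be compared with the moments users reported the error.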

Our 2 IIS servers are hosted in the cloud at SunGard, and we are using a load balancer that they set up for us. Our theory was that the load balancer "loses" the user's session ID when this 64 error occurs and then has no idea where the user was supposed to go.

We ran some controlled tests. We took one group OFF the load balancer and sent them directly to one of the servers; another group used the load balancer but made sure to connect to that same server. Both groups tried to reproduce the error (which is to say, we clicked a popup on the site over and over).

The results were interesting. The group that was NOT on the load balancer NEVER received the "Service Unavailable" error, BUT the logs showed 45 occurrences of the 64 error for them. The group that WAS on the load balancer produced the "Service Unavailable" message twice, and the logs confirmed exactly 2 instances of the 64 error, coinciding with the exact moments the errors were observed.

So what does this mean?
1.) The load balancer has some settings ("Sticky Sessions"?) that aren't keeping the sessions on the right server, but we can't find the right settings. (It's not even our load balancer, it's SunGard's.) Does anyone have any advice on these settings for ASP.NET?

2.) Are 64 errors just a part of web life? We gave more CPU power to one of our virtual IIS servers and received fewer 64 errors. This is all I can come up with. We've sunk too much time and money into trying to solve this, but it appears that I at least have the option of taking people off the load balancer and routing them to one server or the other. In addition, I can beef up the servers to handle more traffic and reduce the 64 errors.