
Minimize downtime in Azure

We are experiencing a very serious unscheduled outage of our Azure application today, now coming up to 9 hours. We reported it to Azure support, and I do not doubt that the ops team is actively trying to fix the problem. We managed to get our application running on another "test" hosted service that we have, and we redirected our CNAME to point at that instance so our customers are happy, but the "main" hosted service is still unavailable.

My own "finger in the air" instinct is that the issue is network related within our data center (West Europe), and indeed, later in the day the service dashboard went red for that region with a message to that effect. (Our application shows as "Healthy" in the portal but is unreachable via our cloudapp.net URL. Additionally, threads within our application are logging SQL connection exceptions to our storage account because they cannot contact the DB.)

What is very strange, though, is that the "test" instance I referred to above is in the same data centre, has no issues contacting the DB, and its external endpoint is fully available.

I would like to ask the community whether there is anything I could have done better to avoid this downtime. I followed the guidance about having at least 2 role instances per role, yet I still got burned. Should I move to a more reliable data centre? Should I deploy my application to multiple data centres? How would I manage the fact that my SQL Azure DB is in the same data centre?

Any constructive guidance would be appreciated - being a techie, I've never had a more frustrating day, unable to do anything to help fix the issue.

There was an outage in the European data center today affecting SQL Azure. Some of our clients got hit and had to move to another data center.

If you are running mission-critical applications that cannot be down, I would deploy the application into multiple regions. DNS resolution is obviously a weak link right now in Azure, but it can be worked around (if you only run a website, it can be done very simply using Response.Redirect calls or similar).

There is now a data synchronization service from Microsoft that will sync multiple SQL Azure databases. Check here. This way, you can have mirror sites up in different regions and keep them in sync from a SQL Azure perspective.

Also, it would be a good idea to employ a third-party monitoring service that detects problems with your deployed instances externally. AzureWatch can notify you, or even deploy new nodes if you choose, when some of the instances turn "Unresponsive".

Hope this helps

I can offer some guidance based on our experience:

  1. Host your application in multiple data centers, complete with SQL Azure databases. You can connect each application to its data-center-specific SQL Azure server. You can also cache any external assets (images/JS/CSS) on the data-center-specific Windows Azure machine or leverage Azure Blob Storage. Note: Extra costs will be incurred.
  2. Set up one-way SQL replication between your primary SQL Azure DB and the instance in the other data center. If you want to do bidirectional replication, take a look at the MSDN site for guidance.
  3. Leverage Azure Traffic Manager to route traffic to the data center closest to the user. It has geo-detection capabilities, which will also improve the latency of your application. So you can map http://myapp.com to the internal URL of your data center, and a user in Europe should automatically get directed to the European data center, and likewise for the USA. Note: At the time of writing this post, there is no way to automatically detect a failure and fail over to another data center. Manual steps will be involved once a failure is detected, and failover is a complete set (i.e. you will fail over both the Windows Azure AND SQL Azure instances). If you want micro-level failover, then I suggest putting all your config in the service config file and encrypting the values, so you can edit the connection string to connect instance X to DB Y.
  4. You are all set now. I would create or install a local application to detect the availability of the site. A better solution would be to write a diagnostic page or web service that checks the availability of application-specific components, and then poll it from a local computer.
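A minimal sketch of the polling idea in the last point, in Python. The URL, the three-strikes threshold, and all function names are my own assumptions for illustration, not anything Azure provides; the diagnostic page is whatever you expose from your own application.

```python
import urllib.error
import urllib.request


def is_available(url, timeout=5.0, fetch=urllib.request.urlopen):
    """Return True if the diagnostic page answers with HTTP 200.

    `fetch` is injectable so the check can be exercised without a network.
    """
    try:
        with fetch(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, DNS failures and timeouts.
        return False


def consecutive_failures(results):
    """Count how many of the most recent checks failed in a row."""
    count = 0
    for ok in reversed(results):
        if ok:
            break
        count += 1
    return count


def should_alert(results, threshold=3):
    """Alert only after `threshold` failures in a row, to ignore one-off blips."""
    return consecutive_failures(results) >= threshold
```

You would call `is_available("http://myapp.com/diagnostics")` (a hypothetical path) on a timer from a machine outside the data center, append each result to a list, and start the manual failover steps once `should_alert` returns True.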

HTH

As you're deploying to Azure, you don't have much control over how SQL Server is set up. MS have already set it up so that it is highly available.

Having said that, it seems that MS have been having some issues with SQL Azure over the last few days. We've been told that it only affected "a small number of users". At one point the service dashboard had 5 data centres affected by a problem. I had 3 databases in one of those data centres go down twice, for about an hour each time, but one database in another affected data centre had no interruption.

If having a database connection is critical to your app, then the only way in the Azure environment to insure against problems that MS haven't prepared for (this latest technical problem, earthquakes, meteor strikes) would be to also keep your SQL data in another data centre. At the moment the most practical way to do this is to use the sync framework. There is an ability to copy SQL Azure databases, but this only works within a data centre. With your data located elsewhere, you could then point your app at the new database if the main one becomes unavailable.
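One way to sketch the "point your app at the new database" step, assuming you keep an ordered list of connection strings in config and supply your own connectivity probe. The labels, connection strings, and function names below are hypothetical placeholders, not real servers or an Azure API.

```python
from typing import Callable, Iterable, Tuple


def pick_database(
    candidates: Iterable[Tuple[str, str]],
    can_connect: Callable[[str], bool],
) -> Tuple[str, str]:
    """Return the first (label, connection_string) whose probe succeeds."""
    for label, connection_string in candidates:
        try:
            if can_connect(connection_string):
                return label, connection_string
        except Exception:
            # Treat a probe that raises as an unreachable database.
            continue
    raise RuntimeError("no configured database is reachable")


# Hypothetical placeholders: primary first, then the synced copy elsewhere.
CANDIDATES = [
    ("primary-west-europe", "Server=primary.example;Database=app"),
    ("secondary-north-europe", "Server=secondary.example;Database=app"),
]
```

Here `can_connect` would be whatever cheap check you trust, such as opening a connection and running a trivial query. Bear in mind that a synced copy may lag behind the primary, so switching over can mean working with slightly older data.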

While this looks good on paper, it may not have helped you with the latest problem, as that did affect multiple data centres. If you'd just been making database copies on a regular basis, that might have been enough to get you through. Or not.

(I would have posted this answer on Server Fault, but I couldn't find the question there.)

This is just about a programming/architecture issue, but you may also want to ask the question on webmasters.stackexchange.com

You need to find out the root cause before drawing any conclusions.

However, my guess is that one of two things was the problem:

  • The ISP connectivity differs for the test system and your production system. Either they use different ISPs, or different lines from the same ISP. When I worked in a hosting company, we made sure that our IP connectivity went through at least two different ISPs who did not share fibre to our premises (and where we could, they had different physical routes to the building - the homing ability of backhoes when there's a critical piece of fibre to dig up is well proven).

  • Your data centre had an issue with some shared production infrastructure. These might be edge routers, firewalls, load balancers, intrusion detection systems, traffic shapers etc. These are often installed only on production systems. Defences here involve understanding the architecture and making sure the provider has a (tested!) DR plan for restoring SOME service when things go pear-shaped. The neatest hack I saw here was persuading an IPS (intrusion prevention system) that its own management servers were malicious, so you couldn't reconfigure it at all.

Just a thought - your DC doesn't host any of the WikiLeaks mirrors, or PayPal/Mastercard/Amazon (who are getting DDoS'd by WikiLeaks supporters at the moment)?
