
PHP service in Google Cloud Run suddenly stops responding to all incoming requests

Original post: 2022-09-01 18:25:44 · Tags: php, mysql, google-cloud-run

I am running a PHP (8.1) backend application in Google Cloud Run. The backend is connected to a MySQL database running in Google Cloud SQL. Over the last two weeks, we have had three complete outages. The backend server does not respond to any requests, resulting in our app and website being completely down.

Before this happened the first time, the server had been running for many months without any similar problems. I first suspected that this had something to do with the specs of either the backend or the database, but looking at the graphs, I cannot see any obvious reason why it should all go down.

Notice how the traffic goes down every night but is still spiky, before it went completely flat around 4 PM today:

I then looked at the stats from our Cloud Run server to find any indication there. We are running with relatively high specs and a flexible container instance size, so this should not cause trouble like this. Container memory and CPU utilization drop dead all of a sudden. There seems to be no unusual activity before the service decided to die.

Our Sentry dashboard shows no captured events from the downtime period. However, looking at the logs of the backend service in Google Cloud Logs Explorer, there seem to be heaps of 200 responses in this time interval. Looking at the logs, I don't see any indication that anything is wrong.

The only thing I could think of to resolve this problem was to redeploy the service inside Google Cloud Run, effectively spinning up a new container with the exact same code and specs. Then it started working again, and has been working since, but I have no idea what happened. As far as I can tell, we don't have any code- or config-related changes that could lead to problems like this.

Does anyone have any thoughts? The only thing I can think of is some sort of memory leak that suddenly gets out of hand. But I assume that should have been traceable in some way. If this were the case, I would also expect it to have happened more often over a long period of time, rather than running nicely for a long time and then going down 3 times in 2 weeks.

Any help or pointers would be greatly appreciated!

1 Answer

This error might be caused by expected internal behavior within the networking infrastructure of Cloud Run. Occasionally, maintenance occurs on this networking infrastructure, which can cause some connections to be closed. Since Cloud Run's health checks are not network dependent, this can leave the container stuck and unable to communicate with the server. Some Cloud Run applications are sensitive to this kind of network connection drop. By redeploying the Cloud Run container, you "unstuck" it, and it was able to run as intended again.

A recommended workaround and practice is to implement retry logic within your service code. This allows the Cloud Run service to reconnect to the servers when a network connection drops.
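As a rough illustration of what such retry logic could look like in PHP, here is a minimal sketch using exponential backoff. The helper name `withRetry`, its parameters, and the environment variable names in the usage comment are all hypothetical and not taken from the question or from any Cloud Run API. It assumes the database is reached through PDO configured to throw exceptions; since `PDOException` extends `RuntimeException`, catching the latter covers it.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: run $operation and retry it when it throws a
 * RuntimeException (PDOException is a subclass), waiting longer after
 * each failed attempt (exponential backoff).
 */
function withRetry(callable $operation, int $maxAttempts = 5, int $baseDelayMs = 100): mixed
{
    if ($maxAttempts < 1) {
        throw new InvalidArgumentException('maxAttempts must be at least 1');
    }

    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (RuntimeException $e) {
            // Give up and surface the last error once attempts are exhausted.
            if ($attempt >= $maxAttempts) {
                throw $e;
            }
            // Sleep baseDelay, 2x baseDelay, 4x baseDelay, ... (in microseconds).
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
}

// Usage sketch (placeholder DSN and credentials, assumed to come from env vars):
// $pdo = withRetry(fn () => new PDO(
//     (string) getenv('DB_DSN'),
//     (string) getenv('DB_USER'),
//     (string) getenv('DB_PASS'),
//     [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
// ));
```

Note that a retry only helps if each attempt opens a fresh connection; re-running a query on a handle whose underlying socket was already closed would likely fail again. Whether this addresses the outages described in the question is not confirmed by the source.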