What causes 'Cloud Run error: Internal system error, system will retry later'? Suggestions for troubleshooting?
I'm attempting to deploy a Cloud Run Service as part of tests for my open source project. This is done via our automated CI/CD system and has worked successfully hundreds of times previously.
The Cloud Run Service gets created but the first revision never gets deployed. When I look at the newly created Service in the GCP Console, it shows "Cloud Run error: Internal system error, system will retry later." as the main status message for the Service.
The command line that is failing is:
gcloud --configuration=adapt-cloud-gcloud-testing --quiet run deploy cloud-run-gen-name-a179e65d6fdfc19abc57e15df563d8cb --platform=managed --format=json --no-allow-unauthenticated --memory=128M --cpu=1 --image=gcr.io/adapt-ci/http-echo --region=us-central1 --port=5678 --set-env-vars=ADAPT_TEST_DEPLOY_ID=MockDeploy-aymb --args="-text,Adapt Test"
The output from that command (note: the dots after Creating Revision just keep going):
Deploying container to Cloud Run service [cloud-run-gen-name-a179e65d6fdfc19abc57e15df563d8cb] in project [adapt-ci] region [us-central1]
Deploying new service...
Creating Revision....................................................................................................................
The YAML tab in the Console also shows the same message for each of the three status conditions (see below).
To troubleshoot, I have also tried:
What are additional troubleshooting steps I should take to investigate & fix this issue?
Partial info from the YAML tab in the GCP Console for the failing Service:
status:
  observedGeneration: 1
  conditions:
  - type: Ready
    status: Unknown
    message: 'Cloud Run error: Internal system error, system will retry later.'
    lastTransitionTime: '2020-10-08T21:07:20.844314Z'
  - type: ConfigurationsReady
    status: Unknown
    message: 'Cloud Run error: Internal system error, system will retry later.'
    lastTransitionTime: '2020-10-08T21:07:20.755212Z'
  - type: RoutesReady
    status: Unknown
    message: 'Cloud Run error: Internal system error, system will retry later.'
    lastTransitionTime: '2020-10-08T21:07:20.844314Z'
  latestCreatedRevisionName: cloud-run-gen-name-3bab80f75cfd57cf87ad89d9d2c18ba3-00001-fus
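The same status conditions can probably be read from the CLI instead of the Console, which is handy inside a CI job. This is only a sketch: the function name `show_service_conditions` is made up for illustration, and the service name and region are the ones from the failing command above.

```shell
# Sketch: print only the status conditions of a Cloud Run service.
# show_service_conditions is a hypothetical helper, not part of gcloud.
show_service_conditions() {
  service="$1"
  region="$2"
  # --format with a yaml(...) projection limits the output to the named key.
  gcloud run services describe "$service" \
    --platform=managed \
    --region="$region" \
    --format='yaml(status.conditions)'
}

# Usage (values taken from the question):
# show_service_conditions cloud-run-gen-name-a179e65d6fdfc19abc57e15df563d8cb us-central1
```

Add `--configuration=...` or `--project=...` as needed to target the same project the deploy command uses.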
After quite a bit of trial and error, I got everything working again.
The first thing I did that made some progress was to disable the Cloud Run Admin API and re-enable it. After that change, I was able to create a service using the example container from the Console, logged in as the project owner. I was also able to create a service using the example container from the CLI, logged in as the CI service account. However, the original command from my question still behaved exactly as before. I have no idea how the project got into a state where even the project owner couldn't use Cloud Run.
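For anyone who prefers the CLI to the Console for this step, toggling the API might look roughly like the sketch below. `run.googleapis.com` is the service name of the Cloud Run Admin API; the function name `reset_cloud_run_api` and the use of `adapt-ci` as a default project are just illustrative assumptions. Disabling an API can affect resources that already depend on it, so check before doing this in a project that matters.

```shell
# Sketch: disable, then re-enable, the Cloud Run Admin API for one project.
# reset_cloud_run_api is a hypothetical helper name.
PROJECT_ID="${PROJECT_ID:-adapt-ci}"  # assumption: project from the question

reset_cloud_run_api() {
  project="$1"
  # Only re-enable if the disable step succeeded.
  gcloud services disable run.googleapis.com --project="$project" --quiet &&
  gcloud services enable run.googleapis.com --project="$project" --quiet
}

# Usage:
# reset_cloud_run_api "$PROJECT_ID"
```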
The second thing I did was to re-push the container image I was trying to use (gcr.io/adapt-ci/http-echo) to GCR. I pushed the exact same image as was there previously. This finally allowed the CI system to successfully create the Service.
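A re-push along these lines could be scripted as in the sketch below. It assumes a copy of the image is still in the local Docker cache of the machine doing the push, and that Docker authenticates to gcr.io through gcloud credentials; `repush_image` is a hypothetical helper name.

```shell
# Sketch: re-push a locally cached image to Google Container Registry.
# repush_image is a hypothetical helper name.
IMAGE="gcr.io/adapt-ci/http-echo"  # image name from the question

repush_image() {
  image="$1"
  # Register gcloud as a Docker credential helper for gcr.io hosts.
  gcloud auth configure-docker --quiet &&
  # Push the local copy of the image to the registry.
  docker push "$image"
}

# Usage:
# repush_image "$IMAGE"
```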
As part of my earlier troubleshooting, I had looked at Google Container Registry for this project and had confirmed that the needed image was still present. However, we had somewhat recently enabled a lifecycle policy on the Cloud Storage bucket to delete items older than a certain amount of time. So my best guess is that the policy deleted some, but not all, of the files associated with the gcr.io/adapt-ci/http-echo image, and this resulted in the internal error instead of an error saying that the container image couldn't be found.
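If you suspect the same cause, one way to check might be the sketch below. It rests on two assumptions: that images pushed to the gcr.io host are stored in a bucket named artifacts.PROJECT_ID.appspot.com, and that a pull on a machine with no cached layers has to fetch every layer, so it should fail if some are gone. The helper names are made up for illustration.

```shell
# Sketch: inspect the lifecycle rules on the bucket backing gcr.io and
# try a full pull of the image. Helper names are hypothetical.

# Assumption: gcr.io images live in gs://artifacts.PROJECT_ID.appspot.com.
gcr_bucket_for_project() {
  printf 'gs://artifacts.%s.appspot.com\n' "$1"
}

check_image_storage() {
  project="$1"
  image="$2"
  bucket="$(gcr_bucket_for_project "$project")"
  # Show the lifecycle configuration of the backing bucket, if any.
  gsutil lifecycle get "$bucket"
  # Run this where the image is not already cached: layers that exist
  # locally are not downloaded again, which would hide missing files.
  docker pull "$image"
}

# Usage:
# check_image_storage adapt-ci gcr.io/adapt-ci/http-echo
```

If the lifecycle rule is the culprit, removing it from that bucket (or scoping it so it cannot touch image layers) seems safer than relying on periodic re-pushes.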