
What causes 'Cloud Run error: Internal system error, system will retry later'? Suggestions for troubleshooting?

I'm attempting to deploy a Cloud Run Service as part of tests for my open source project. This is done via our automated CI/CD system and has worked successfully hundreds of times previously.

The Cloud Run Service gets created but the first revision never gets deployed. When I look at the newly created Service in the GCP Console, it shows "Cloud Run error: Internal system error, system will retry later." as the main status message for the Service.

The command line that is failing is:

gcloud --configuration=adapt-cloud-gcloud-testing --quiet run deploy cloud-run-gen-name-a179e65d6fdfc19abc57e15df563d8cb --platform=managed --format=json --no-allow-unauthenticated --memory=128M --cpu=1 --image=gcr.io/adapt-ci/http-echo --region=us-central1 --port=5678 --set-env-vars=ADAPT_TEST_DEPLOY_ID=MockDeploy-aymb --args="-text,Adapt Test"

The output from that command (note: the dots after Creating Revision just keep going):

Deploying container to Cloud Run service [cloud-run-gen-name-a179e65d6fdfc19abc57e15df563d8cb] in project [adapt-ci] region [us-central1]
Deploying new service...
Creating Revision....................................................................................................................

The YAML tab in the Console also shows the same message for each of the three status conditions (see below).

To troubleshoot, I have also tried:

  • Manually creating the most basic Cloud Run Service in the GCP Console, using the example container from the getting started docs, while logged in as the project and organization owner. I see the same failure. I have previously created Services this way, with this account and project, with no issues.
  • Using the GCP Console to create the same example Service as above in a different project, but with the same user and in the same org. This works successfully, so the issue is specific to the project.
  • Trying two different US regions, with the same results.
  • Since this is typically automated, I looked for any exceeded quotas. On the Cloud Run quotas page and the overall quotas page, I don't see any exceeded quotas now or historically. However, this is an area I'm not very familiar with, so I may have missed something.
  • Retrying dozens of times over the course of two days.
  • The GCP status page shows no outages.

What are additional troubleshooting steps I should take to investigate & fix this issue?

Partial info from the YAML tab in the GCP Console for the failing Service:

status:
  observedGeneration: 1
  conditions:
  - type: Ready
    status: Unknown
    message: 'Cloud Run error: Internal system error, system will retry later.'
    lastTransitionTime: '2020-10-08T21:07:20.844314Z'
  - type: ConfigurationsReady
    status: Unknown
    message: 'Cloud Run error: Internal system error, system will retry later.'
    lastTransitionTime: '2020-10-08T21:07:20.755212Z'
  - type: RoutesReady
    status: Unknown
    message: 'Cloud Run error: Internal system error, system will retry later.'
    lastTransitionTime: '2020-10-08T21:07:20.844314Z'
  latestCreatedRevisionName: cloud-run-gen-name-3bab80f75cfd57cf87ad89d9d2c18ba3-00001-fus
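
For reference, the same conditions can usually be read from the CLI instead of the Console's YAML tab. A minimal sketch, assuming the service name and region are passed in by the caller; `show_conditions` is a hypothetical helper name, not part of gcloud:

```sh
# Print only the status conditions of a managed Cloud Run service.
# $1: service name, $2: region (both supplied by the caller).
# show_conditions is a hypothetical helper name used for illustration.
show_conditions() {
  gcloud run services describe "$1" \
    --platform=managed \
    --region="$2" \
    --format='yaml(status.conditions)'
}

# Example call (not run automatically):
# show_conditions cloud-run-gen-name-a179e65d6fdfc19abc57e15df563d8cb us-central1
```

This only reads the service, so it is safe to run repeatedly while retrying a deploy.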

After quite a bit of trial and error, I got everything working again.

The first thing I did that made some progress was to disable the Cloud Run Admin API and re-enable it. After that change, I was able to create a service using the example container from the Console, logged in as the project owner. I was also able to create a service using the example container from the CLI, logged in as the CI service account. However, the original command from my question still behaved exactly as before. I have no idea how the project got into this state, such that the project owner couldn't use Cloud Run.
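
From the CLI, that disable and re-enable step might look like the sketch below. It assumes the Cloud Run Admin API is the service named `run.googleapis.com` and that the caller is allowed to manage services on the project; `reset_cloud_run_api` is a hypothetical helper name and the project ID is a placeholder argument.

```sh
# Disable, then re-enable, the Cloud Run Admin API for one project.
# $1: project ID (placeholder supplied by the caller).
# reset_cloud_run_api is a hypothetical helper name used for illustration.
reset_cloud_run_api() {
  # Only re-enable if the disable step succeeded.
  gcloud services disable run.googleapis.com --project="$1" &&
    gcloud services enable run.googleapis.com --project="$1"
}

# Example call (not run automatically):
# reset_cloud_run_api adapt-ci
```

Disabling an API can affect resources in the project that depend on it, so it is worth checking what is running there before trying this on anything other than a test project.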

The second thing I did was to re-push the container image I was trying to use (gcr.io/adapt-ci/http-echo) to GCR. I pushed the exact same image as was there previously. This finally allowed the CI system to successfully create the Service.
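
If a copy of the image is still available locally, the re-push could look roughly like this. The sketch assumes Docker is installed and that gcloud credentials with push access are active; `repush_image` is a hypothetical helper name.

```sh
# Push a locally available image to its registry again.
# $1: full image reference, for example gcr.io/adapt-ci/http-echo
# repush_image is a hypothetical helper name used for illustration.
repush_image() {
  # Register gcloud as a Docker credential helper, then push.
  gcloud auth configure-docker --quiet &&
    docker push "$1"
}

# Example call (not run automatically):
# repush_image gcr.io/adapt-ci/http-echo
```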

As part of my earlier troubleshooting, I had looked at Google Container Registry for this project and confirmed that the needed image was still present. However, we had somewhat recently enabled a lifecycle policy on the Cloud Storage bucket to delete objects older than a certain age. So my best guess is that the policy deleted some, but not all, of the files associated with the gcr.io/adapt-ci/http-echo image, and that this resulted in the internal error instead of an error saying that the container image couldn't be found.
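
To check whether a lifecycle rule is in play, the bucket's configuration can be printed with gsutil. This sketch assumes the default bucket naming for the gcr.io host, `artifacts.PROJECT-ID.appspot.com`; other registry hosts and some project types use different bucket names, and `show_gcr_lifecycle` is a hypothetical helper name.

```sh
# Print the lifecycle configuration of the bucket assumed to back
# gcr.io images for a project.
# $1: project ID (placeholder supplied by the caller).
# show_gcr_lifecycle is a hypothetical helper name used for illustration.
show_gcr_lifecycle() {
  gsutil lifecycle get "gs://artifacts.$1.appspot.com"
}

# Example call (not run automatically):
# show_gcr_lifecycle adapt-ci
```

An age-based delete rule in that output would be consistent with the guess above, though it would not prove which image files were removed.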
