Can get TLS certificates from cert-manager/Let's Encrypt for either testing or production environments in Kubernetes, but not both

Question

I wrote a bash script to automate the deployment of an application in a Kubernetes cluster using helm and kubectl. I use cert-manager to automate the issuing and renewal of the TLS certificates the application needs, which are obtained from Let's Encrypt.

The script can deploy the application to any one of several environments, such as testing (test) and production (prod), using different settings and manifests as needed. For each environment I create a separate namespace and deploy the needed resources in it. In production I use the Let's Encrypt production server (spec.acme.server: https://acme-v02.api.letsencrypt.org/directory), whereas in any other environment, such as testing, I use the staging server (spec.acme.server: https://acme-staging-v02.api.letsencrypt.org/directory). The hostnames I request certificates for differ by environment: xyz.test.mysite.tld in testing vs. xyz.mysite.tld in production. I provide the same contact e-mail address for all environments.
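
As a rough sketch, the per-environment choice described above could be expressed in bash as follows. The function names `acme_server_for_env` and `hostname_for_env` are hypothetical and not taken from the actual deployment script; only the URLs and hostname patterns come from the description.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative, not from the original script.

# Print the ACME directory URL for an environment:
# production uses the live endpoint, everything else uses staging.
acme_server_for_env() {
  case "$1" in
    prod) echo "https://acme-v02.api.letsencrypt.org/directory" ;;
    *)    echo "https://acme-staging-v02.api.letsencrypt.org/directory" ;;
  esac
}

# Print the hostname of a service in an environment:
# <service>.mysite.tld in production, <service>.<env>.mysite.tld elsewhere.
hostname_for_env() {
  local service="$1" env="$2"
  if [ "$env" = "prod" ]; then
    echo "${service}.mysite.tld"
  else
    echo "${service}.${env}.mysite.tld"
  fi
}
```

For example, `hostname_for_env xyz test` yields the testing hostname and `acme_server_for_env test` the staging directory URL.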

Here is the full manifest of the Let's Encrypt issuer for testing:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    email: operations@mysite.tld
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-staging-issuer-private-key
    solvers:
    - http01:
        ingress:
          class: public-test-it-it

And here is the full manifest of the Let's Encrypt issuer for production:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    email: operations@mysite.tld
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-production-issuer-private-key
    solvers:
    - http01:
        ingress:
          class: public-prod-it-it

When I deploy the application for the first time, in either the test or the prod environment, everything works as expected: cert-manager gets the TLS certificates signed by Let's Encrypt (staging or production server respectively) and stores them in secrets. But when I then deploy the application in another environment (so that test and prod run in parallel), cert-manager can no longer get the certificates signed, and the chain certificaterequest->order->challenge stops at the challenge step with the following output:

kubectl describe challenge xyz-tls-certificate
...
Status:
  Presented:   true
  Processing:  true
  Reason:      Waiting for HTTP-01 challenge propagation: wrong status code '404', expected '200'
  State:       pending
Events:        <none>

and I can verify that indeed I get a 404 when trying to curl any of the challenges' URLs:

curl -v http://xyz.test.mysite.tld/.well-known/acme-challenge/IECcFDmQF_fzGKcA9hJvFGEWRjDCAE_fs8dnBXlr_wY
*   Trying vvv.xxx.yyy.zzz:80...
* Connected to xyz.test.mysite.tld (vvv.xxx.yyy.zzz) port 80 (#0)
> GET /.well-known/acme-challenge/IECcFDmQF_fzGKcA9hJvFGEWRjDCAE_fs8dnBXlr_wY HTTP/1.1
> Host: xyz.test.mysite.tld
> User-Agent: curl/7.74.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< date: Thu, 21 Jul 2022 09:48:08 GMT
< content-length: 21
< content-type: text/plain; charset=utf-8
< 
* Connection #0 to host xyz.test.mysite.tld left intact
default backend - 404

So Let's Encrypt can't access the challenge URLs and won't sign the TLS certificates.

I tried to debug the 404 error and found that I can successfully curl the pods and services backing the challenges from another pod running in the same cluster/namespace, but I get 404s from the outside world. This looks like an issue with the ingress controller (haproxytech/kubernetes-ingress in my case), but I can't explain why the mechanism worked on the first deployment and then stopped working.

I inspected the cert-manager logs and found lines such as:

kubectl logs -n cert-manager cert-manager-...
I0721 13:27:45.517637       1 ingress.go:99] cert-manager/challenges/http01/selfCheck/http01/ensureIngress "msg"="found one existing HTTP01 solver ingress" "dnsName"="xyz.test.mysite.tld" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-8668s" "related_resource_namespace"="app-test-it-it" "related_resource_version"="v1" "resource_kind"="Challenge" "resource_name"="xyz-tls-certificate-hwvjf-2516368856-1193545890" "resource_namespace"="app-test-it-it" "resource_version"="v1" "type"="HTTP-01" 
E0721 13:27:45.527238       1 sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="wrong status code '404', expected '200'" "dnsName"="xyz.test.mysite.tld" "resource_kind"="Challenge" "resource_name"="xyz-tls-certificate-hwvjf-2516368856-1193545890" "resource_namespace"="app-test-it-it" "resource_version"="v1" "type"="HTTP-01"

which seems to confirm that cert-manager found the solver ingress for the challenge in place, but that the challenge URLs are not reachable from the outside world (propagation check failed). It seems that cert-manager set up the challenge pods/services/ingresses correctly, but requests sent to the challenge URLs are not routed to the backing pods/services. And this happens only on the second deployment of the app.
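
One way to narrow this down is to check which ingress class the temporary solver ingress carries while the challenge is still pending, and compare it with the class each ingress controller is configured to watch. The sketch below assumes the class set in the solver's `class` field ends up in the `kubernetes.io/ingress.class` annotation of the solver ingress; `ingress_classes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every Ingress in a namespace together with the
# value of its kubernetes.io/ingress.class annotation (empty if unset).
# Assumes the solver's `class` field is written to that annotation.
ingress_classes() {
  local namespace="$1"
  kubectl get ingress -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.kubernetes\.io/ingress\.class}{"\n"}{end}'
}
```

Running `ingress_classes app-test-it-it` while the challenge is pending should list the `cm-acme-http-solver-...` ingress next to its class. If no running controller serves that class, the request falls through to the default backend, which would match the 404 seen above.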

I also verified that, after issuing the certificates upon the first deployment, cert-manager (correctly) removed all related pods/services/ingresses from the related namespace, so there should not be any conflict from duplicated challenges' resources.

I restate here that the certificates are issued flawlessly the first time I deploy the application, either in test or prod environment, but they won't be issued anymore if I deploy the app again in a different environment.

Any idea why this is the case?

Answer 1

I finally found out what the issue was.

Basically, I was installing a separate HAProxy ingress controller (haproxytech/kubernetes-ingress) per environment (test/prod), so each namespace had its own ingress controller, which I referenced in my manifests. This should have worked in principle, but it turned out that cert-manager could not reference the right ingress controller when setting up the Let's Encrypt challenges.

The solution was to create a single HAProxy ingress controller (in its own separate namespace) that serves the whole cluster and is referenced by all other environments/namespaces. This way the challenges for both the testing and the production environment were correctly set up by cert-manager and verified by Let's Encrypt, which signed the required certificates.
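
With one shared controller, the issuers of all environments can point their HTTP-01 solver at the same ingress class. A minimal sketch of how a deployment script might render such an issuer is shown below; `render_issuer` is a hypothetical helper, and the default class name `haproxy` is an assumption that should be replaced by whatever class the shared controller actually watches.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an Issuer manifest whose HTTP-01 solver uses one
# shared ingress class. The default class "haproxy" is an assumption.
render_issuer() {
  local name="$1" server="$2" class="${3:-haproxy}"
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: ${name}
spec:
  acme:
    email: operations@mysite.tld
    server: ${server}
    privateKeySecretRef:
      name: ${name}-issuer-private-key
    solvers:
    - http01:
        ingress:
          class: ${class}
EOF
}
```

The output can be piped into kubectl per namespace, for example `render_issuer letsencrypt-staging https://acme-staging-v02.api.letsencrypt.org/directory | kubectl apply -n app-test-it-it -f -`.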

In the end, I highly recommend using a single HAProxy ingress controller per cluster, installed in its own namespace. This configuration is less redundant and avoids potential issues such as the one I faced.

solution1
1 2022-07-25 10:04:41