How to get GRPC's retry mechanism to work using grpc-java in Kubernetes cluster?

I have been attempting to get GRPC's load balancing working in my Java application deployed to a Kubernetes cluster but I have not been having too much success. There does not seem to be too much documentation around this, but from examples online I can see that I should now be able to use '.defaultLoadBalancingPolicy("round_robin")' when setting up the ManagedChannel (in later versions of GRPC Java lib).

To be more specific, I am using version 1.34.1 of the GRPC Java libraries. I have created two Spring Boot (v2.3.4) applications, one called grpc-sender and one called grpc-receiver.

grpc-sender acts as a GRPC client and defines a (Netty) ManagedChannel as:

@Bean
public ManagedChannel greetingServiceManagedChannel() {
  String host = "grpc-receiver";
  int port = 6565;
  return NettyChannelBuilder.forAddress(host, port)
      .defaultLoadBalancingPolicy("round_robin")
      .usePlaintext().build();
}

Then grpc-receiver acts as the GRPC server:

Server server = ServerBuilder.forPort(6565)
        .addService(new GreetingServiceImpl()).build();

I am deploying these applications to a Kubernetes cluster (running locally in minikube for the time being), and I have created a Service for the grpc-receiver application as a headless service, so that GRPC load balancing can be achieved.

To test failed requests, I do two things:

  • Kill one of the grpc-receiver pods during a test run, e.g. while grpc-sender is sending, say, 5000 requests to grpc-receiver. grpc-sender does detect that the pod has been killed, refreshes its list of receiver pods, and routes future requests to the new pods. As expected, some of the requests that were in flight when the pod was killed fail with GRPC Status UNAVAILABLE.
  • Have some simple logic in grpc-receiver that generates a random number and, if that number is below, say, 0.2, returns GRPC Status INTERNAL rather than OK.

With both of the above, I can get a proportion of the requests during a test run to fail. Now I am trying to get GRPC's retry mechanism to work. From reading the sparse documentation, I am doing the following:

return NettyChannelBuilder.forAddress(host, port)
        .defaultLoadBalancingPolicy("round_robin")
        .enableRetry()
        .maxRetryAttempts(10)
        .usePlaintext().build();

However this seems to have no effect and I cannot see that failed requests are retried at all.

I see that this is still marked as an @ExperimentalApi feature, so should it work as expected and has it been implemented?

If so, is there something obvious I am missing? Anything else I need to do to get retries working?

Any documentation that explains how to do this in more detail?

Thanks very much in advance...

ManagedChannelBuilder.enableRetry().maxRetryAttempts(10) is not sufficient to make retries happen. Retry needs a service config with a RetryPolicy defined. One way is to set a default service config with a RetryPolicy; please see the retry example in https://github.com/grpc/grpc-java/tree/v1.35.0/examples

There's been some confusion on the javadoc of maxRetryAttempts(), and it's being clarified in https://github.com/grpc/grpc-java/pull/7803
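
As a rough illustration of what such a default service config could look like, here is a minimal sketch that builds the config as a plain Java Map. The service name "greeting.GreetingService", the backoff values, and the helper name buildRetryServiceConfig are assumptions made for this example and do not come from the question; the field names (methodConfig, name, retryPolicy, maxAttempts, initialBackoff, maxBackoff, backoffMultiplier, retryableStatusCodes) follow the gRPC service config format. Numbers are stored as Double because that is the numeric type defaultServiceConfig() expects.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RetryServiceConfig {

    // Builds a service config map with a retry policy for one service.
    // serviceName is the fully qualified proto service name (hypothetical here).
    static Map<String, Object> buildRetryServiceConfig(String serviceName, double maxAttempts) {
        Map<String, Object> name = new HashMap<>();
        name.put("service", serviceName);

        Map<String, Object> retryPolicy = new HashMap<>();
        retryPolicy.put("maxAttempts", maxAttempts);   // autoboxed to Double
        retryPolicy.put("initialBackoff", "0.5s");     // assumed value
        retryPolicy.put("maxBackoff", "30s");          // assumed value
        retryPolicy.put("backoffMultiplier", 2.0);     // assumed value
        // Status codes from the question's test setup; retrying INTERNAL
        // is only sensible for this kind of experiment.
        retryPolicy.put("retryableStatusCodes", Arrays.asList("UNAVAILABLE", "INTERNAL"));

        Map<String, Object> methodConfig = new HashMap<>();
        methodConfig.put("name", Collections.singletonList(name));
        methodConfig.put("retryPolicy", retryPolicy);

        Map<String, Object> serviceConfig = new HashMap<>();
        serviceConfig.put("methodConfig", Collections.singletonList(methodConfig));
        return serviceConfig;
    }

    public static void main(String[] args) {
        // Print the config so its structure can be inspected.
        System.out.println(buildRetryServiceConfig("greeting.GreetingService", 5));
    }
}
```

The resulting map would then be passed to the channel builder via .defaultServiceConfig(config) together with .enableRetry().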

Thanks very much @user675693: That worked perfectly :)

The working of maxRetryAttempts() is indeed a bit confusing.

From the documentation I can see that:

"maxAttempts MUST be specified and MUST be a JSON integer value greater than 1. Values greater than 5 are treated as 5 without being considered a validation error."

This refers to the maxAttempts in the service config. If we want more than 5 attempts, I can set maxRetryAttempts(10), for example, in my ManagedChannel setup:

return NettyChannelBuilder.forAddress(host, port)
        .defaultLoadBalancingPolicy("round_robin")
        .defaultServiceConfig(config)
        .enableRetry()
        .maxRetryAttempts(10)
        .usePlaintext().build();

But for that setting to take effect I need to set it to 10 in the service config AND in the ManagedChannel setup code; otherwise only 5 retries are performed. It's not clear from the Javadoc or the documentation, but that's what seems to happen in my testing.
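
The behaviour observed above is consistent with the channel-level setting acting as an upper bound on the service config value. The following is only a toy model of that observation, not grpc-java's actual code; the class name, method name, and the assumed default limit of 5 are illustrative.

```java
public class EffectiveMaxAttempts {

    // Assumed channel-level limit when maxRetryAttempts() is not called.
    static final int ASSUMED_DEFAULT_CHANNEL_LIMIT = 5;

    // Models the observation: the smaller of the two settings wins.
    static int effectiveMaxAttempts(int serviceConfigMaxAttempts, int channelMaxRetryAttempts) {
        return Math.min(serviceConfigMaxAttempts, channelMaxRetryAttempts);
    }

    public static void main(String[] args) {
        // Service config says 10, channel left at the assumed default: prints 5
        System.out.println(effectiveMaxAttempts(10, ASSUMED_DEFAULT_CHANNEL_LIMIT));
        // Channel raised to 10, service config still 5: prints 5
        System.out.println(effectiveMaxAttempts(5, 10));
        // Both set to 10: prints 10
        System.out.println(effectiveMaxAttempts(10, 10));
    }
}
```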

Also, this retry functionality is marked as @ExperimentalApi. How mature is it? Is it suitable for use in production? Is it likely to change drastically?
