
Pulsar function fails to deserialize message because of wrong schema type (JSON instead of AVRO)

When running Pulsar standalone in Docker, we are facing a weird issue when deserializing a message in a specific case. We are using version 2.7.1.

We have a script that creates the topics and functions, after which a schema gets created for the troublesome topic with type JSON. The schema itself is correct, but the type is not. This all happens before any messages are sent. We also enabled set-is-allow-auto-update-schema on the namespace.
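For context, this is roughly how we toggle that policy and check the registered schema type from Java, assuming the standard PulsarAdmin client (the service URL, tenant/namespace and topic name below are placeholders):

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.schema.SchemaInfo;

// Sketch only: URLs and names are placeholders for our actual setup.
try (PulsarAdmin admin = PulsarAdmin.builder()
        .serviceHttpUrl("http://localhost:8080")
        .build()) {

    // Allow clients (producers/functions) to upload new schema versions automatically.
    admin.namespaces().setIsAllowAutoUpdateSchema("tenant/namespace", true);

    // Inspect which schema type is currently registered on the topic.
    SchemaInfo info = admin.schemas()
        .getSchemaInfo("persistent://tenant/namespace/trouble-topic");
    System.out.println(info.getType()); // reports JSON while we are in the broken state
}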

This topic, let's call it trouble-topic, is populated from two sources: a ValidationFunction and a Spring Boot microservice.

ValidationFunction validates the message. If the message is valid, it sends the mapped message to a topic consumed by the Spring Boot microservice, which applies some logic and then sends the result to trouble-topic. If validation fails, the function sends the message directly to trouble-topic.
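In outline, the function does something like this (a simplified sketch; RawInput, MappedClass, the topic names and the helper methods are placeholders for our actual types and logic):

import org.apache.pulsar.client.impl.schema.AvroSchema;
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Simplified sketch of ValidationFunction; types and helpers are placeholders.
public class ValidationFunction implements Function<RawInput, Void> {

    private static final String VALID_TOPIC = "persistent://tenant/namespace/valid-topic";
    private static final String TROUBLE_TOPIC = "persistent://tenant/namespace/trouble-topic";

    @Override
    public Void process(RawInput input, Context context) throws Exception {
        if (isValid(input)) {
            // Valid messages go to the topic consumed by the Spring Boot microservice.
            context.newOutputMessage(VALID_TOPIC, AvroSchema.of(MappedClass.class))
                .value(map(input))
                .sendAsync();
        } else {
            // Failed messages are written directly to trouble-topic.
            context.newOutputMessage(TROUBLE_TOPIC, AvroSchema.of(OurClass.class))
                .value(toOurClass(input))
                .sendAsync();
        }
        return null;
    }

    private boolean isValid(RawInput input) { return true; }              // placeholder
    private MappedClass map(RawInput input) { return new MappedClass(); } // placeholder
    private OurClass toOurClass(RawInput input) { return new OurClass(); } // placeholder
}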

When using sendAsync from the Spring Boot microservice with the following producer, the schema gets updated to type AVRO, and the TroubleFunction reading trouble-topic works fine afterwards:

Producer<OurClass> producer = pulsarClient
    .newProducer(AvroSchema.of(OurClass.class))
    .topic(troubleTopicName)
    .create();
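The actual send is then just an asynchronous publish on that producer, roughly like this (ourClassInstance is a placeholder for the mapped payload):

// Publish asynchronously with the AVRO schema attached to this producer.
producer.sendAsync(ourClassInstance)
    .thenAccept(messageId ->
        System.out.println("Published to " + troubleTopicName + " as " + messageId));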

But if some messages fail validation and are sent directly to trouble-topic before the above producer is ever used, we get a parsing exception. We send the message from the function in the following way:

context.newOutputMessage(troubleTopicName, AvroSchema.of(OurClass.class))
    .value(value)
    .sendAsync();

For some reason this does not update the schema type, and it remains JSON. I validated the schema type at each step using the pulsar-admin CLI. When this happens before the microservice producer updates the schema type for the first time, the TroubleFunction reading trouble-topic fails with the following error:

11:43:49.322 [tenant/namespace/TroubleFunction-0] ERROR org.apache.pulsar.functions.instance.JavaInstanceRunnable - [tenant/namespace/TroubleFunction:0] Uncaught exception in Java Instance
org.apache.pulsar.client.api.SchemaSerializationException: com.fasterxml.jackson.core.JsonParseException: Illegal character ((CTRL-CHAR, code 2)): only regular white space (\r, \n, \t) is allowed between tokens
 at [Source: (byte[])avro-serialized-msg-i-have-to-hide Parsing exception: cvc-complex-type.2.4.a: Invalid content was found starting with element 'ElementName'. One of '{"foo:bar":ElementName}' is expected."; line: 1, column: 2]

So my question is: what is the difference between these two, and why does sending the message from the function not update the schema type correctly? Is it not using the same producer underneath? Also, is there a way to fix this so that the schema type is set on initialization, or at least updated when the message is sent from a function?

First of all, credit where credit is due. I suppose this will be well documented one day, but right now it is not. I was fortunate enough to have an EAP version of the Apache Pulsar in Action book, where this example repository is used to showcase some Pulsar functionality: https://github.com/david-streamlio/GottaEat

I highly recommend the book and going through those examples to anyone working with Pulsar. There was some mention in the Pulsar Slack community that it graduated from MEAP just yesterday, and it should be available in a print edition rather soon as well, so check it out. Also consider joining the Pulsar Slack.


Answer:

This is the piece of code that allowed me to understand how this is supposed to work:

Map<String, ConsumerConfig> inputSpecs = new HashMap<>();
// Pin the schema type of the input topic to AVRO for this function.
inputSpecs.put("persistent://orders/inbound/food-orders",
    ConsumerConfig.builder().schemaType("avro").build());

FunctionConfig functionConfig =
    FunctionConfig.builder()
        ...
        .inputSpecs(inputSpecs)
        ...
        .build();
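For completeness, that functionConfig can then be run with the LocalRunner roughly like this (a sketch, assuming the standard pulsar-functions LocalRunner builder and the functionConfig built above):

import org.apache.pulsar.functions.LocalRunner;

// Run the function locally with the inputSpecs (and therefore the AVRO schema type) applied.
LocalRunner localRunner = LocalRunner.builder()
    .functionConfig(functionConfig)
    .build();
localRunner.start(false); // false = do not block the calling thread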

Java code can be used to set up the function when using the LocalRunner, but the same configuration can be achieved with the pulsar-admin CLI (which we use) and the REST API. You can also use a function config file and specify it in the following way in the configuration YAML:

inputSpecs:
  $topicName:
    schemaType: AVRO

$topicName is, as always, in the format persistent://tenant/namespace/topic.

Once you specify the input specs for (in my case) TroubleFunction, the schema is created with the correct schema type and deserialization works perfectly fine as well.
