
Different Kafka partitions are selected by the Python confluent_kafka library vs the Apache Kafka Java library

I am publishing the same data (topic, key, and value) from a producer based on the Python confluent_kafka library and a producer based on the Apache Kafka Java library, but when the messages are checked on Kafka, they have been published to different partitions.

I was expecting that, by default, both libraries would use the same hash method (murmur2) on the key and determine the same partition when publishing a message to Kafka, but it looks like that is not happening.

Is there a flag or option that needs to be set on the Python library so that it uses the same algorithm and generates the same partition as the Java library, or is there another Python library that should be used to achieve this?

I found a way to force the confluent_kafka Producer to use the murmur2 algorithm to determine the partition (by default, the underlying librdkafka library uses a different partitioner, consistent_random, which is why its partitions differ from the Java client's). You can set the following configuration parameter:

'partitioner': 'murmur2_random'
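As a minimal sketch, the setting goes into the configuration dictionary passed to the producer (the broker address below is a placeholder):

```python
# Producer configuration for confluent_kafka.Producer.
conf = {
    'bootstrap.servers': 'localhost:9092',  # placeholder broker address
    # librdkafka partitioner setting: 'murmur2_random' hashes non-empty keys
    # with murmur2 (matching the Java client) and assigns keyless messages
    # to a random partition.
    'partitioner': 'murmur2_random',
}
```

This dictionary is then passed as `confluent_kafka.Producer(conf)`.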

I've had the same problem. You can do two things:

  • Change the partitioner on the Python side to "murmur2_random". However, this might be a bit fragile.
  • Forward the mixed-up topic to another topic that is written to by a single language's client.

Example solution

  • In the latest version of kafka-python (v2.0.2), the producer's default partitioning algorithm is the same as the Java client's default (murmur2). From https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html : "The default partitioner implementation hashes each non-None key using the same murmur2 algorithm as the java client".

  • By default, both Python and Java should convert the keys to strings and then encode them to bytes using UTF-8. Finally, the bytes are used to compute the murmur2 hash.

  • In theory, the same string should result in the same UTF-8 encoding on pretty much any machine/environment. So once the key strings are the same, we should always get the same murmur2 hash and therefore the same partition, regardless of whether the partition is calculated in Java or Python. That's my understanding, anyway.
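For reference, the key-bytes-to-partition step used by the Java client can be sketched in pure Python. This is an illustrative port of the murmur2 algorithm, not the library's own implementation (kafka-python ships an equivalent):

```python
def murmur2(data: bytes) -> int:
    """Murmur2 hash as used by the Java Kafka client (illustrative port)."""
    length = len(data)
    seed = 0x9747b28c
    m = 0x5bd1e995
    r = 24
    h = (seed ^ length) & 0xffffffff
    n4 = length & ~3  # bytes covered by whole 4-byte chunks
    for i in range(0, n4, 4):
        # read a little-endian 32-bit chunk and mix it in
        k = data[i] | data[i + 1] << 8 | data[i + 2] << 16 | data[i + 3] << 24
        k = (k * m) & 0xffffffff
        k ^= k >> r
        k = (k * m) & 0xffffffff
        h = (h * m) & 0xffffffff
        h ^= k
    # handle the 1-3 trailing bytes (mirrors the Java switch fallthrough)
    rem = length & 3
    if rem == 3:
        h ^= data[n4 + 2] << 16
    if rem >= 2:
        h ^= data[n4 + 1] << 8
    if rem >= 1:
        h ^= data[n4]
        h = (h * m) & 0xffffffff
    h ^= h >> 13
    h = (h * m) & 0xffffffff
    h ^= h >> 15
    return h

def partition_for(key: str, num_partitions: int) -> int:
    # Java client: toPositive(murmur2(keyBytes)) % numPartitions
    return (murmur2(key.encode('utf-8')) & 0x7fffffff) % num_partitions
```

Given identical key bytes and partition count, this computation is what both sides must agree on for the messages to land in the same partition.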

  • A problem that I have seen is in the string conversion. In my case, the partition key is a dictionary object, and converting the dictionary to a string resulted in different whitespace in the string across Java and Python.

  • For example:

      import json

      partition_key = json.dumps({"id":123456})
      print(partition_key)
      # '{"id": 123456}'
      # This result is different from the string we get from Java.
      # Using the widely used Jackson serializer in Java, we get '{"id":123456}'.
      # Notice the extra space in the string generated by Python.
  • The way hashing works, even a one-character difference in a string will result in a different hash. If you have a small number of partitions, you could get lucky and get the same partition for similar strings, but it's not guaranteed.

  • To fix the issue with extra spaces in the Python dict -> string conversion, we can use the separators parameter of json.dumps. For example:

    partition_key = json.dumps({"id":123456}, separators=(',', ':'))
    print(partition_key)
    # '{"id":123456}'
    # Notice there is now no space between the key and the value.
  • Once I did this, I got the same partitioning across both Java and Python.

  • Fixing the extra space fixed the problem for me, but there may be other differences for different keys. In general, ensuring that the Java and Python implementations convert the keys to the EXACT same string, and then use the same encoding and partitioning algorithms, should be a fairly general solution.

  • Of course, having simpler keys (e.g. not dictionaries) is probably a helpful design principle, but this is not always in your control.
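Putting the whitespace fix together, the key-construction step described above can be wrapped in a small helper (a sketch; the function name is illustrative):

```python
import json

def make_partition_key(obj) -> bytes:
    # Compact JSON: no space after ',' or ':', matching Jackson's default
    # output in Java, then UTF-8 encode, since partitioners hash the key's
    # raw bytes.
    return json.dumps(obj, separators=(',', ':')).encode('utf-8')

print(make_partition_key({"id": 123456}))
# b'{"id":123456}'
```

Using one such helper on the Python side ensures the key bytes handed to the partitioner match what the Java producer hashes.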
