Python: how to mock a kafka topic for unit tests?

We have a message scheduler that generates a hash key from a message's attributes and then places the message on a Kafka topic with that key.

This is done for de-duplication purposes. However, I am not sure how I could possibly test this deduplication without actually setting up a local cluster and checking that it is performing as expected.

Searching online for tools for mocking a Kafka topic queue has not helped, and I am concerned that I am perhaps thinking about this the wrong way.

Ultimately, whatever is used to mock the Kafka queue should behave the same way as a local cluster, i.e. provide de-duplication of keyed inserts to a topic.

Are there any such tools?

If you need to verify a Kafka-specific feature, or an implementation that relies on a Kafka-specific feature, then the only way to do it is by using Kafka!

Does Kafka have any tests around its deduplication logic? If so, the combination of the following may be enough to mitigate your organization's perceived risks of failure:

  • unit tests of your hash logic (make sure that the same object does indeed generate the same hash)
  • Kafka topic deduplication tests (internal to Kafka project)
  • pre-flight smoke tests verifying your app's integration with Kafka
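
The first bullet needs no broker at all. A minimal sketch, assuming the key is a SHA-256 digest over the JSON-serialised attributes; `make_key` and the attribute names are hypothetical stand-ins for whatever hash function your scheduler really uses:

```python
import hashlib
import json
import unittest


def make_key(attributes):
    """Hypothetical key function: hash the attributes in a canonical order."""
    # sort_keys makes the serialisation independent of dict insertion order.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class MakeKeyTests(unittest.TestCase):
    def test_same_attributes_give_same_key(self):
        a = {"user": "alice", "action": "email", "at": "2020-01-01T00:00:00Z"}
        b = {"at": "2020-01-01T00:00:00Z", "action": "email", "user": "alice"}
        self.assertEqual(make_key(a), make_key(b))

    def test_different_attributes_give_different_keys(self):
        a = {"user": "alice", "action": "email"}
        b = {"user": "bob", "action": "email"}
        self.assertNotEqual(make_key(a), make_key(b))
```

Run it with `python3 -m unittest`. If two messages that should count as duplicates can differ in some field (a timestamp, say), that field has to be left out of the hash input, and a test like the first one is where you would catch it.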

If Kafka does NOT have any sort of tests around its topic deduplication, or you are concerned about breaking changes, then it is important to have automated checks around Kafka-specific functionality. This can be done through integration tests. I have had much success recently with Docker-based integration test pipelines. After the initial legwork of creating a Kafka docker image (one is probably already available from the community), it becomes trivial to set up integration test pipelines. A pipeline could look like:

  • application-based unit tests are executed (hash logic)
  • once those pass, your CI server starts up Kafka
  • integration tests are executed, verifying that duplicate writes only emit a single message to a topic.
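
The last step of such a pipeline could be sketched as below. This is only an outline under several assumptions: it uses the kafka-python client, reads the broker address from a `KAFKA_BOOTSTRAP_SERVERS` environment variable (a name made up for this example) that the CI job would set after starting Kafka, and relies on topic auto-creation being enabled. The final assertion encodes the behaviour described above; whether it holds depends on how deduplication is actually configured in your deployment, which this sketch does not set up.

```python
import os
import unittest
import uuid

# Assumed environment variable; the CI job would set it after starting Kafka.
BOOTSTRAP = os.environ.get("KAFKA_BOOTSTRAP_SERVERS")


@unittest.skipUnless(BOOTSTRAP, "KAFKA_BOOTSTRAP_SERVERS is not set")
class DuplicateWriteIntegrationTest(unittest.TestCase):
    def test_duplicate_key_yields_single_message(self):
        # Imported here so the module still loads where kafka-python is absent.
        from kafka import KafkaConsumer, KafkaProducer

        topic = "dedup-test-%s" % uuid.uuid4().hex  # fresh topic per run
        key = b"same-hash-key"

        # Write the same key twice, as a duplicate scheduling would.
        producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)
        for _ in range(2):
            producer.send(topic, key=key, value=b"payload")
        producer.flush()
        producer.close()

        consumer = KafkaConsumer(
            topic,
            bootstrap_servers=BOOTSTRAP,
            auto_offset_reset="earliest",
            consumer_timeout_ms=5000,  # stop iterating after 5 s of silence
        )
        seen = [record for record in consumer if record.key == key]
        consumer.close()

        self.assertEqual(len(seen), 1)
```

Because the class is skipped when the variable is unset, the same test module can sit next to the unit tests and only does real work in the pipeline stage that has a broker.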

I think the important thing is to keep the Kafka integration tests to a minimum: include ONLY tests that absolutely rely on Kafka-specific functionality. Even with docker-compose, they may be orders of magnitude slower than unit tests (roughly 1 ms vs. 1 second). Another thing to consider is whether the overhead of maintaining an integration pipeline is worth it, compared with the risk of simply trusting that Kafka provides the topic deduplication it claims to.
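
Everything on the application side of that boundary can be tested with a stand-in producer. A rough sketch using `unittest.mock`: the `Scheduler` class is hypothetical, and the `send(topic, key=..., value=...)` call shape is an assumption modelled on kafka-python's producer, so adapt both to your own code. The test only checks that identical messages are handed to the producer with identical keys; it says nothing about what Kafka does with them afterwards.

```python
import hashlib
import json
import unittest
from unittest import mock


class Scheduler:
    """Hypothetical scheduler: hashes the attributes and sends with that key."""

    def __init__(self, producer, topic):
        self._producer = producer
        self._topic = topic

    def schedule(self, attributes):
        body = json.dumps(attributes, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(body).hexdigest().encode("ascii")
        self._producer.send(self._topic, key=key, value=body)


class SchedulerTests(unittest.TestCase):
    def test_identical_messages_are_sent_with_identical_keys(self):
        producer = mock.Mock()  # stands in for the real Kafka producer
        scheduler = Scheduler(producer, "jobs")

        scheduler.schedule({"user": "alice", "action": "email"})
        scheduler.schedule({"action": "email", "user": "alice"})

        # Both calls reach the producer; the keys they carry must match.
        self.assertEqual(producer.send.call_count, 2)
        first, second = producer.send.call_args_list
        self.assertEqual(first.kwargs["key"], second.kwargs["key"])
        self.assertEqual(first.args, ("jobs",))
```

Injecting the producer through the constructor is what makes this possible; a scheduler that builds its own producer internally would need `mock.patch` instead.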

To mock Kafka in Python unit tests that are run from SBT test tasks, I did the following. PySpark must be installed.

In build.sbt, define a task that should run together with the tests:

val testPythonTask = TaskKey[Unit]("testPython", "Run python tests.")

val command = "python3 -m unittest app_test.py"
val workingDirectory = new File("./project/src/main/python")

testPythonTask := {
  val s: TaskStreams = streams.value
  s.log.info("Executing task testPython")
  // On sbt 1.x, Process comes from scala.sys.process (import scala.sys.process._)
  Process(command,
    workingDirectory,
    // arguments for using org.apache.spark.streaming.kafka.KafkaTestUtils in Python
    "PYSPARK_SUBMIT_ARGS" -> "--jars %s pyspark-shell"
      // collect all jar paths from project
      .format((fullClasspath in Runtime).value
        .map(_.data.getCanonicalPath)
        .filter(_.endsWith(".jar"))
        .mkString(",")),
    "PYSPARK_PYTHON" -> "python3") ! s.log
}

//attach custom test task to default test tasks
test in Test := {
  testPythonTask.value
  (test in Test).value
}

// testOnly is an input task, so its previous definition is run with .evaluated
testOnly in Test := {
  testPythonTask.value
  (testOnly in Test).evaluated
}

In the Python test case (app_test.py). Note that pyspark.streaming.kafka exists only in Spark 2.x; it was removed in Spark 3.0:

import random
import unittest
from itertools import chain

from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming.tests import PySparkStreamingTestCase

class KafkaStreamTests(PySparkStreamingTestCase):
    timeout = 20  # seconds
    duration = 1

    def setUp(self):
        super(KafkaStreamTests, self).setUp()

        kafkaTestUtilsClz = self.ssc._jvm.java.lang.Thread.currentThread().getContextClassLoader()\
            .loadClass("org.apache.spark.streaming.kafka.KafkaTestUtils")
        self._kafkaTestUtils = kafkaTestUtilsClz.newInstance()
        self._kafkaTestUtils.setup()

    def tearDown(self):
        if self._kafkaTestUtils is not None:
            self._kafkaTestUtils.teardown()
            self._kafkaTestUtils = None

        super(KafkaStreamTests, self).tearDown()

    def _randomTopic(self):
        return "topic-%d" % random.randint(0, 10000)

    def _validateStreamResult(self, sendData, stream):
        result = {}
        for i in chain.from_iterable(self._collect(stream.map(lambda x: x[1]),
                                                   sum(sendData.values()))):
            result[i] = result.get(i, 0) + 1

        self.assertEqual(sendData, result)

    def test_kafka_stream(self):
        """Test the Python Kafka stream API."""
        topic = self._randomTopic()
        sendData = {"a": 3, "b": 5, "c": 10}

        self._kafkaTestUtils.createTopic(topic)
        self._kafkaTestUtils.sendMessages(topic, sendData)

        stream = KafkaUtils.createStream(self.ssc, self._kafkaTestUtils.zkAddress(),
                                         "test-streaming-consumer", {topic: 1},
                                         {"auto.offset.reset": "smallest"})
        self._validateStreamResult(sendData, stream)

More examples (Flume, Kinesis and others) can be found in the pyspark.streaming.tests module.

Here is an example of an automated test in Python for Kafka-related functionality: https://github.com/up9inc/async-ms-demo/blob/main/grayscaler/tests.py

It uses the "Kafka Mock" capability of the http://mockintosh.io project.

Disclaimer: I'm affiliated with that project.
