
Is Structured Streaming the only option for Python + Spark 3.1.1 + Kafka?

The doc for streaming integration doesn't contain a Python section. Does this mean Python is not supported?

On the other hand, in Structured Streaming, Kafka puts everything into one or two columns (key and value), so SQL operations make little sense here out of the box. The only way to introduce pure Python processing is UDFs, which are expensive. Is this true?

Many people use Structured Streaming with Kafka without problems. Spark puts everything into those two columns because that's how Kafka works (and other systems, like EventHubs, Kinesis, etc.): both key and value are just binary blobs from Kafka's point of view, and Kafka knows nothing about what's inside. It's up to the developer to decide what to put in that blob - a plain string, Avro, JSON, etc.
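To make the "opaque blobs" point concrete, here is a minimal plain-Python sketch (no Spark involved; the record shape and field contents are purely illustrative) showing that only the producer/consumer code gives the bytes any meaning:

```python
import json

# A Kafka record as the broker sees it: key and value are opaque byte blobs.
# The producer chose to put a UTF-8 string in the key and JSON in the value;
# Kafka itself knows nothing about either format.
record = {
    "key": b"device-42",
    "value": b'{"temperature": 21.5, "unit": "C"}',
}

# Only consumer-side code can interpret the blobs:
key = record["key"].decode("utf-8")
payload = json.loads(record["value"].decode("utf-8"))

print(key)                     # device-42
print(payload["temperature"])  # 21.5
```

The same bytes could just as well carry Avro, Protobuf, or a plain string; the deserialization step is always a deliberate choice on the consumer side.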

A typical workflow with Kafka and Structured Streaming looks as follows (everything is done via Spark APIs, without the need for UDFs, and is very efficient):

  • read data with spark.readStream
  • cast value (and maybe key) to a specific type, e.g., string if JSON is used, or leave it as binary if Avro is used
  • parse the payload depending on the format, e.g., with from_json for JSON or from_avro for Avro
  • promote fields from the payload to the top level of the dataframe

For example, for JSON as value:

import pyspark.sql.functions as F

json_schema = ...  # put the structure of your JSON payload here
df = spark.readStream \
  .format("kafka") \
  .options(**kafka_options) \
  .load() \
  .withColumn("value", F.col("value").cast("string")) \
  .withColumn("json", F.from_json(F.col("value"), json_schema)) \
  .select("json.*", "*")
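Conceptually, the last two steps parse the string column into a struct and flatten that struct into top-level columns. A plain-Python analogy (the row contents are illustrative, not tied to any real topic) of what from_json plus select("json.*", "*") does per row:

```python
import json

# One row as it comes off the Kafka source, after the cast to string:
row = {"key": "device-42", "value": '{"temperature": 21.5, "unit": "C"}'}

# from_json(...) parses the string column into a struct of named fields...
parsed = json.loads(row["value"])

# ...and select("json.*", "*") promotes those fields to top-level
# columns alongside the original key/value columns.
flat = {**parsed, **row}

print(sorted(flat))  # ['key', 'temperature', 'unit', 'value']
```

After this flattening, ordinary column expressions, filters, and aggregations work directly on the payload fields, which is why no UDFs are needed for this kind of processing.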
