
Is Structured Streaming the only option for Python + Spark 3.1.1 + Kafka?

The doc for streaming integration doesn't contain a Python section. Does this mean Python is not supported?

On the other hand, in Structured Streaming, Kafka puts everything into one or two columns (key and value), so SQL operations make little sense here out of the box. The only way to introduce pure Python processing is UDFs, which are expensive. Is this true?

Many people use Structured Streaming with Kafka without problems. Spark puts everything into those two columns because that's how Kafka works (and other systems, like EventHubs, Kinesis, etc.): both key and value are just binary blobs from Kafka's point of view, and Kafka knows nothing about what's inside. It's up to the developer to decide what to put in that blob - a plain string, Avro, JSON, etc.
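To make the "opaque blobs" point concrete, here is a minimal plain-Python sketch (no Spark involved; the record shape and field contents are purely illustrative) showing that only the producer/consumer code gives the bytes any meaning:

```python
import json

# A Kafka record as the broker sees it: key and value are opaque byte blobs.
# The producer chose to put a UTF-8 string in the key and JSON in the value;
# Kafka itself knows nothing about either format.
record = {
    "key": b"device-42",
    "value": b'{"temperature": 21.5, "unit": "C"}',
}

# Only consumer-side code can interpret the blobs:
key = record["key"].decode("utf-8")
payload = json.loads(record["value"].decode("utf-8"))

print(key)                     # device-42
print(payload["temperature"])  # 21.5
```

The same bytes could just as well carry Avro, Protobuf, or a plain string; the deserialization step is always a deliberate choice on the consumer side.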

A typical workflow with Kafka and Structured Streaming looks as follows (everything is done via Spark APIs, without the need for UDFs, and is very efficient):

  • read data with spark.readStream
  • cast value (and maybe key) to a specific type, e.g., string if JSON is used, or leave it as binary if Avro is used
  • parse the payload depending on the format, e.g., with from_json for JSON or from_avro for Avro
  • promote fields from the payload to the top level of the dataframe

For example, for JSON as value:

import pyspark.sql.functions as F

json_schema = ...  # put the structure of your JSON payload here
df = spark.readStream \
  .format("kafka") \
  .options(**kafka_options) \
  .load() \
  .withColumn("value", F.col("value").cast("string")) \
  .withColumn("json", F.from_json(F.col("value"), json_schema)) \
  .select("json.*", "*")
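Conceptually, the last two steps parse the string column into a struct and flatten that struct into top-level columns. A plain-Python analogy (the row contents are illustrative, not tied to any real topic) of what from_json plus select("json.*", "*") does per row:

```python
import json

# One row as it comes off the Kafka source, after the cast to string:
row = {"key": "device-42", "value": '{"temperature": 21.5, "unit": "C"}'}

# from_json(...) parses the string column into a struct of named fields...
parsed = json.loads(row["value"])

# ...and select("json.*", "*") promotes those fields to top-level
# columns alongside the original key/value columns.
flat = {**parsed, **row}

print(sorted(flat))  # ['key', 'temperature', 'unit', 'value']
```

After this flattening, ordinary column expressions, filters, and aggregations work directly on the payload fields, which is why no UDFs are needed for this kind of processing.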
