
How to process the dataframe which was read from Kafka Topic using Spark Streaming

I'm able to stream Twitter data into my Kafka topic via a producer. When I try to consume it through the default Kafka console consumer, I can see the tweets as well.

(screenshot: tweets shown by the Kafka console consumer)

But when I try to use Spark Streaming to consume and process it further, I can't find resources to refer to. This is what my consumer looks like:

from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.appName('LinkitTest').getOrCreate()

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "tweets") \
  .option("startingOffsets", "earliest") \
  .load()

#df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

print(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))

query = df.writeStream.format("console").start()
time.sleep(10)  # let the stream run for 10 seconds
query.stop()

Even when I do spark-submit I see the tweets in the topic, but the values aren't readable:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 kafka_consumer.py

(screenshot: spark-submit console output with unreadable binary values)

I can't figure out how to at least print the column values (the tweets, in this case) from the dataframe I have. Any help would be appreciated.

UPDATE

I was able to print the values to the console, but as you can see they aren't readable. How can I convert this to a readable String?

from pyspark.sql.functions import col

query = df.select(col("value"))\
  .writeStream\
  .format("console")\
  .start()

(screenshot: console output showing the value column as raw bytes)

Instead of

print(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))

You want

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()

But that works for batch dataframes, not streaming ones. For streaming dataframes, you need to cast before writing.

from pyspark.sql.functions import col

df.select(col("value").cast("string"))\
  .writeStream\
  .format("console")\
  .start()
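The reason the cast is needed: Kafka delivers keys and values as raw bytes, so without it the console sink prints an unreadable binary column. `CAST(value AS STRING)` simply decodes those bytes as UTF-8. A minimal pure-Python sketch of the same decoding, using a hypothetical tweet payload:

```python
# Kafka message values arrive as raw bytes; Spark's CAST(value AS STRING)
# decodes them as UTF-8, the same thing bytes.decode does in plain Python.
raw_value = b'{"text": "hello from twitter"}'  # hypothetical payload

readable = raw_value.decode("utf-8")
print(readable)  # a human-readable JSON string instead of a byte dump
```

Once the value is a string, you can go further and parse the tweet JSON with `from_json` and a schema, but the cast alone is enough to make the console output readable.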

Twitter data into Kafka via a producer

You don't need Spark for this. You can use tweepy and kafka-python directly.
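A sketch of that approach, assuming a broker on `localhost:9092` and the same `tweets` topic as above (the topic name and the sample tweet dict are hypothetical). The serializer function is the only part that runs without a live broker:

```python
import json

def serialize_tweet(tweet: dict) -> bytes:
    """Encode a tweet dict as UTF-8 JSON bytes, the format a Kafka producer sends."""
    return json.dumps(tweet).encode("utf-8")

if __name__ == "__main__":
    # Hypothetical usage: requires a running broker and `pip install kafka-python`.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="localhost:9092",
                             value_serializer=serialize_tweet)
    # In a real pipeline the dict would come from a tweepy stream listener.
    producer.send("tweets", {"text": "hello from twitter"})
    producer.flush()
```

The consumer-side cast shown earlier is the mirror image of this serializer: JSON is encoded to bytes here and decoded back to a string in Spark.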
