pySpark Structured Streaming from Kafka does not output to console for debugging
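Below is my code. I have tried many variations of the select, and the application runs, but it does not show the messages that are being written every second. I have a Spark Streaming example that uses pprint() and confirms that Kafka really is receiving messages every second. The messages in Kafka are JSON-formatted; see the schema for the field/column labels: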
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import statistics

KAFKA_TOPIC = "vehicle_events_fast_testdata"
KAFKA_SERVER = "10.2.0.6:2181"

if __name__ == "__main__":
    print("NXB PySpark Structured Streaming with Kafka Demo Started")

    spark = SparkSession \
        .builder \
        .appName("PySpark Structured Streaming with Kafka Demo") \
        .master("local[*]") \
        .config("spark.jars", "/home/cldr/streams-dev/libs/spark-sql-kafka-0-10_2.11-2.4.4.jar,/home/cldr/streams-dev/libs/kafka-clients-2.0.0.jar") \
        .config("spark.executor.extraClassPath", "/home/cldr/streams-dev/libs/spark-sql-kafka-0-10_2.11-2.4.4.jar:/home/cldr/streams-dev/libs/kafka-clients-2.0.0.jar") \
        .config("spark.executor.extraLibrary", "/home/cldr/streams-dev/libs/spark-sql-kafka-0-10_2.11-2.4.4.jar:/home/cldr/streams-dev/libs/kafka-clients-2.0.0.jar") \
        .config("spark.driver.extraClassPath", "/home/cldr/streams-dev/libs/spark-sql-kafka-0-10_2.11-2.4.4.jar:/home/cldr/streams-dev/libs/kafka-clients-2.0.0.jar") \
        .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    # Schema of the JSON payload carried in each Kafka message value
    schema = StructType() \
        .add("WheelAngle", IntegerType()) \
        .add("acceleration", IntegerType()) \
        .add("heading", IntegerType()) \
        .add("reading_time", IntegerType()) \
        .add("tractionForce", IntegerType()) \
        .add("vel_latitudinal", IntegerType()) \
        .add("vel_longitudinal", IntegerType()) \
        .add("velocity", IntegerType()) \
        .add("x_pos", IntegerType()) \
        .add("y_pos", IntegerType()) \
        .add("yawrate", IntegerType())

    # Construct a streaming DataFrame that reads from testtopic
    trans_det_df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", KAFKA_SERVER) \
        .option("subscribe", KAFKA_TOPIC) \
        .option("startingOffsets", "latest") \
        .load() \
        .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)", "CAST(topic as STRING)")

    #(from_json(col("value").cast("string"),schema))
    #Q1 = trans_det_df.select(from_json(col("value"), schema).alias("parsed_value"), "timestamp")
    #Q2 = trans_det_d.select("parsed_value*", "timestamp")

    # Start the console sink, then block until the query is stopped
    query = trans_det_df.writeStream \
        .format("console") \
        .option("truncate", "false") \
        .start()

    query.awaitTermination()
kafka.bootstrap.servers is the address of a Kafka broker (default port 9092), not of Zookeeper (port 2181).
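A minimal sketch of that fix, for illustration only. It assumes the broker runs on the same host as in the question and listens on the default port 9092, which may not match your cluster. The helper names suspicious_bootstrap_servers and read_kafka_stream are hypothetical and not part of Spark or Kafka.

```python
ZOOKEEPER_DEFAULT_PORT = 2181  # Zookeeper's default client port


def suspicious_bootstrap_servers(servers):
    """Return the host:port entries that use Zookeeper's default port."""
    flagged = []
    for entry in servers.split(","):
        entry = entry.strip()
        _host, _sep, port = entry.rpartition(":")
        if port.isdigit() and int(port) == ZOOKEEPER_DEFAULT_PORT:
            flagged.append(entry)
    return flagged


# Assumption: same host as in the question, default broker port.
KAFKA_SERVER = "10.2.0.6:9092"
KAFKA_TOPIC = "vehicle_events_fast_testdata"


def read_kafka_stream(spark, servers, topic):
    """Build the streaming DataFrame; `spark` is an existing SparkSession."""
    flagged = suspicious_bootstrap_servers(servers)
    if flagged:
        raise ValueError(
            f"These look like Zookeeper addresses, not Kafka brokers: {flagged}"
        )
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", servers)
        .option("subscribe", topic)
        .load()
    )
```

Calling read_kafka_stream(spark, KAFKA_SERVER, KAFKA_TOPIC) in place of the original readStream chain would fail fast on an address such as 10.2.0.6:2181 instead of starting a query that prints nothing.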
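Also note that your starting offsets are set to latest, so you have to produce data after the streaming application has started.
If you want to see data that already exists in the topic, use the earliest offsets.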
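As a rough sketch of the offsets advice, the source options can be collected in one place and switched between earliest and latest. The helper names kafka_debug_options and start_console_query are hypothetical, and the broker address in the usage note is an assumption (same host as in the question, default port 9092).

```python
def kafka_debug_options(servers, topic, from_beginning=True):
    """Build Kafka source options for a console-debugging query."""
    return {
        "kafka.bootstrap.servers": servers,
        "subscribe": topic,
        # "earliest" replays what is already in the topic;
        # "latest" only shows messages produced after the query starts.
        "startingOffsets": "earliest" if from_beginning else "latest",
    }


def start_console_query(spark, options):
    """Start a console sink; `spark` is an existing SparkSession."""
    df = (
        spark.readStream
        .format("kafka")
        .options(**options)
        .load()
        .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS STRING)", "topic")
    )
    return (
        df.writeStream
        .format("console")
        .option("truncate", "false")
        .start()
    )
```

For example, start_console_query(spark, kafka_debug_options("10.2.0.6:9092", "vehicle_events_fast_testdata")).awaitTermination() should print the existing messages first. Keep in mind that startingOffsets only takes effect when a query starts fresh; a query resumed from a checkpoint continues from where it left off.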