
Multiple writeStreams in Spark Structured Streaming (Pyspark)

I have successfully implemented a single writeStream in PySpark, but once I add a second writeStream, only the first gets printed to the console. Here is my code:

import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.scheduler.allocation.file", "file:///opt/spark/conf/fairscheduler.xml")

spark = SparkSession \
    .builder \
    .appName("SparkStreaming") \
    .config(conf=conf) \
    .getOrCreate()

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")

schema = StructType([
    StructField("text", StringType(), True),
    StructField("created_at", TimestampType(), True)
])

tweets_df1 = spark \
    .readStream \
    .format("socket") \
    .option("host", "127.0.0.1") \
    .option("port", 9999) \
    .load() \
    .select(F.from_json(F.col("value").cast("string"), schema).alias("tmp")).select("tmp.*")

q1 = tweets_df1 \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("checkpointLocation", "/home/ubuntu/apache-spark-streaming-twitter-1/chk1") \
    .trigger(processingTime='5 seconds') \
    .start()

q2 = tweets_df1 \
    .withColumn("foo", F.lit("foo")) \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("checkpointLocation", "/home/ubuntu/apache-spark-streaming-twitter-1/chk2") \
    .trigger(processingTime='5 seconds') \
    .start()

spark.streams.awaitAnyTermination()

And here is my output:

-------------------------------------------
Batch: 0
-------------------------------------------
-------------------------------------------
Batch: 0
-------------------------------------------
+----+----------+
|text|created_at|
+----+----------+
+----+----------+

+----+----------+---+
|text|created_at|foo|
+----+----------+---+
+----+----------+---+

-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------+-------------------+
|                text|         created_at|
+--------------------+-------------------+
|Qatar posting for...|2022-12-16 20:23:06|
+--------------------+-------------------+

-------------------------------------------
Batch: 2
-------------------------------------------
+--------------------+-------------------+
|                text|         created_at|
+--------------------+-------------------+
|Who will win this...|2022-12-16 20:23:13|
+--------------------+-------------------+

The dataframe with the foo column stops after batch 0, meaning the second writeStream is not running. I can confirm this from the checkpoint folder of each writeStream. Most of the solutions to this problem are in Scala, and I have tried to translate them to PySpark.

Is this just something that is not possible in Pyspark?

Most probably this happens because a socket can be consumed only once, so one of the streams is "winning". If you want multiple consumers, consider putting your messages into something durable, for example Kafka; then each stream can consume the messages independently of the others.
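
For illustration, here is a minimal sketch of the Kafka-based approach. The broker address localhost:9092 and the topic name tweets are assumptions, and the Kafka source requires the spark-sql-kafka-0-10 package on the classpath. The key point is that each query calls readStream itself, so each one gets its own consumer with its own offsets tracked in its checkpoint:

import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("SparkStreaming") \
    .getOrCreate()

schema = StructType([
    StructField("text", StringType(), True),
    StructField("created_at", TimestampType(), True)
])

def read_tweets():
    # Each call builds a separate streaming DataFrame with its own
    # Kafka consumer, so the two queries no longer compete for input.
    # Broker address and topic name are hypothetical.
    return spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "tweets") \
        .load() \
        .select(F.from_json(F.col("value").cast("string"), schema).alias("tmp")) \
        .select("tmp.*")

q1 = read_tweets() \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("checkpointLocation", "/home/ubuntu/apache-spark-streaming-twitter-1/chk1") \
    .trigger(processingTime='5 seconds') \
    .start()

q2 = read_tweets() \
    .withColumn("foo", F.lit("foo")) \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("checkpointLocation", "/home/ubuntu/apache-spark-streaming-twitter-1/chk2") \
    .trigger(processingTime='5 seconds') \
    .start()

spark.streams.awaitAnyTermination()

With this setup both queries print every micro-batch, each reading the full topic independently from its own checkpointed offsets.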
