I have written some sample code that runs multiple streaming queries, but I only get output from the first query. In the logs I can see that all the queries are running. I am not sure what I am doing wrong.
import java.io.Serializable;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.streaming.StreamingQueryException;

// Reads lines from a socket and publishes the raw stream.
public class A extends D implements Serializable {
    // hostname and port are defined elsewhere (not shown in the post)
    public Dataset<Row> getDataSet(SparkSession session) {
        Dataset<Row> dfs = session.readStream().format("socket")
                .option("host", hostname).option("port", port).load();
        publish(dfs.toDF(), "reader");
        return dfs;
    }
}

// Splits each line into words and publishes the result.
public class B extends D implements Serializable {
    public Dataset<Row> execute(Dataset<Row> ds) {
        Dataset<Row> d = ds.select(functions.explode(functions.split(ds.col("value"), "\\s+")));
        publish(d.toDF(), "component");
        return d;
    }
}

// Publishes the words again and also writes them to a second CSV sink.
public class C extends D implements Serializable {
    public Dataset<Row> execute(Dataset<Row> ds) {
        // The post used an undefined variable inputDataSet here; ds is the only Dataset in scope
        publish(ds.toDF(), "console");
        ds.writeStream().format("csv").option("path", "hdfs://hostname:9000/user/abc/data1/")
                .option("checkpointLocation", "hdfs://hostname:9000/user/abc/cp")
                .outputMode("append").start();
        return ds;
    }
}

// Base class: every publish() call starts its own streaming query.
public class D {
    public void publish(Dataset<Row> dataset, String directory) {
        dataset.writeStream().format("csv")
                .option("path", "hdfs://hostname:9000/user/abc/" + directory)
                .option("checkpointLocation", "hdfs://hostname:9000/user/abc/checkpoint/" + directory)
                .outputMode("append")
                .start();
    }
}

public static void main(String[] args) {
    SparkSession session = createSession();
    try {
        A a = new A();
        Dataset<Row> records = a.getDataSet(session);
        B b = new B();
        Dataset<Row> ds = b.execute(records);
        C c = new C();
        c.execute(ds);
        // Block until any one of the started queries terminates
        session.streams().awaitAnyTermination();
    } catch (StreamingQueryException e) {
        e.printStackTrace();
    }
}
The problem is caused by the input socket that you are reading from. The Spark socket source opens a separate connection to nc for each query you start, and your code calls start() more than once. It is a limitation of nc that it can feed data to only one connection. With other input sources, your queries should run fine. See the related question: Executing separate streaming queries in spark structured streaming
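The single-connection behaviour can be reproduced without Spark. The following is a minimal sketch in plain Java, not Spark or nc itself: a ServerSocket stands in for nc and accepts and feeds only one connection, while two client sockets stand in for two streaming queries. The class name SingleConnectionFeedDemo, the helper readLineOrTimeout, and the timeout values are all made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

class SingleConnectionFeedDemo {

    // Reads one line, or returns "<no data>" if nothing arrives within timeoutMillis.
    static String readLineOrTimeout(Socket client, int timeoutMillis) throws IOException {
        client.setSoTimeout(timeoutMillis);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
        try {
            String line = in.readLine();
            return line == null ? "<closed>" : line;
        } catch (SocketTimeoutException e) {
            return "<no data>";
        }
    }

    // Returns { line read by the served client, result for the other client }.
    static String[] run() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 50, loopback);
             // Two clients connect, like two queries reading the same host and port
             Socket clientA = new Socket(loopback, server.getLocalPort());
             Socket clientB = new Socket(loopback, server.getLocalPort());
             // Assumption mirroring the answer: the server accepts and feeds one connection only
             Socket accepted = server.accept()) {

            OutputStream out = accepted.getOutputStream();
            out.write("hello world\n".getBytes(StandardCharsets.UTF_8));
            out.flush();

            // Work out which client ended up on the accepted connection
            boolean aServed = accepted.getPort() == clientA.getLocalPort();
            Socket served = aServed ? clientA : clientB;
            Socket other = aServed ? clientB : clientA;

            return new String[] {
                readLineOrTimeout(served, 2000),
                readLineOrTimeout(other, 500)
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] result = run();
        System.out.println("served connection read: " + result[0]);
        System.out.println("other connection read:  " + result[1]);
    }
}
```

Both clients connect successfully, but only the accepted one receives the line; the other stays connected and reads nothing. That matches the symptom in the question, where every query is running but only one produces output.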
I tried a simple test like the one below, where each query reads from its own socket, and it prints the output of both queries:
// spark is the SparkSession (as provided by spark-shell)
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

// Two independent socket sources, one per port
val df1 = spark.readStream.format("socket").option("host", "localhost").option("port", 5430).load()
val df9 = spark.readStream.format("socket").option("host", "localhost").option("port", 5431).load()

val df2 = df1.as[String].flatMap(x => x.split(","))
val df3 = df9.as[String].flatMap(x => x.split(",")).select($"value".as("name"))

// Each start() launches a separate streaming query
val sq1 = df3.writeStream.format("console").queryName("sq1")
  .option("truncate", "false").trigger(Trigger.ProcessingTime("10 seconds")).start()
val sq = df2.writeStream.format("console").queryName("sq")
  .option("truncate", "false").trigger(Trigger.ProcessingTime("20 seconds")).start()

spark.streams.awaitAnyTermination()