
Spark Structured Streaming join of a CSV file stream and a rate stream takes too much time per batch

I have a rate stream and a CSV file stream joined on the rate value and the CSV file's id column:

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, LongType, StructField, StructType}

def readFromCSVFile(path: String)(implicit spark: SparkSession): DataFrame = {
  val schema = StructType(
    StructField("id", LongType, nullable = false) ::
    StructField("value1", LongType, nullable = false) ::
    StructField("another", DoubleType, nullable = false) :: Nil)

  // Note: this local SparkSession shadows the implicit one passed as a parameter.
  val spark: SparkSession = SparkSession
    .builder
    .master("local[1]")
    .config(new SparkConf().setIfMissing("spark.master", "local[1]")
      .set("spark.eventLog.dir", "file:///tmp/spark-events"))
    .getOrCreate()

  spark
    .readStream
    .format("csv")
    .option("header", value = true)
    .schema(schema)
    .option("delimiter", ",")
    .option("maxFilesPerTrigger", 1)
    //.option("inferSchema", value = true)
    .load(path)
}

   val rate = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 1)
      .option("numPartitions", 10)
      .load()
      .withWatermark("timestamp", "1 seconds")

    val cvsStream = readFromCSVFile(tmpPath.toString)
    val cvsStream2 = cvsStream.as("csv")
      .join(rate.as("counter"))
      .where("csv.id == counter.value")
      .withWatermark("timestamp", "1 seconds")

    cvsStream2
      .writeStream
      .trigger(Trigger.ProcessingTime(10)) // 10 milliseconds
      .format("console")
      .option("truncate", "false")
      .queryName("kafkaDataGenerator")
      .start().awaitTermination(300000)

The CSV file is only 6 lines long, but processing one batch takes about 100 s:

2021-10-15 23:21:29 WARN  ProcessingTimeExecutor:69 - Current batch is falling behind. The trigger interval is 10 milliseconds, but spent 92217 milliseconds
-------------------------------------------
Batch: 1
-------------------------------------------
+---+------+-------+-----------------------+-----+
|id |value1|another|timestamp              |value|
+---+------+-------+-----------------------+-----+
|6  |2     |3.0    |2021-10-15 20:20:02.507|6    |
|5  |2     |2.0    |2021-10-15 20:20:01.507|5    |
|1  |1     |1.0    |2021-10-15 20:19:57.507|1    |
|3  |1     |3.0    |2021-10-15 20:19:59.507|3    |
|2  |1     |2.0    |2021-10-15 20:19:58.507|2    |
|4  |2     |1.0    |2021-10-15 20:20:00.507|4    |
+---+------+-------+-----------------------+-----+

How can I optimize the join operation to process this batch faster? It shouldn't require that much computation, so it looks like there is some kind of hidden watermarking or something else making the batch wait for about 100 s. What kind of options/properties can be applied?

I would suggest that you don't have enough data to look into performance yet. Why don't you crank the data up to 500,000 rows and see if you have an issue? Right now I'm concerned that you aren't running enough data to exercise the performance of your system effectively, and that the startup costs aren't being amortized by the volume of data.
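
As a rough sketch only (the output path and the value formulas are made up for illustration, and `spark` is assumed to be the existing SparkSession), one way to generate a 500,000-row CSV matching the question's schema so the fixed per-batch overhead is amortized over more data:

import spark.implicits._

// Hypothetical generator: writes id, value1, another columns with a header,
// using the "," delimiter expected by readFromCSVFile above.
val n = 500000L
spark.range(1, n + 1)
  .select(
    $"id",
    ($"id" % 3 + 1).as("value1"),
    ($"id" % 7).cast("double").as("another"))
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("/tmp/generator-test")   // assumed output directory
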

What dramatically improved the performance? Using spark.read instead of spark.readStream and persisting the DataFrame in memory:

import org.apache.spark.storage.StorageLevel

val dataFrameToBeReturned = spark.read
  .format("csv")
  .schema(schema)
  .option("delimiter", ";")
  .option("maxFilesPerTrigger", 1)
  .csv("hdfs://" + hdfsLocation + homeZeppelinPrefix + "/generator/" + shortPath)
  .persist(StorageLevel.MEMORY_ONLY_SER)
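
For completeness, a minimal sketch (variable names and trigger interval are assumptions, not taken from the original post) of how such a persisted static DataFrame can then be joined with the rate stream as a stream-static join, so the CSV is no longer re-scanned as a file stream on every micro-batch:

import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.streaming.Trigger

// Assumes `spark` and the persisted `dataFrameToBeReturned` from above.
val rateStream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1)
  .load()

// Stream-static join: only the streaming (rate) side is re-evaluated per micro-batch.
val joined = rateStream.as("counter")
  .join(dataFrameToBeReturned.as("csv"), expr("csv.id == counter.value"))

joined.writeStream
  .format("console")
  .option("truncate", "false")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
  .awaitTermination()
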
