Spark batch-streaming of kafka into single file

I am streaming data from Kafka using batch streaming (maxRatePerPartition = 10000). So in each batch I process 10,000 Kafka messages.

Within each batch run I process the messages by creating a DataFrame out of the RDD. After processing, I save each processed record to the same file using dataFrame.write.mode(SaveMode.Append), so it appends all messages of a batch to the same file.

This is fine as long as it is running within one batch run. But after the next batch run is executed (the next 10,000 messages are processed), it creates a new file for those 10,000 messages.

The problem is: each file (block) reserves 50 MB of the file system but only contains around 1 MB (10,000 messages). Instead of creating a new file on each batch run, I would prefer to have everything appended to one file as long as it does not exceed 50 MB.

Do you know how to do this, or why it is not working in my example? You can have a look at my code here:

import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.immutable.Set


object SparkStreaming extends Constants {

  def main(args: Array[String]) {

    //create a new Spark configuration...
    val conf = new SparkConf()
      .setMaster("local[2]") // ...using 2 cores
      .setAppName("Streaming")
      .set("spark.streaming.kafka.maxRatePerPartition", "10000") //... processing max. 10000 messages per second

    //create a streaming context for micro batches
    val ssc = new StreamingContext(conf, Seconds(1)) //Note: processing max. 1*10000 messages (see config above)

    //set up the Kafka DStream
    val kafkaParams = Map("metadata.broker.list" -> "sandbox.hortonworks.com:6667",
      "auto.offset.reset" -> "smallest") //start from the beginning
    val kafkaTopics = Set(KAFKA_TOPIC_PARQUET)

    val directKafkaStream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc,
      kafkaParams, kafkaTopics)

    val records = directKafkaStream.map(source => StreamingFunctions.transformAvroSource(source))

    records.foreachRDD((rdd: RDD[TimeseriesRddRecord], time: Time) => {
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext) // worker node singleton
      import sqlContext.implicits._

      val dataFrame = rdd.toDF()

      dataFrame.write.mode(SaveMode.Append).partitionBy(PARQUET_PARTITIONBY_COLUMNS: _*).parquet(PARQUET_FILE_PATH_TIMESERIES_LOCAL)
      println(s"Written entries: ${dataFrame.count()}")
    })

    //start streaming until the process is killed
    ssc.start()
    ssc.awaitTermination()
  }


  /** Case class for converting RDD to DataFrame */
  case class DataFrameRecord(thingId: String, timestamp: Long, propertyName: String, propertyValue: Double)


  /** Lazily instantiated singleton instance of SQLContext */
  object SQLContextSingleton {

    @transient private var instance: SQLContext = _

    def getInstance(sparkContext: SparkContext): SQLContext = {
      if (instance == null) {
        instance = new SQLContext(sparkContext)
      }
      instance
    }
  }

}

I would be happy to get your thoughts on that. Thanks, Alex

This can be accomplished by using the coalesce function and then overwriting the existing file.
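A minimal sketch of that idea, assuming Spark 1.x and the SQLContext used in the question (the helper name, the paths and the temporary directory are assumptions, not part of the original code):

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}

// Hypothetical helper: merge the new batch into the Parquet data written so far
// and rewrite everything as a single file instead of adding a new part file.
def mergeIntoSingleFile(sqlContext: SQLContext,
                        newData: DataFrame,
                        targetPath: String,
                        tmpPath: String): Unit = {
  // union the existing data with the new batch (unionAll in the Spark 1.x API)
  val combined = sqlContext.read.parquet(targetPath).unionAll(newData)

  // coalesce to one partition so only a single file is produced,
  // writing it to a temporary location first...
  combined.coalesce(1).write.mode(SaveMode.Overwrite).parquet(tmpPath)

  // ...then read the merged result back and overwrite the target
  sqlContext.read.parquet(tmpPath)
    .coalesce(1)
    .write.mode(SaveMode.Overwrite)
    .parquet(targetPath)
}

Note that the final overwrite step is exactly where an interrupted run can lose data, which is the problem discussed next.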

But as discussed in the thread "Spark coalesce looses file when program is aborted", this leads to errors when the program is interrupted.

So for the time being this does not seem to be a sufficient way to implement such logic.
