
Persisting Spark Streaming output

I'm collecting data from a messaging app. I'm currently using Flume, which sends approximately 50 million records per day.

I'd like to use Kafka, consume from it with Spark Streaming, persist the data to Hadoop, and query it with Impala.

I'm having issues with each approach I've tried.

Approach 1 - Save the RDD as Parquet and point an external Hive Parquet table at the Parquet directory

// scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sparkConf, Seconds(bucketsize.toInt))
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
lines.foreachRDD(rdd => {

    // 1 - Create a SchemaRDD object from the RDD and specify the schema
    val SchemaRDD1 = sqlContext.jsonRDD(rdd, schema)

    // 2 - Register it as a Spark SQL table
    SchemaRDD1.registerTempTable("sparktable")

    // 3 - Query sparktable to produce another SchemaRDD, 'finalParquet', of the data needed, and persist it as Parquet files
    val finalParquet = sqlContext.sql(sql)
    finalParquet.saveAsParquetFile(dir)
})

The problem is that finalParquet.saveAsParquetFile outputs a huge number of files: the DStream received from Kafka produces over 200 files per one-minute batch. The reason it outputs so many files is that the computation is distributed, as explained in another post - how to make saveAsTextFile NOT split output into multiple file?

However, the proposed solutions don't seem optimal for me. For example, as one user states, "Having a single output file is only a good idea if you have very little data."
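For illustration, that single-file suggestion amounts to roughly the following change to the code above (same sqlContext, sql and dir as before; it assumes a Spark version where coalesce is available on the query result):

// the single-output-file idea applied to the snippet above: coalesce(1)
// yields one Parquet file per batch, but the whole write for a batch
// then runs in a single task
val finalParquet = sqlContext.sql(sql).coalesce(1)
finalParquet.saveAsParquetFile(dir)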

Approach 2 - Use HiveContext to insert the RDD data directly into a Hive table

# python
from pyspark import StorageLevel
from pyspark.sql import HiveContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sqlContext = HiveContext(sc)
ssc = StreamingContext(sc, int(batch_interval))
kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topics: 1})
lines = kvs.map(lambda x: x[1]).persist(StorageLevel.MEMORY_AND_DISK_SER)

def sendRecord(rdd):
    sql = "INSERT INTO TABLE table select * from beacon_sparktable"

    # 1 - Apply the schema to the RDD, creating a DataFrame 'beaconDF'
    beaconDF = sqlContext.jsonRDD(rdd, schema)

    # 2 - Register the DataFrame as a Spark SQL table
    beaconDF.registerTempTable("beacon_sparktable")

    # 3 - Insert into Hive directly from a query on the Spark SQL table
    sqlContext.sql(sql)

lines.foreachRDD(sendRecord)

This works fine: it inserts directly into a Parquet table, but there are scheduling delays for the batches because the processing time exceeds the batch interval. The consumer can't keep up with what's being produced, and the batches to be processed begin to queue up.

It seems that writing to Hive is slow. I've tried adjusting the batch interval size and running more consumer instances.

In summary

What is the best way to persist big data from Spark Streaming, given the issues with multiple files and the potential latency of writing to Hive? What are other people doing?

A similar question has been asked here, but he has an issue with directories as opposed to too many files: How to make Spark Streaming write its output so that Impala can read it?

Many thanks for any help.

In solution #2, the number of files created can be controlled via the number of partitions of each RDD.

See this example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// create a Hive table (or assume it already exists)
sqlContext.sql("CREATE TABLE test (id int, txt string) STORED AS PARQUET")

// create an RDD with 2 records and only 1 partition
val rdd = sc.parallelize(List( List(1, "hello"), List(2, "world") ), 1)

// create a DataFrame from the RDD
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("txt", StringType, nullable = false)
))
val df = sqlContext.createDataFrame(rdd.map( Row(_:_*) ), schema)

// this creates a single file, because the RDD has 1 partition
df.write.mode("append").saveAsTable("test")

Now, I guess you can play with the frequency at which you pull data from Kafka, and with the number of partitions of each RDD (by default, the number of partitions of your Kafka topic, which you can reduce by repartitioning).
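Applied to your streaming job, that could look roughly like this (a sketch reusing the lines DStream, sqlContext and schema from your question; the target of 2 partitions is only an illustrative value, and "your_table" stands in for your Hive table name):

lines.foreachRDD { rdd =>
  // shrink each micro-batch to 2 partitions before converting it, so the
  // append writes 2 files per batch instead of one per Kafka partition
  val df = sqlContext.jsonRDD(rdd.coalesce(2), schema)
  df.write.mode("append").saveAsTable("your_table")
}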

I'm using Spark 1.5 from CDH 5.5.1, and I get the same result using either df.write.mode("append").saveAsTable("test") or your SQL string.
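For reference, the SQL-string route looks roughly like this (a sketch; it assumes sqlContext is a HiveContext, as in your Approach 2, and "test_staging" is just a hypothetical temp-table name):

// the SQL-string equivalent of the saveAsTable("test") call above
df.registerTempTable("test_staging")
sqlContext.sql("INSERT INTO TABLE test SELECT * FROM test_staging")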

I think the small-file problem can be resolved somewhat. You may be getting a large number of files based on the number of Kafka partitions. For me, I have a 12-partition Kafka topic and I write using spark.write.mode("append").parquet("/location/on/hdfs").

Now, depending on your requirements, you can add coalesce(1) or more to reduce the number of files. Another option is to increase the micro-batch duration. For example, if you can accept a 5-minute delay in the data being written, you can use a 300-second micro-batch.
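Roughly like this (a sketch; sparkConf, zkQuorum, group, topicMap, sqlContext and schema are the values from your question, while the 300-second interval, coalesce(1) and the HDFS path are only example values, assuming Spark 1.4+ where coalesce is available on a DataFrame):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sparkConf, Seconds(300))   // 5-minute micro-batches
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

lines.foreachRDD { rdd =>
  // one file appended per micro-batch; raise the coalesce target if a
  // single write task per batch becomes the bottleneck
  sqlContext.jsonRDD(rdd, schema)
    .coalesce(1)
    .write.mode("append").parquet("/location/on/hdfs")
}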

For the second issue, the batches queue up only because you don't have backpressure enabled. First, you should verify the maximum you can process in a single batch. Once you know that number, you can set the spark.streaming.kafka.maxRatePerPartition value and spark.streaming.backpressure.enabled=true to limit the number of records per micro-batch. If you still cannot meet the demand, the only options are to increase the partitions on the topic or to increase the resources of the Spark application.
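For example, these could be set when building the streaming context (a sketch; the application name and the 1000 records-per-partition-per-second cap are just example values to tune against the per-batch throughput you measured):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("kafka-to-parquet")                            // example name
  .set("spark.streaming.backpressure.enabled", "true")       // let Spark throttle ingestion
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // cap on records per partition per second (applies to the direct Kafka stream)
val ssc = new StreamingContext(sparkConf, Seconds(300))      // micro-batch interval, as above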
