
Streaming data store in Hive using Spark

I am creating an application in which streaming data goes into Kafka and then into Spark. Spark consumes the data, applies some logic, and then saves the processed data into Hive. The velocity of the data is very high; I am getting about 50K records per minute. Spark Streaming uses a 1-minute window in which it processes the data and saves it into Hive.

My question is: is this architecture fine for production? If yes, how can I save the streaming data into Hive? What I am doing is creating a DataFrame from the 1-minute window data and saving it into Hive by using

results.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("stocks")
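Roughly, the whole job I have in mind looks like the sketch below (Structured Streaming with foreachBatch; the broker address, topic name, checkpoint path, and the placeholder processing step are just assumptions, and the stocks table is assumed to already exist in Hive with a matching schema):

    import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder()
      .appName("kafka-to-hive")
      .enableHiveSupport()                      // so insertInto targets the Hive metastore
      .getOrCreate()

    // Raw stream from Kafka (broker address and topic name are placeholders)
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "stocks")
      .load()

    // The actual processing logic goes here; casting value to string is only a placeholder
    val results = raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")

    // A micro-batch is triggered every minute and appended to the existing Hive table
    val query = results.writeStream
      .trigger(Trigger.ProcessingTime("1 minute"))
      .option("checkpointLocation", "/tmp/checkpoints/stocks")
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.write.mode(SaveMode.Append).insertInto("stocks")
      }
      .start()

    query.awaitTermination()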

I have not created the pipeline yet. Is this fine, or do I have to modify the architecture?

Thanks

I would give it a try!

BUT Kafka -> Spark -> Hive is not the optimal pipeline for your use case.

  1. Hive is normally based on HDFS, which is not designed for frequent small inserts/updates/selects. So your plan can end up with the following problems:
    • many small files, which ends up in bad performance
    • your window gets too small because the writes take too long

Suggestion:

Option 1: use Kafka just as a buffer queue and design your pipeline like Kafka -> HDFS (e.g. with Spark or Flume) -> batch Spark into a Hive/Impala table
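A rough sketch of what Option 1 could look like with Spark (broker, topic, paths, and the schedule are placeholders; the landing job and the batch load would be two separately submitted jobs):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    // Job 1 (long-running): land the raw Kafka stream on HDFS as Parquet files.
    // Broker, topic, and paths are placeholders.
    def landKafkaToHdfs(spark: SparkSession): Unit = {
      spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "stocks")
        .load()
        .selectExpr("CAST(value AS STRING) AS value", "timestamp")
        .writeStream
        .format("parquet")
        .option("path", "/data/landing/stocks")
        .option("checkpointLocation", "/data/landing/_checkpoints/stocks")
        .start()
        .awaitTermination()
    }

    // Job 2 (scheduled batch, e.g. hourly): move the landed files into the Hive/Impala
    // table in fewer, larger files. In practice you would track which files or
    // partitions have already been loaded instead of re-reading the whole directory.
    def loadLandingIntoHive(spark: SparkSession): Unit = {
      spark.read.parquet("/data/landing/stocks")
        .repartition(4)                         // control the number of output files
        .write.mode(SaveMode.Append)
        .insertInto("stocks")
    }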

Option 2:

  • Kafka -> Flume/Spark into HBase/Kudu -> batch Spark into Hive/Impala

Option 1 has no "realtime" analysis option; it depends on how often you run the batch Spark job.

Option 2 is a good choice that I would recommend: store something like 30 days in HBase and all older data in Hive/Impala. With a view you will be able to join new and old data for realtime analysis. Kudu makes the architecture even easier.
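For the Kudu flavour of Option 2, a minimal sketch of the streaming write, assuming the kudu-spark connector is on the classpath, the Kudu table already exists, and spark/results are the session and streaming DataFrame from the sketch in the question above (the periodic batch job that moves data older than 30 days into Hive/Impala is not shown):

    import org.apache.kudu.spark.kudu.KuduContext
    import org.apache.spark.sql.DataFrame

    // Kudu master address and table name are placeholders; the Kudu table must already exist.
    val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

    val toKudu = results.writeStream
      .option("checkpointLocation", "/tmp/checkpoints/stocks_kudu")
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // upsert keeps the write idempotent if a micro-batch is ever replayed
        kuduContext.upsertRows(batch, "impala::default.stocks_recent")
      }
      .start()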

Saving data into Hive tables can be tricky if you want to partition it and use it via HiveQL.

But basically it would work like the following:

xml.write.format("parquet").mode("append").saveAsTable("test_ereignis_archiv")
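For a partitioned table, one hedged variant of the same call, assuming the DataFrame carries a date column to partition by:

    // Assumes the DataFrame has a date column "dt" to partition by
    // (the column name is a placeholder, not from the original answer).
    xml.write
      .format("parquet")
      .partitionBy("dt")
      .mode("append")
      .saveAsTable("test_ereignis_archiv")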

BR
