
What is the most efficient way to write from Kafka to HDFS with files partitioned into dates?

I'm working on a project that should write via Kafka to HDFS. Suppose there is an online server that writes messages into Kafka. Each message includes a timestamp. I want to create a job whose output will be a file (or files) according to the timestamps in the messages. For example, if the data in Kafka is

 {"ts":"01-07-2013 15:25:35.994", "data": ...}
 ...    
 {"ts":"01-07-2013 16:25:35.994", "data": ...}
 ... 
 {"ts":"01-07-2013 17:25:35.994", "data": ...}

I would like to get these 3 files as output:

  kafka_file_2013-07-01_15.json
  kafka_file_2013-07-01_16.json
  kafka_file_2013-07-01_17.json 

And of course, if I run this job again and there are new messages in the queue like

 {"ts":"01-07-2013 17:25:35.994", "data": ...}

It should create the file

  kafka_file_2013-07-01_17_2.json // second  chunk of hour 17
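
To make the requirement concrete, here is a minimal sketch in plain Java of how a timestamp in that format maps to the hourly file names above; the class and method names (HourlyBucket, hourlyFileName) are made up for illustration, and the chunk number would come from whatever per-run bookkeeping the job keeps:

    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;

    public class HourlyBucket {
        // Incoming timestamps look like "01-07-2013 17:25:35.994" (dd-MM-yyyy)
        private static final DateTimeFormatter IN =
                DateTimeFormatter.ofPattern("dd-MM-yyyy HH:mm:ss.SSS");
        // Output files are named kafka_file_<yyyy-MM-dd>_<HH>[_<chunk>].json
        private static final DateTimeFormatter OUT =
                DateTimeFormatter.ofPattern("yyyy-MM-dd_HH");

        static String hourlyFileName(String ts, int chunk) {
            String bucket = LocalDateTime.parse(ts, IN).format(OUT);
            return chunk <= 1
                    ? "kafka_file_" + bucket + ".json"
                    : "kafka_file_" + bucket + "_" + chunk + ".json";
        }

        public static void main(String[] args) {
            // prints kafka_file_2013-07-01_17.json
            System.out.println(hourlyFileName("01-07-2013 17:25:35.994", 1));
            // prints kafka_file_2013-07-01_17_2.json
            System.out.println(hourlyFileName("01-07-2013 17:25:35.994", 2));
        }
    }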

I've seen some open source projects, but most of them just read from Kafka into some HDFS folder. What is the best solution/design/open source project for this problem?

You should definitely check out the Camus API implementation from LinkedIn. Camus is LinkedIn's Kafka->HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka. Check out this post I have written for a simple example which fetches from a Twitter stream and writes to HDFS based on tweet timestamps.

The project is available on GitHub at https://github.com/linkedin/camus

Camus needs two main components for reading and decoding data from Kafka and writing data to HDFS:

Decoding messages read from Kafka

Camus has a set of Decoders which help in decoding messages coming from Kafka. Decoders basically extend com.linkedin.camus.coders.MessageDecoder, which implements the logic to partition data based on timestamp. A set of predefined Decoders is present in this directory, and you can write your own based on these: camus/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders/
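
For reference, here is a minimal sketch of such a decoder for the JSON messages in the question, modeled loosely on the bundled JsonStringMessageDecoder; the MessageDecoder<byte[], String> generic signature and the CamusWrapper(record, timestamp) constructor are assumptions based on that class and may differ between Camus versions:

    import java.nio.charset.StandardCharsets;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;

    import com.google.gson.JsonParser;
    import com.linkedin.camus.coders.CamusWrapper;
    import com.linkedin.camus.coders.MessageDecoder;

    // Sketch only: extracts the "ts" field from each JSON message and hands
    // Camus the timestamp it should use when partitioning output by time.
    public class TsJsonMessageDecoder extends MessageDecoder<byte[], String> {
        private final SimpleDateFormat format =
                new SimpleDateFormat("dd-MM-yyyy HH:mm:ss.SSS");

        @Override
        public CamusWrapper<String> decode(byte[] payload) {
            String record = new String(payload, StandardCharsets.UTF_8);
            long timestamp;
            try {
                String ts = new JsonParser().parse(record)
                        .getAsJsonObject().get("ts").getAsString();
                timestamp = format.parse(ts).getTime();
            } catch (ParseException e) {
                timestamp = System.currentTimeMillis(); // fall back to ingest time
            }
            return new CamusWrapper<String>(record, timestamp);
        }
    }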

Writing messages to HDFS

Camus needs a set of RecordWriterProvider classes which extend com.linkedin.camus.etl.RecordWriterProvider and tell Camus what payload should be written to HDFS. A set of predefined RecordWriterProviders is present in this directory, and you can write your own based on these:

camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/common

If you're looking for a more real-time approach, you should check out StreamSets Data Collector. It's also an Apache-licensed open source tool for ingest.

The HDFS destination is configurable to write to time-based directories based on the template you specify. It already includes a way to specify a field in your incoming messages to use to determine the time a message should be written. The config is called "Time Basis" and you can specify something like ${record:value("/ts")}.

*Full disclosure: I'm an engineer working on this tool.

Check this out for continuous ingestion from Kafka to HDFS. Since it depends on Apache Apex, it has the guarantees Apex provides.

https://www.datatorrent.com/apphub/kafka-to-hdfs-sync/

If you are using Apache Kafka 0.9 or above, you can use the Kafka Connect API.

Check out https://github.com/confluentinc/kafka-connect-hdfs

This is a Kafka connector for copying data between Kafka and HDFS.
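
As an illustration, a standalone sink configuration along the following lines would write JSON records into hourly HDFS directories keyed off the ts field of each message. This is a hedged sketch: the topic name and URLs are placeholders, and property names such as timestamp.extractor/timestamp.field and the format/partitioner classes vary between connector versions, so check the docs of the version you deploy:

    # Sketch of a kafka-connect-hdfs sink config; the classes and values below
    # assume a recent connector release and are placeholders/assumptions.
    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics=events
    hdfs.url=hdfs://namenode:8020
    flush.size=1000
    format.class=io.confluent.connect.hdfs.json.JsonFormat
    # bucket records into one directory per hour...
    partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
    partition.duration.ms=3600000
    path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
    locale=en-US
    timezone=UTC
    # ...using the "ts" field inside the record rather than wall-clock time
    timestamp.extractor=RecordField
    timestamp.field=ts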

Check out Camus: https://github.com/linkedin/camus

This will write data in Avro format, though... other RecordWriters are pluggable.
