
Only one file to HDFS from Kafka with Flume

I'm trying to put data into HDFS from Kafka via Flume. The Kafka producer sends a message every 10 seconds, and I'd like to collect all the messages in one file on HDFS. This is the Flume configuration I used, but it stores many files on HDFS (one file per message):

agent1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.kafka-source.zookeeperConnect = localhost:2181
agent1.sources.kafka-source.topic = prova
agent1.sources.kafka-source.groupId = flume
agent1.sources.kafka-source.channels = memory-channel
agent1.sources.kafka-source.interceptors = i1
agent1.sources.kafka-source.interceptors.i1.type = timestamp
agent1.sources.kafka-source.kafka.consumer.timeout.ms = 100
agent1.channels.memory-channel.type = memory
agent1.channels.memory-channel.capacity = 10000
agent1.channels.memory-channel.transactionCapacity = 1000
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://localhost:9000/input
agent1.sinks.hdfs-sink.hdfs.rollInterval = 5
agent1.sinks.hdfs-sink.hdfs.rollSize = 0
agent1.sinks.hdfs-sink.hdfs.rollCount = 0
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.channel = memory-channel
agent1.sources = kafka-source
agent1.channels = memory-channel
agent1.sinks = hdfs-sink

PS: I start from a file.csv. The Kafka producer takes the file, selects some fields of interest, and then sends the entries one at a time, every 10 seconds. Flume stores the entries on HDFS, but in many files (1 entry = 1 file). I would like all the entries to end up in one single file. How do I have to change the Flume configuration?
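For context, a minimal sketch of such a producer using the kafka-python client could look like the following. The broker address, the CSV file name and the choice of fields are assumptions for illustration; only the prova topic comes from the configuration above.

import csv
import time

from kafka import KafkaProducer

# Assumed broker address; adjust to the actual Kafka installation.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

with open("file.csv", newline="") as f:
    for row in csv.reader(f):
        # Keep only the fields of interest (here the first two columns, purely as an example).
        entry = ",".join(row[:2])
        producer.send("prova", entry.encode("utf-8"))
        producer.flush()
        time.sleep(10)  # one entry every 10 seconds, as described above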

It appears that Flume is indeed currently set up to create one file on HDFS for each input file.

As suggested here, you could deal with this by writing a periodic Pig (or MapReduce) job that takes all the input files and combines them.
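As a rough sketch of that periodic merge idea, the snippet below concatenates everything under /input into a single HDFS file. It uses the hdfs dfs -getmerge and -put shell commands rather than an actual Pig or MapReduce job, and the output path is an assumption.

import subprocess

LOCAL_TMP = "/tmp/merged.csv"

# Pull all part files from /input and concatenate them into one local file.
subprocess.run(["hdfs", "dfs", "-getmerge", "/input", LOCAL_TMP], check=True)

# Push the concatenated file back to HDFS, overwriting any previous merge.
subprocess.run(["hdfs", "dfs", "-put", "-f", LOCAL_TMP, "/merged/all_entries.csv"], check=True)

Such a script could be run from cron at whatever interval suits the ingestion rate; note that -getmerge copies the data through the local filesystem, which is fine for small volumes like this one.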

An additional option to reduce the number of files may be to reduce the frequency of inbound files.

Set rollInterval to 0, since you don't want to roll to a new file based on time. If you want to roll based on the number of entries or events, change the rollCount value. For example, if you want to save 10 events or entries in one single file:

agent1.sinks.hdfs-sink.hdfs.rollInterval = 0
agent1.sinks.hdfs-sink.hdfs.rollSize = 0
agent1.sinks.hdfs-sink.hdfs.rollCount = 10
