
Flume HDFS sink keeps rolling small files

I'm trying to stream Twitter data into HDFS using Flume and this example: https://github.com/cloudera/cdh-twitter-example/

Whatever I try here, it keeps creating files in HDFS that range in size from 1.5 kB to 15 kB, where I would like to see large files (64 MB). Here is the agent configuration:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxx
TwitterAgent.sources.Twitter.keywords = test

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost.localdomain:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 67108864
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.sinks.HDFS.hdfs.idleTimeout = 0

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
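
With rollCount and rollInterval both set to 0, the sink should only roll when a file reaches the 64 MB rollSize, so a quick sanity check is to list what actually lands in HDFS. A minimal check, assuming the path from the config above and an hdfs client on the PATH (the date path and file name below are just examples):

# List the files Flume wrote for a given hour and show their sizes in human-readable form
hdfs dfs -ls -h /user/flume/tweets/2014/01/01/09/

If the listing shows many files of a few kB each, the sink is rolling for some reason other than rollSize.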

EDIT: I looked into the log files and found this happening all the time:

9:11:27.526 AM WARN  org.apache.flume.sink.hdfs.BucketWriter  Block Under-replication detected. Rotating file.
9:11:37.036 AM ERROR org.apache.flume.sink.hdfs.BucketWriter  Hit max consecutive under-replication rotations (30); will not continue rolling files under this path due to under-replication

It seemed to be a problem with the HDFS replication factor. As I am working on a virtual machine with one virtual datanode, I had to set the replication factor to 1 in order for it to work as expected.
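
Before changing anything, you can confirm that the blocks Flume writes are in fact reported as under-replicated. A quick check, assuming the target directory from the question (any HDFS path works):

# Report files, blocks and their replication status under the Flume target directory
hdfs fsck /user/flume/tweets -files -blocks

# Optionally lower the replication of files already written (useful on a single-datanode VM)
hdfs dfs -setrep -w 1 /user/flume/tweets

If fsck reports "Under replicated" blocks under that path, the BucketWriter rotations in the log above are the cause of the small files.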

Set dfs.replication on your cluster to an appropriate value. This can be done by editing the hdfs-site.xml file (on all machines of the cluster). However, this is not enough.

You also need to create an hdfs-site.xml file on your Flume classpath and put the same dfs.replication value from your cluster in it. The Hadoop libraries look at this file while doing operations on the cluster; otherwise they use default values.

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>
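
One way to get this onto the Flume classpath, assuming a typical CDH-style layout (the paths below are assumptions, adjust them for your install), is to copy the cluster's client configuration into Flume's conf directory, which Flume adds to its classpath at startup:

# Copy the cluster client configuration next to flume.conf (paths are examples for a CDH-style install)
cp /etc/hadoop/conf/hdfs-site.xml /etc/flume-ng/conf/hdfs-site.xml

On the single-datanode VM from the question, the value would be 1 rather than the 2 shown above.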
