
Too many small files with the Flume HDFS Sink

agent.sinks=hpd
agent.sinks.hpd.type=hdfs
agent.sinks.hpd.channel=memoryChannel
agent.sinks.hpd.hdfs.path=hdfs://master:9000/user/hduser/gde
agent.sinks.hpd.hdfs.fileType=DataStream
agent.sinks.hpd.hdfs.writeFormat=Text
agent.sinks.hpd.hdfs.rollSize=0
agent.sinks.hpd.hdfs.batchSize=1000
agent.sinks.hpd.hdfs.fileSuffix=.i  
agent.sinks.hpd.hdfs.rollCount=1000
agent.sinks.hpd.hdfs.rollInterval=0

I'm trying to use the HDFS Sink to write events to HDFS. I have tried size-, count-, and time-based rolling, but none of them works as expected. The sink generates too many small files in HDFS, like:

-rw-r--r--   2 hduser supergroup      11617 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832879.i
-rw-r--r--   2 hduser supergroup       1381 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832880.i
-rw-r--r--   2 hduser supergroup        553 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832881.i
-rw-r--r--   2 hduser supergroup       2212 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832882.i
-rw-r--r--   2 hduser supergroup       1379 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832883.i
-rw-r--r--   2 hduser supergroup       2762 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832884.i.tmp

Please help me resolve this problem. I'm using Flume 1.6.0.

~Thanks

You are currently rolling a new file for every 1000 events. You can try either of the two methods below.

  1. Increase hdfs.rollCount to a much higher value; this value determines the number of events contained in each rolled file.
  2. Remove hdfs.rollCount and set hdfs.rollInterval to the interval at which you want to roll your files, e.g. hdfs.rollInterval = 600 to roll a file every 10 minutes.
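The second option above could look like the config sketch below, reusing the sink name from the question. Note that in Flume setting hdfs.rollCount=0 disables count-based rolling explicitly, which is equivalent to relying only on the time-based trigger; the 600-second interval is just an example value.

```
# Roll purely by time: disable size- and count-based rolling
agent.sinks.hpd.hdfs.rollSize=0
agent.sinks.hpd.hdfs.rollCount=0
# Roll the current file every 10 minutes
agent.sinks.hpd.hdfs.rollInterval=600
```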

For more information, refer to the Flume documentation.

My provided configurations were all correct. The reason behind this behavior was HDFS itself: I had two data nodes, one of which was down, so files were not achieving the minimum required replication. In the Flume logs you can also see the warning message below:

"Block Under-replication detected. Rotating file."

To remove this problem, you can opt for either of the solutions below:

  • Bring the downed data node back up so blocks can achieve the required replication, or
  • Set the property hdfs.minBlockReplicas accordingly.
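The second option above is a one-line addition to the sink configuration. As a sketch, assuming only one data node is live (the value should not exceed the number of live data nodes):

```
# Accept a single replica so the sink stops rotating files
# when it detects block under-replication
agent.sinks.hpd.hdfs.minBlockReplicas=1
```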

~Thanks
