Flume HDFS sink only stores one line of the data when using a netcat source
I am trying to load data into HDFS using Flume 1.7. I created the following configuration:
# Starting with: /opt/flume/bin/flume-ng agent -n Agent -c conf -f /opt/flume/conf/test.conf
# Naming the components on the current agent
Agent.sources = Netcat
Agent.channels = MemChannel
Agent.sinks = LoggerSink hdfs-sink LocalOut
# Describing/Configuring the source
Agent.sources.Netcat.type = netcat
Agent.sources.Netcat.bind = 0.0.0.0
Agent.sources.Netcat.port = 56565
# Describing/Configuring the sink
Agent.sinks.LoggerSink.type = logger
# Define a sink that outputs to hdfs.
Agent.sinks.hdfs-sink.type = hdfs
Agent.sinks.hdfs-sink.hdfs.path = hdfs://<<IP of HDFS node>>:8020/user/admin/flume_folder/%y-%m-%d/%H%M/
Agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
Agent.sinks.hdfs-sink.hdfs.fileType = DataStream
Agent.sinks.hdfs-sink.hdfs.writeFormat = Text
Agent.sinks.hdfs-sink.hdfs.batchSize = 100
Agent.sinks.hdfs-sink.hdfs.rollSize = 0
Agent.sinks.hdfs-sink.hdfs.rollCount = 0
Agent.sinks.hdfs-sink.hdfs.rollInterval = 0
Agent.sinks.hdfs-sink.hdfs.idleTimeout = 0
# Writes input to the local filesystem
#http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
Agent.sinks.LocalOut.type = file_roll
Agent.sinks.LocalOut.sink.directory = /tmp/flume
Agent.sinks.LocalOut.sink.rollInterval = 0
# Describing/Configuring the channel
Agent.channels.MemChannel.type = memory
Agent.channels.MemChannel.capacity = 1000
Agent.channels.MemChannel.transactionCapacity = 100
# Bind the source and sink to the channel
Agent.sources.Netcat.channels = MemChannel
Agent.sinks.LoggerSink.channel = MemChannel
Agent.sinks.hdfs-sink.channel = MemChannel
Agent.sinks.LocalOut.channel = MemChannel
After that, I used netcat to send the following file to the source:
cat textfile.csv | nc <IP of flume agent> 56565
The file contains the following entries:
Name1,1
Name2,2
Name3,3
Name4,4
Name5,5
Name6,6
Name7,7
Name8,8
Name9,9
Name10,10
Name11,11
Name12,12
Name13,13
Name14,14
Name15,15
Name16,16
Name17,17
Name18,18
Name19,19
Name20,20
...
Name490,490
Name491,491
Name492,492
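The problem I am facing is that Flume writes to HDFS without any error, but only one line of the transferred file ends up there. If I push the file to the source with netcat several times, Flume sometimes writes more than one file to HDFS, containing more than one line, but seldom all lines.
I tried changing the HDFS sink's rollSize, batchSize and other parameters, but that did not really change the behavior.
The sink to the local file, which is also configured, works fine.
Does anybody know how to configure Flume so that all entries are written to HDFS without losing any?
Thank you for your help.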
Update 1.12.2016
I removed all sinks except the HDFS sink and changed some parameters. After that, the HDFS sink behaves as expected.
Here is the configuration:
# Naming the components on the current agent
Agent.sources = Netcat
Agent.channels = MemChannel
Agent.sinks = hdfs-sink
# Describing/Configuring the source
Agent.sources.Netcat.type = netcat
Agent.sources.Netcat.bind = 0.0.0.0
Agent.sources.Netcat.port = 56565
# Define a sink that outputs to hdfs.
Agent.sinks.hdfs-sink.type = hdfs
Agent.sinks.hdfs-sink.hdfs.path = hdfs://<<IP of HDFS node>>/user/admin/flume_folder/%y-%m-%d/%H%M/
Agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
Agent.sinks.hdfs-sink.hdfs.fileType = DataStream
Agent.sinks.hdfs-sink.hdfs.writeFormat = Text
Agent.sinks.hdfs-sink.hdfs.batchSize = 100
Agent.sinks.hdfs-sink.hdfs.rollSize = 0
Agent.sinks.hdfs-sink.hdfs.rollCount = 100
# Describing/Configuring the channel
Agent.channels.MemChannel.type = memory
Agent.channels.MemChannel.capacity = 1000
Agent.channels.MemChannel.transactionCapacity = 100
# Bind the source and sink to the channel
Agent.sources.Netcat.channels = MemChannel
Agent.sinks.hdfs-sink.channel = MemChannel
Does anybody know why it works with this configuration but stops working as soon as there are two or more sinks?
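I found the solution myself. As I understand it, the problem was that all sinks were reading from the same channel. An event taken from a channel is delivered to only one sink, so the sinks competed for events: the faster sinks took most of the entries and only some of them reached the HDFS sink.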
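After giving each sink its own channel and fanning the source out to those channels with the parameter
Agent.sources.Netcat.selector.type = replicating
Flume writes to both the local file and HDFS as expected.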
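The answer above does not show the resulting configuration, so the following is only a sketch of what it might look like. The channel names HdfsChannel and LocalChannel are made up for illustration, the LoggerSink is left out for brevity, and the sink-specific settings are assumed to stay as in the first configuration.

```properties
# Sketch only: the channel names HdfsChannel and LocalChannel are hypothetical
Agent.sources = Netcat
Agent.channels = HdfsChannel LocalChannel
Agent.sinks = hdfs-sink LocalOut

# The source replicates every event into all channels listed below
Agent.sources.Netcat.type = netcat
Agent.sources.Netcat.bind = 0.0.0.0
Agent.sources.Netcat.port = 56565
Agent.sources.Netcat.selector.type = replicating
Agent.sources.Netcat.channels = HdfsChannel LocalChannel

# One memory channel per sink, so the sinks no longer compete for events
Agent.channels.HdfsChannel.type = memory
Agent.channels.HdfsChannel.capacity = 1000
Agent.channels.HdfsChannel.transactionCapacity = 100
Agent.channels.LocalChannel.type = memory
Agent.channels.LocalChannel.capacity = 1000
Agent.channels.LocalChannel.transactionCapacity = 100

# Sink types as in the original configuration; the remaining
# hdfs.* and sink.* settings are assumed to be unchanged
Agent.sinks.hdfs-sink.type = hdfs
Agent.sinks.LocalOut.type = file_roll

# Each sink reads from its own channel
Agent.sinks.hdfs-sink.channel = HdfsChannel
Agent.sinks.LocalOut.channel = LocalChannel
```

With this layout each sink sees its own copy of every event, at the cost of holding each event in memory once per channel.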