Flume agent: add host to message, then publish to a Kafka topic
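We have started consolidating event log data from our applications by publishing messages to a Kafka topic. Although we could write to Kafka directly from the application, we chose to treat this as a general problem and use a Flume agent. That gives us some flexibility: if we want to capture something else from a server, we can simply tail a different source and publish to a different Kafka topic.
We created a Flume agent configuration file that tails a log and publishes to a Kafka topic: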
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = exec
tier1.sources.source1.command = tail -F /var/log/some_log.log
tier1.sources.source1.channels = channel1
tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
tier1.sinks.sink1.topic = some_log
tier1.sinks.sink1.brokerList = hadoop01:9092,hadoop02.com:9092,hadoop03.com:9092
tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.batchSize = 20
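Unfortunately, the messages themselves do not identify the host that generated them. If we run the application on several hosts and an error occurs, we have no way of telling which host produced the message.
I noticed that if Flume wrote directly to HDFS, we could use a Flume interceptor to write to a host-specific HDFS location. We could probably do something similar with Kafka, that is, create a new topic for each server, but that could become unwieldy. We would end up with thousands of topics.
Can Flume append or include the hostname of the originating host when it publishes to a Kafka topic?
You can create a custom TCP source that reads the client address and adds it to the event headers.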
@Override
public void configure(Context context) {
    // "port" and "buffer" are read from the agent's properties file
    port = context.getInteger("port");
    buffer = context.getInteger("buffer");
    try {
        serverSocket = new ServerSocket(port);
        logger.info("FlumeTCP source initialized");
    } catch (Exception e) {
        logger.error("FlumeTCP source failed to initialize", e);
    }
}

@Override
public void start() {
    try {
        // Blocks until a single client connects
        clientSocket = serverSocket.accept();
        receiveBuffer = new BufferedReader(new InputStreamReader(clientSocket.getInputStream()));
        logger.info("Connection established with client : " + clientSocket.getRemoteSocketAddress());
        final ChannelProcessor channel = getChannelProcessor();
        // Every event from this connection carries the client's address in its headers
        final Map<String, String> headers = new HashMap<String, String>();
        headers.put("hostname", clientSocket.getRemoteSocketAddress().toString());
        String line;
        List<Event> events = new ArrayList<Event>();
        while ((line = receiveBuffer.readLine()) != null) {
            Event event = EventBuilder.withBody(line, Charset.defaultCharset(), headers);
            logger.info("Event created");
            events.add(event);
            if (events.size() >= buffer) {
                channel.processEventBatch(events);
                // Start a fresh batch so earlier events are not delivered again
                events = new ArrayList<Event>();
            }
        }
        // Deliver whatever is left when the client disconnects
        if (!events.isEmpty()) {
            channel.processEventBatch(events);
        }
    } catch (Exception e) {
        logger.error("FlumeTCP source failed while reading from client", e);
    }
    super.start();
}
flume-conf.properties can be configured like this:
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'agent'
agent.sources = CustomTcpSource
agent.channels = memoryChannel
agent.sinks = loggerSink
# For each one of the sources, the type is defined
agent.sources.CustomTcpSource.type = com.vishnu.flume.source.CustomFlumeTCPSource
agent.sources.CustomTcpSource.port = 4443
agent.sources.CustomTcpSource.buffer = 1
# The channel can be defined as follows.
agent.sources.CustomTcpSource.channels = memoryChannel
# Each sink's type must be defined
agent.sinks.loggerSink.type = logger
#Specify the channel the sink should use
agent.sinks.loggerSink.channel = memoryChannel
# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100
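To exercise a source configured like this, any client that writes newline-terminated text to the port will do. The following is a minimal sketch of such a client using only the Java standard library. The class name `TestMessageClient` and its `send` method are hypothetical and not part of the project described here; the host and port are passed on the command line.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical test client; not part of the original project.
class TestMessageClient {

    // Opens a TCP connection, writes one newline-terminated line, then closes.
    static void send(String host, int port, String line) throws IOException {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8))) {
            out.print(line);
            out.print('\n'); // the source reads with readLine(), so terminate the line
            out.flush();
            if (out.checkError()) {
                throw new IOException("failed to write to " + host + ":" + port);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: TestMessageClient <host> <port> [message]");
            return;
        }
        String message = args.length > 2 ? args[2] : "test message";
        send(args[0], Integer.parseInt(args[1]), message);
    }
}
```

With the configuration above, running it as `java TestMessageClient localhost 4443` would send a single line to the agent, assuming the agent is running on the same machine.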
I sent a test message to try it out, and the resulting event looked like this:
Event: { headers:{hostname=/127.0.0.1:50999} body: 74 65 73 74 20 6D 65 73 73 61 67 65 test message }
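The header value `/127.0.0.1:50999` is the string form of the socket address, so it includes a leading slash and the client's ephemeral port. If only the host is wanted, the address can be unpacked before it is put into the header. The sketch below shows one way to do that with the standard library; `ClientAddress.hostOf` is a hypothetical helper, not something from the project above.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.net.UnknownHostException;

// Hypothetical helper; not part of the original project.
class ClientAddress {

    // Returns only the host part of a socket address, without the '/' prefix or the port.
    static String hostOf(SocketAddress address) {
        if (address instanceof InetSocketAddress) {
            InetSocketAddress inet = (InetSocketAddress) address;
            InetAddress ip = inet.getAddress();
            // getAddress() is null for unresolved addresses; fall back to the host string.
            return ip != null ? ip.getHostAddress() : inet.getHostString();
        }
        return String.valueOf(address);
    }

    public static void main(String[] args) throws UnknownHostException {
        SocketAddress remote = new InetSocketAddress(
                InetAddress.getByAddress(new byte[] {127, 0, 0, 1}), 50999);
        System.out.println(hostOf(remote)); // prints 127.0.0.1
    }
}
```

In the source above, this would mean passing `clientSocket.getRemoteSocketAddress()` through such a helper instead of calling `toString()` on it. Note that this yields an IP address; mapping it to a hostname would need a reverse lookup, which is left out here.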
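I have uploaded this project to my GitHub.
If you are using an exec source, nothing stops you from running a smarter command that prepends the hostname to each line of the log file.
Note: if the command uses something like a pipe, you also need to specify the shell, like this: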
tier1.sources.source1.type = exec
tier1.sources.source1.shell = /bin/sh -c
tier1.sources.source1.command = tail -F /var/log/auth.log | sed --unbuffered "s/^/$(hostname) /"
The messages then look like this:
frb.hi.inet 2015-11-17 08:39:39.432 INFO [...]
where frb.hi.inet is the name of our host.
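Because the hostname becomes part of the message body in this approach, whatever consumes the Kafka topic has to split it off again. A minimal sketch of that step follows, assuming the format is exactly the hostname, one space, then the original log line. `HostPrefixedLine` is a hypothetical name used only for illustration.

```java
// Hypothetical consumer-side helper; not part of the original answer.
class HostPrefixedLine {
    final String host;
    final String message;

    private HostPrefixedLine(String host, String message) {
        this.host = host;
        this.message = message;
    }

    // Splits "<hostname> <original log line>" at the first space.
    static HostPrefixedLine parse(String line) {
        int space = line.indexOf(' ');
        if (space < 0) {
            // No separator found: keep the whole line as the message, host unknown.
            return new HostPrefixedLine("", line);
        }
        return new HostPrefixedLine(line.substring(0, space), line.substring(space + 1));
    }

    public static void main(String[] args) {
        HostPrefixedLine parsed = parse("frb.hi.inet 2015-11-17 08:39:39.432 INFO [...]");
        System.out.println(parsed.host);    // prints frb.hi.inet
        System.out.println(parsed.message); // prints 2015-11-17 08:39:39.432 INFO [...]
    }
}
```

Splitting at the first space is safe here because hostnames cannot contain spaces, while the rest of the log line may.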