
Flume agent: add host to message, then publish to a Kafka topic

We started to consolidate event-log data from our applications by publishing messages to a Kafka topic. Although we could write directly from each application to Kafka, we chose to treat it as a generic problem and use a Flume agent. This provides some flexibility: if we want to capture something else from a server, we can just tail a different source and publish to a different Kafka topic.

We created a Flume agent conf file to tail a log and publish to a Kafka topic:

tier1.sources  = source1
tier1.channels = channel1
tier1.sinks = sink1

tier1.sources.source1.type = exec
tier1.sources.source1.command = tail -F /var/log/some_log.log
tier1.sources.source1.channels = channel1

tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000

tier1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
tier1.sinks.sink1.topic = some_log
tier1.sinks.sink1.brokerList = hadoop01:9092,hadoop02.com:9092,hadoop03.com:9092
tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.batchSize = 20

Unfortunately, the messages themselves don't specify the host that generated them. If an application runs on multiple hosts and an error occurs, we have no way to tell which host produced the message.

I notice that, if Flume wrote directly to HDFS, we could use a Flume interceptor to write to a host-specific HDFS location. Although we could probably do something similar with Kafka, i.e. create a new topic for each server, this could become unwieldy; we'd end up with thousands of topics.
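For reference, Flume does ship a built-in host interceptor that stamps the agent's hostname (or IP) into each event's headers, which could be added to the tier1 config above roughly like this (a sketch; note it only sets a header, and the stock KafkaSink used here publishes only the event body, so the header may never reach Kafka unless the sink is configured to serialize full events):

```
tier1.sources.source1.interceptors = hostint
tier1.sources.source1.interceptors.hostint.type = host
# write the hostname rather than the IP address
tier1.sources.source1.interceptors.hostint.useIP = false
# header key to set; "host" is the default
tier1.sources.source1.interceptors.hostint.hostHeader = host
```

This is presumably why the answers below put the hostname into the message body instead of a header.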

Can Flume append/include the hostname of the originating host when it publishes to a Kafka topic?

You can create a custom TCP source that reads the client address and adds it to the event headers.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CustomFlumeTCPSource extends AbstractSource
        implements EventDrivenSource, Configurable {

    private static final Logger logger =
            LoggerFactory.getLogger(CustomFlumeTCPSource.class);

    private int port;
    private int buffer;
    private ServerSocket serverSocket;
    private Socket clientSocket;
    private BufferedReader receiveBuffer;

    @Override
    public void configure(Context context) {
        port = context.getInteger("port");
        buffer = context.getInteger("buffer");

        try {
            serverSocket = new ServerSocket(port);
            logger.info("FlumeTCP source initialized");
        } catch (Exception e) {
            logger.error("FlumeTCP source failed to initialize", e);
        }
    }

    @Override
    public void start() {
        try {
            clientSocket = serverSocket.accept();
            receiveBuffer = new BufferedReader(
                    new InputStreamReader(clientSocket.getInputStream()));
            logger.info("Connection established with client : "
                    + clientSocket.getRemoteSocketAddress());
            final ChannelProcessor channel = getChannelProcessor();
            // Stamp every event with the address of the host that sent it.
            final Map<String, String> headers = new HashMap<String, String>();
            headers.put("hostname", clientSocket.getRemoteSocketAddress().toString());
            String line;
            List<Event> events = new ArrayList<Event>();

            while ((line = receiveBuffer.readLine()) != null) {
                events.add(EventBuilder.withBody(line, Charset.defaultCharset(), headers));
                if (events.size() == buffer) {
                    channel.processEventBatch(events);
                    events.clear(); // reset the batch so events are not delivered twice
                }
            }
            if (!events.isEmpty()) {
                channel.processEventBatch(events); // flush a partial batch on disconnect
            }
        } catch (Exception e) {
            logger.error("FlumeTCP source failed", e);
        }
        super.start();
    }
}

The flume-conf.properties can be configured as:


# The configuration file needs to define the sources, 
# the channels and the sinks.
# Sources, channels and sinks are defined per agent, 
# in this case called 'agent'

agent.sources = CustomTcpSource
agent.channels = memoryChannel
agent.sinks = loggerSink

# For each one of the sources, the type is defined
agent.sources.CustomTcpSource.type = com.vishnu.flume.source.CustomFlumeTCPSource
agent.sources.CustomTcpSource.port = 4443
agent.sources.CustomTcpSource.buffer = 1


# The channel can be defined as follows.
agent.sources.CustomTcpSource.channels = memoryChannel

# Each sink's type must be defined
agent.sinks.loggerSink.type = logger

#Specify the channel the sink should use
agent.sinks.loggerSink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100
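With the custom source class on Flume's classpath (e.g. a jar dropped into Flume's lib/ directory; the exact deployment path depends on your install), the agent can be started in the usual way — the agent name and config file name here match the config above:

```
flume-ng agent --conf conf --conf-file flume-conf.properties --name agent
```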

I sent a test message, and the resulting event looked like:

Event: { headers:{hostname=/127.0.0.1:50999} body: 74 65 73 74 20 6D 65 73 73 61 67 65             test message }

I have uploaded the project to my GitHub.

If you're using the exec source, nothing prevents you from running a smarter command that prefixes the hostname to each line of the log file content.

Note: if the command uses pipes or similar shell features, you'll also need to specify the shell, like this:

tier1.sources.source1.type = exec
tier1.sources.source1.shell = /bin/sh -c
tier1.sources.source1.command =  tail -F /var/log/auth.log | sed --unbuffered "s/^/$(hostname) /"

The messages look like this:

frb.hi.inet 2015-11-17 08:39:39.432 INFO [...]

... where frb.hi.inet is the name of my host.
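The sed substitution itself can be checked without Flume. Here "app01" stands in for the output of $(hostname), and --unbuffered (only needed when streaming from tail -F) is dropped:

```shell
# Prefix a log line with a host name, as the exec source command does.
# "app01" is a stand-in for $(hostname).
echo "2015-11-17 08:39:39.432 INFO starting" | sed "s/^/app01 /"
```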
