Ingest log files from edge nodes to Hadoop

I am looking for a way to stream entire log files from edge nodes to Hadoop. To sum up the use case:

  • We have applications that produce log files ranging from a few MB to hundreds of MB per file.
  • We do not want to stream all the log events as they occur.
  • What we are looking for is pushing the log files in their entirety once they have been written completely (written completely = moved into another folder, for example... detecting this is not a problem for us).
  • This should be handled by some kind of lightweight agent on the edge nodes that writes to HDFS directly or, if necessary, to an intermediate "sink" that pushes the data to HDFS afterwards (a minimal sketch of the push step follows this list).
  • Centralized pipeline management (= configuring all edge nodes in a centralized manner) would be great.
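
To make the agent requirement concrete, here is a minimal sketch of what the push step could look like, assuming WebHDFS is enabled on the cluster and that "written completely" means the file was moved into a done/ folder. The namenode address, user, and directories below are placeholders, and this is an illustration of the documented two-step WebHDFS CREATE flow (using the requests library), not a recommendation to hand-roll an agent:

    import requests
    from pathlib import Path

    NAMENODE = "http://namenode.example.com:9870"  # placeholder (port 50070 on Hadoop 2.x)
    HDFS_USER = "ingest"                           # placeholder HDFS user
    DONE_DIR = Path("/data/logs/done")             # placeholder: finished logs get moved here
    HDFS_DIR = "/landing/applogs"                  # placeholder HDFS target directory

    def push_to_hdfs(local_file: Path) -> None:
        """Upload one whole file via the two-step WebHDFS CREATE call."""
        url = (f"{NAMENODE}/webhdfs/v1{HDFS_DIR}/{local_file.name}"
               f"?op=CREATE&overwrite=false&user.name={HDFS_USER}")
        # Step 1: the namenode answers with a redirect to a datanode.
        redirect = requests.put(url, allow_redirects=False)
        # Step 2: send the actual file content to the datanode it points at.
        with local_file.open("rb") as f:
            response = requests.put(redirect.headers["Location"], data=f)
        response.raise_for_status()  # expect 201 Created

    for log_file in DONE_DIR.glob("*.log"):
        push_to_hdfs(log_file)
        log_file.unlink()  # keep the edge node clean once the file is in HDFS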

I came up with the following evaluation:

  • Elastic's Logstash and Filebeat
    • Centralized pipeline management for edge nodes is available, e.g. one centralized configuration for all edge nodes (requires a license)
    • Configuration is easy, and a WebHDFS output sink exists for Logstash (using Filebeat would require an intermediate Filebeat + Logstash setup that outputs to WebHDFS)
    • Both tools have proven stable in production-level environments
    • Both tools are made for tailing logs and streaming single events as they occur rather than ingesting a complete file
  • Apache NiFi with MiNiFi
    • The use case of collecting logs and sending the entire file to another location, with a large number of edge nodes that all run the same "jobs", looks predestined for NiFi and MiNiFi
    • MiNiFi running on the edge node is lightweight (Logstash, on the other hand, is not so lightweight)
    • Logs can be streamed from MiNiFi agents to a NiFi cluster and then ingested into HDFS
    • Centralized pipeline management within the NiFi UI
    • Writing to an HDFS sink is available out of the box
    • The community looks active, and development is led by Hortonworks (?)
    • We have had good experiences with NiFi in the past
  • Apache Flume
    • Writing to an HDFS sink is available out of the box
    • Flume looks more like an event-based solution than a solution for streaming entire log files
    • No centralized pipeline management?
  • Apache Gobblin
    • Writing to an HDFS sink is available out of the box
    • No centralized pipeline management?
    • No lightweight edge node "agents"?
  • Fluentd
    • Maybe another tool to look at? Looking for your comments on this one...

I'd love to get some comments about which of the options to choose. The NiFi/MiNiFi option looks the most promising to me - and is free to use as well.

Have I forgotten any broadly used tool that is able to solve this use case?

I experience similar pain when choosing open-source big data solutions, simply because there are so many roads to Rome. Though "asking for technology recommendations is off topic for Stack Overflow", I still want to share my opinions.

  1. I assume you already have a Hadoop cluster to land the log files on. If you are using an enterprise-ready distribution, e.g. the HDP distribution, stay with their selection of data ingestion solutions. This approach always saves you a lot of effort in installation, central management setup, monitoring, security implementation, and system integration when there is a new release.

  2. You didn't mention how you would like to use the log files once they land in HDFS. I assume you just want to make an exact copy, i.e. data cleansing or data transformation to a normalized format is NOT required during ingestion. Now I wonder why you didn't mention the simplest approach: a scheduled hdfs command to put the log files into HDFS from the edge node (a minimal sketch follows).
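
For example, a minimal PoC version of that scheduled push, assuming the hdfs CLI is available on the edge node and finished logs are moved into a done/ folder as described in the question (both directories below are placeholders):

    #!/usr/bin/env python3
    """Minimal PoC: push completed log files into HDFS, e.g. run from cron."""
    import subprocess
    from pathlib import Path

    DONE_DIR = Path("/data/logs/done")  # placeholder: where finished logs are moved
    HDFS_DIR = "/landing/applogs"       # placeholder: HDFS target directory

    for log_file in DONE_DIR.glob("*.log"):
        # `hdfs dfs -put` copies the file; delete the local copy only on success.
        subprocess.run(["hdfs", "dfs", "-put", str(log_file), HDFS_DIR], check=True)
        log_file.unlink()

A crontab entry such as */15 * * * * /usr/local/bin/push_logs.py (path hypothetical) is enough for the PoC, and there is no extra infrastructure on the edge node at all.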

  3. Now I can share one production setup I was involved in. In this setup, log files are pushed to or pulled by a commercial mediation system that does data cleansing, normalization, enrichment, etc. The data volume is above 100 billion log records every day. There is a setup of 6 edge nodes behind a load balancer. Logs first land on one of the edge nodes, and an hdfs put command then moves them into HDFS. Flume was used initially but was replaced by this approach due to performance issues (it may very well be that the engineers lacked experience in optimizing Flume). Worth mentioning, though: the mediation system has a management UI for scheduling the ingestion scripts. In your case, I would start with a cron job for the PoC and then move to e.g. Airflow (a sketch of such a DAG follows).
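
The cron step from the PoC translates almost one-to-one into an Airflow DAG. A minimal sketch, assuming Airflow 2.x and the same placeholder directories as above (the dag_id and schedule are made up for illustration):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Placeholder paths; adjust to your environment.
    LOCAL_DONE_DIR = "/data/logs/done"
    HDFS_TARGET_DIR = "/landing/applogs"

    with DAG(
        dag_id="ingest_app_logs",
        start_date=datetime(2024, 1, 1),
        schedule_interval="*/15 * * * *",  # same cadence as the cron PoC
        catchup=False,
    ) as dag:
        # Push every completed file and remove the local copy on success.
        BashOperator(
            task_id="hdfs_put_logs",
            bash_command=(
                f'for f in {LOCAL_DONE_DIR}/*.log; do '
                '[ -e "$f" ] || continue; '
                f'hdfs dfs -put "$f" {HDFS_TARGET_DIR}/ && rm "$f"; '
                "done"
            ),
        )

Compared with plain cron you get retries, scheduling history, and a central UI, which covers the "centralized management" wish to some extent.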

Hope it helps! And I would be glad to know your final choice and your implementation.
