Ingest log files from edge nodes to Hadoop
I am looking for a way to stream entire log files from edge nodes to Hadoop. To sum up the use case:
I came up with the following evaluation:
I'd love to get some comments about which of the options to choose. The NiFi/MiNiFi option looks the most promising to me, and it is free to use as well.
Have I forgotten any widely used tool that can solve this use case?
I experience similar pain when choosing open-source big data solutions; there are simply so many roads leading to Rome. Although "asking for technology recommendations is off topic for Stack Overflow", I still want to share my opinions.
I assume you already have a Hadoop cluster to land the log files on. If you are using an enterprise-ready distribution, e.g. HDP, stick with its selection of data ingestion tools. That approach saves you a lot of effort in installation, centralized management and monitoring, security, and system integration whenever there is a new release.
You didn't mention how you would like to use the log files once they land in HDFS. I assume you just want to make an exact copy, i.e. no data cleansing or transformation to a normalized format is required during ingestion. I wonder why you didn't mention the simplest approach: scheduled hdfs commands that put the log files into HDFS from the edge node (a minimal sketch follows below).
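To make the "scheduled hdfs put" idea concrete, here is a minimal sketch of a script you could run from cron on the edge node. The local and HDFS directory names are hypothetical, and it assumes the `hdfs` client is installed and configured on the edge node; this is an illustration of the approach, not the setup from the original post.

```python
#!/usr/bin/env python3
"""Minimal sketch of a cron-scheduled "hdfs dfs -put" ingestion script.

Directory names are assumptions for illustration only.
"""
import subprocess
from pathlib import Path

LOCAL_LOG_DIR = Path("/var/log/myapp")   # hypothetical local log directory on the edge node
HDFS_TARGET_DIR = "/data/raw/myapp"      # hypothetical HDFS landing directory


def hdfs_path_exists(hdfs_path: str) -> bool:
    # "hdfs dfs -test -e" exits with 0 if the path exists
    return subprocess.run(["hdfs", "dfs", "-test", "-e", hdfs_path]).returncode == 0


def ingest_log_files() -> None:
    for local_file in sorted(LOCAL_LOG_DIR.glob("*.log")):
        target = f"{HDFS_TARGET_DIR}/{local_file.name}"
        if hdfs_path_exists(target):
            continue  # already ingested on a previous run
        # -f is deliberately omitted so an existing file in HDFS is never overwritten
        subprocess.run(["hdfs", "dfs", "-put", str(local_file), target], check=True)


if __name__ == "__main__":
    ingest_log_files()
```

In practice you would only sweep rotated (closed) log files, so that a file is not uploaded while it is still being written to.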
Now I can share one production setup I was involved in. In that setup, log files are pushed to or pulled by a commercial mediation system that performs data cleansing, normalization, enrichment, etc. The data volume is above 100 billion log records per day. There are 6 edge nodes behind a load balancer. Logs first land on one of the edge nodes and are then put into HDFS with hdfs commands. Flume was used initially but was replaced by this approach due to performance issues (it may very well be that the engineers lacked experience in tuning Flume). Worth mentioning, though: the mediation system has a management UI for scheduling the ingestion scripts. In your case, I would start with a cron job for the PoC and then move to e.g. Airflow (see the sketch after this paragraph).
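Once the PoC works with cron, the same step can be wrapped in an Airflow DAG to get scheduling, retries, and a UI. The sketch below is written against the Airflow 2.x API; the schedule, paths, and DAG name are assumptions, not something from the original post.

```python
# Minimal sketch of an Airflow DAG wrapping the same "hdfs dfs -put" step (Airflow 2.x API assumed).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="edge_node_log_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # how often to sweep the edge node's log directory
    catchup=False,
) as dag:
    put_logs = BashOperator(
        task_id="hdfs_put_logs",
        bash_command=(
            "for f in /var/log/myapp/*.log; do "
            # "|| true" skips files that already exist in HDFS from a previous run
            "hdfs dfs -put \"$f\" /data/raw/myapp/ || true; "
            "done"
        ),
    )
```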
Hope it helps! I would be glad to hear about your final choice and your implementation.