
Collect logs from Mesos Cluster

My team is deploying a new cluster on Amazon EC2 instances. After some research, we decided to go with Apache Mesos as the cluster manager and Spark for computation.

The first question we asked ourselves is what the best way would be to collect logs from all the machines, for each framework. So far, we have developed some custom bash/Python scripts that collect logs from predefined locations, zip them, and send the compressed file to S3. This rotation is triggered by a cron job that runs every hour.
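For context, here is a minimal sketch of what such a rotation script might look like; the log directories, bucket name, and key prefix below are placeholders, not our real values:

```python
#!/usr/bin/env python3
"""Hypothetical hourly log-rotation sketch: zip predefined log
directories and upload the archive to S3. Paths, bucket, and key
prefix are placeholder assumptions."""
import os
import time
import zipfile

import boto3

LOG_DIRS = ["/var/log/mesos", "/var/log/spark"]  # placeholder locations
BUCKET = "my-log-bucket"                          # placeholder bucket


def rotate_logs():
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = f"/tmp/logs-{stamp}.zip"

    # Zip every file found under the predefined log directories.
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for log_dir in LOG_DIRS:
            for root, _dirs, files in os.walk(log_dir):
                for name in files:
                    path = os.path.join(root, name)
                    zf.write(path, arcname=os.path.relpath(path, log_dir))

    # Upload the compressed archive and clean up the local copy.
    key = f"logs/{os.path.basename(archive)}"
    boto3.client("s3").upload_file(archive, BUCKET, key)
    os.remove(archive)


if __name__ == "__main__":
    rotate_logs()
```

A crontab entry such as `0 * * * * /usr/local/bin/rotate_logs.py` is what triggers it every hour.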

I have been searching for the "best" (or standard) way to do this. I found Apache Flume, which is a data collector that also works for logs, but I don't understand how it could be integrated into a Mesos cluster to collect logs (and for Spark).

I found this "similar" question, but the solutions are either not open source or no longer supported.

Is there a better way to rotate logs, or a standard way I'm missing?

Thank you very much.

There is no perfect answer to this. If you are using Spark and are interested in using Flume, you will have to write a custom Flume -> Spark interface, as one doesn't exist as far as I know. However, what you can do is this:

  1. Use Flume to ingest log data in real time.
  2. Have Flume pre-process the log data with a custom interceptor.
  3. Have Flume write to Kafka after pre-processing is done.
  4. Have Spark Streaming read off the Kafka queue to process the logs and run your computations (see the sketch after this list).
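As a rough illustration of step 4, here is a minimal PySpark Streaming sketch using the receiver-based `KafkaUtils.createStream` API from the Spark 1.x era; the ZooKeeper quorum, consumer group, topic name, and the ERROR-counting computation are all placeholder assumptions:

```python
"""Hypothetical sketch of step 4: a Spark Streaming job consuming log
lines from a Kafka topic fed by Flume. Connection details and the
topic name are placeholders."""
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="LogProcessor")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Receiver-based Kafka stream; each record is a (key, message) pair.
stream = KafkaUtils.createStream(
    ssc,
    zkQuorum="zookeeper-host:2181",   # placeholder ZooKeeper quorum
    groupId="log-consumer-group",     # placeholder consumer group
    topics={"logs": 1},               # topic name -> receiver thread count
)

# Example computation: count ERROR lines in each batch.
errors = stream.map(lambda kv: kv[1]).filter(lambda line: "ERROR" in line)
errors.count().pprint()

ssc.start()
ssc.awaitTermination()
```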

Spark Streaming is supposedly not up to prime-time production grade yet, but this is one potential solution.
