
AWS SageMaker Random Cut Forest or Kinesis Data Analytics Random Cut Forest?

Original post: 2018-07-27 14:47:19. Tags: amazon-web-services, amazon-kinesis, amazon-kinesis-firehose, amazon-sagemaker

I need to put together an architecture that can detect anomalies in logs created by a web application.

The Random Cut Forest algorithm keeps coming up in my research, where it is offered by two services: SageMaker and Kinesis Data Analytics.

Which of these two services should I use in my architecture?

1 Answer

At their core, the two use nearly identical mathematical methodology, but they differ in how the algorithm is implemented within Kinesis and SageMaker, and those differences should help drive your decision.

Kinesis RandomCutForest:

Streaming version of the algorithm, which is great for near-real-time updates to the model.
Supports time decay of older records, shingling of the input data, and, if you are using multiple dimensions, anomaly attribution that helps you understand the effect of each dimension.
So, if your logs are stored in CloudWatch, you can use subscription filters (and Lambda if needed) to preprocess them and send them to Kinesis with little effort.
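As a rough illustration of the Lambda preprocessing step mentioned above, here is a minimal sketch. It assumes the subscription filter delivers log events to Lambda in the usual CloudWatch Logs shape (a base64-encoded, gzip-compressed JSON document under `event["awslogs"]["data"]`). The per-minute event count is a hypothetical feature chosen only for the example, and the function names are mine, not from any AWS library.

```python
import base64
import gzip
import json


def decode_cloudwatch_event(event):
    """Decode the base64 + gzip payload of a CloudWatch Logs subscription event."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    return payload.get("logEvents", [])


def to_feature_records(log_events, bucket_ms=60_000):
    """Count log events per time bucket (a hypothetical numeric feature)."""
    counts = {}
    for log_event in log_events:
        # CloudWatch timestamps are epoch milliseconds; floor to the bucket start.
        bucket = log_event["timestamp"] // bucket_ms * bucket_ms
        counts[bucket] = counts.get(bucket, 0) + 1
    return [
        {"bucket_start_ms": bucket, "event_count": count}
        for bucket, count in sorted(counts.items())
    ]


def handler(event, context):
    """Lambda entry point: turn raw log events into numeric records."""
    records = to_feature_records(decode_cloudwatch_event(event))
    # Assumption: a real function would forward `records` to the Kinesis
    # stream here (for example with boto3's Kinesis client); omitted so the
    # sketch stays runnable without AWS credentials.
    return records
```

Which numeric features to extract depends entirely on your logs; counts per minute, response times, or error rates are all plausible inputs for the streaming algorithm.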

SageMaker RandomCutForest:

Batch version of the algorithm, great for large datasets (typically stored in S3) or where there's no need to update the model frequently.
Similar to Kinesis, supports near-real-time scoring of incoming data points via an inference endpoint, but new data points do not change the underlying model.
Supports hyperparameter optimization, which identifies the best set of parameters for your model (such as number of samples, number of trees, etc.).
Scaling up instances for both training and scoring is straightforward, and the available SageMaker Notebooks can help you preprocess and prepare your data for training.
So, if your dataset is large and you don't need dynamic updates to your model, SageMaker should be the preferred solution for you.
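To make the endpoint-scoring point concrete, here is a minimal sketch of calling an already-deployed Random Cut Forest endpoint. It assumes the endpoint accepts `text/csv` rows and returns JSON of the form `{"scores": [{"score": ...}, ...]}`. The function name `score_points` and the endpoint name are hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
import json


def score_points(runtime_client, endpoint_name, points):
    """Send rows of numeric features to an RCF endpoint and return anomaly scores.

    `runtime_client` is expected to behave like boto3's "sagemaker-runtime"
    client; `points` is a list of equal-length numeric rows.
    """
    # One CSV line per data point, features separated by commas.
    body = "\n".join(",".join(str(value) for value in row) for row in points)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept="application/json",
        Body=body,
    )
    payload = json.loads(response["Body"].read())
    # Assumed response shape: {"scores": [{"score": 1.23}, ...]}
    return [item["score"] for item in payload["scores"]]


# Usage (requires AWS credentials and a deployed endpoint; names are placeholders):
#   import boto3
#   client = boto3.client("sagemaker-runtime")
#   scores = score_points(client, "my-rcf-endpoint", [[12.0, 0.3], [950.0, 0.9]])
```

Higher scores indicate more anomalous points. Because the model is fixed after training, the same input returns the same score until you retrain and redeploy.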