
Outlier detection in PySpark DataFrame

I'm very new to the Spark and Hadoop world, and have started learning these topics on my own from the Internet. I wanted to know how we can perform outlier detection on a Spark DataFrame, given that DataFrames in Spark are immutable. Is there any Spark package or module that can do this? I'm using the PySpark API for Spark, so I would be very grateful if someone could explain how this can be done in PySpark, ideally with a small code example for performing outlier detection on a Spark DataFrame in PySpark (Python). Thanks a lot in advance!

To my knowledge, there is no API or package dedicated to detecting outliers, since what counts as an outlier varies with the application. However, there are a couple of well-known methods that help identify them. Let's first look at what the term outlier means: it simply refers to extreme values that fall outside the range of the rest of the observations. A good way to see how these outliers behave is to visualize the data as a histogram or scatter plot; they strongly influence the summary statistics of the data, such as the mean or the standard deviation, and compress the meaningful part of it. This is certainly misleading, and the real danger comes when we train on data that contain outliers: training takes longer because the model struggles with the out-of-range values, and we end up with a less accurate model, poor results, or a "never converging objective measure", i.e., the scores on the test and training data never settle, no matter the training time or the accuracy range we target.
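To make the effect on summary statistics concrete, here is a small sketch in plain Python with made-up numbers, showing how a single extreme value drags the mean and inflates the standard deviation:

```python
import statistics

# Ten well-behaved observations, then the same data plus one extreme value.
clean = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
with_outlier = clean + [100]

print(statistics.mean(clean))          # 10.0
print(statistics.mean(with_outlier))   # ~18.2 -- one value shifts the "center"
print(statistics.stdev(clean))         # ~1.15
print(statistics.stdev(with_outlier))  # ~27.2 -- the spread estimate explodes
```

A single point, less than 10% of the sample, has nearly doubled the mean and blown up the standard deviation by a factor of over twenty; any model or threshold built on these two numbers would be badly skewed.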

Although it is common to treat outliers as undesirable entities in your data, they can also signal anomalies, and in that case detecting them is itself a method for spotting fraud or improving security.

Here are some known methods for outlier detection (more details can be found in this good article):

  • Extreme Value Analysis,
  • Probabilistic and Statistical Models,
  • Linear Models: reduce the data dimensionality,
  • Proximity-based Models: mainly using clustering.

For the code, I suggest this good tutorial from MapR. Hope this answer helps. Good luck.

