How to identify outliers with a density plot

Question

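I am trying to identify outliers from a density plot. I am currently using the seaborn library to plot my data. How would I go about identifying the outliers? I have been looking into computing a Z-score with the stats library; is that the only way, or can it be done with the density plot?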

Answer 1

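Kernel density estimation is an estimate of an assumed probability density function (pdf) from the given data. The question is then which data points should be treated as outliers. Outliers are rare data points, i.e. points where the pdf is very low. We do not know the pdf itself, but we do have its estimate, so we can use that estimate to identify outliers.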
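
For reference, the usual textbook form of a kernel density estimate is written out below, with the thresholding rule next to it. Here n is the number of data points x_1, ..., x_n, K is the kernel, and h is the bandwidth. The threshold symbol \tau is introduced here only for illustration and does not appear in the answer.

```latex
\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
x_j \text{ is flagged as an outlier if } \hat{f}_h(x_j) < \tau
```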

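So the basic idea is: 1) compute the kernel density estimate at every data point; 2) find the points whose estimate is below some predefined threshold. Those points are the outliers.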

Let's write some code to illustrate this.

import numpy as np
# import seaborn as sns  # seaborn can probably give you the pdf-estimate values too; I would use scikit-learn for this.
from matplotlib import pyplot as plt
from sklearn.neighbors import KernelDensity

# 100 normally distributed data points and approximately 10 outliers at the end of the array.
data = np.r_[np.random.randn(100), np.random.rand(10)*100][:, np.newaxis]

# you can use kernel='gaussian' instead
kde = KernelDensity(kernel='tophat', bandwidth=0.75).fit(data)

yvals = kde.score_samples(data)  # yvals are logs of pdf-values
yvals[np.isinf(yvals)] = np.nan  # any -inf values (zero density) are set to nan

# approx. 10 percent of the smallest pdf-values: let's treat them as outliers
# nanpercentile skips nan entries; plain np.percentile would return nan if any were present
threshold = np.nanpercentile(yvals, 10)
# nan entries had zero estimated density, so count them as outliers too
outlier_inds = np.where(np.isnan(yvals) | (yvals < threshold))[0]
print(outlier_inds)
non_outlier_inds = np.where(yvals >= threshold)[0]
print(non_outlier_inds)

[ 33  49 100 101 102 103 105 106 107 108 109]
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  50  51  52  53  54  55
  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73
  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91
  92  93  94  95  96  97  98  99 104]

# The data mix small values (around 0, some of them negative) with large ones (up to 100),
# so plot them on a symmetric-log y axis; np.log itself would turn the negative samples into nan.
plt.plot(non_outlier_inds, data[non_outlier_inds], 'ro',
         outlier_inds, data[outlier_inds], 'bo')
plt.yscale('symlog')
plt.gca().set_xlabel('Index')
plt.gca().set_ylabel('data (symlog scale)')
plt.show()
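
To make the two steps concrete without any library beyond NumPy, here is a minimal sketch that evaluates a Gaussian kernel density estimate at each data point and flags the lowest-density points. It is only an illustration under stated assumptions, not the scikit-learn implementation: the function names `gaussian_kde_log_density` and `low_density_indices` are hypothetical, the kernel is Gaussian rather than the tophat kernel used above, and the synthetic sample uses a fixed seed and evenly spaced far-away points so that the run is repeatable.

```python
import numpy as np


def gaussian_kde_log_density(values, bandwidth):
    """Log of a Gaussian kernel density estimate, evaluated at each data point."""
    x = np.asarray(values, dtype=float)
    # Pairwise scaled differences (x_i - x_j) / h, shape (n, n).
    u = (x[:, None] - x[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    # Average the kernel contributions and rescale by the bandwidth.
    density = kernel.sum(axis=1) / (x.size * bandwidth)
    return np.log(density)


def low_density_indices(values, bandwidth=0.75, percent=10):
    """Indices of points whose log-density is below the given percentile."""
    log_density = gaussian_kde_log_density(values, bandwidth)
    threshold = np.percentile(log_density, percent)
    return np.where(log_density < threshold)[0]


# Assumed synthetic sample: 100 standard-normal points followed by
# 10 far-away points at 50, 55, ..., 95 (indices 100 to 109).
rng = np.random.default_rng(0)
sample = np.r_[rng.standard_normal(100), 50.0 + 5.0 * np.arange(10)]

flagged = low_density_indices(sample)
print(flagged)
```

Because the pairwise matrix has n by n entries, this sketch only suits small samples. The percentile is a tuning choice: with 10 percent on 110 points, the cut can also pick up a point from the tail of the normal cluster in addition to the far-away points.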
