How to remove outliers from an XY scatter plot

Question

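I am working on a project with X and Y values, and I am trying to explore a region where no data should exist. As the plot shows, most of the data is concentrated on either side of the red lines, but some data also lies inside the red lines. I only want to remove those outliers, but I have not been able to. I tried a reverse KNN algorithm and distance calculations, but either they do not work on my data or I could not get them right. Is there a possible solution?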

My Python code for the scatter plot is below.

import pyodbc
import matplotlib.pyplot as plt
from astroML.plotting import scatter_contour
import numpy as np
import pandas as pd

conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=test;'
                      'Database=test;'
                      'Trusted_Connection=yes;')

sqlquery = "SELECT test FROM test"

SQL_Query = pd.read_sql_query(sqlquery, conn)

df = pd.DataFrame(SQL_Query, columns=['Data1', 'Data2'])

x = df['Data1']
y = df['Data2']
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
scatter_contour(x, y, threshold=20, log_counts=True, ax=ax,
                histogram2d_args=dict(bins=45),
                plot_args=dict(marker='.', linestyle='none', color='black',
                               markersize=1),
                contour_args=dict(cmap='summer'),
                filled_contour=False)

Answer 1

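The simplest approach is to select and delete the unwanted values by hand. A more involved approach is to compute a kernel density estimate and filter out the points where the estimated density falls below some threshold.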

from scipy import stats

....

xmin = x.min()
xmax = x.max()
ymin = y.min()
ymax = y.max()

# Perform a kernel density estimate on the data:
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([x, y])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)

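This gives you a 2D approximation of the data's density on a 100x100 grid. For a finer kernel density estimate, change the 100 to a larger value. If you then scale the x and y data onto the grid's index range, the points that land in cells whose Z value is below a threshold you choose are the ones to delete.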

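# Map each point to its nearest grid cell. The grid has 100 points per axis,
# so valid indices run from 0 to 99 (scaling by 100 would overflow at the maximum).
df['x_to_scale'] = np.rint(99 * (x - x.min()) / (x.max() - x.min())).astype(int)
df['y_to_scale'] = np.rint(99 * (y - y.min()) / (y.max() - y.min())).astype(int)

# your_threshold is a density cutoff that you choose yourself.
low_density = Z[df['x_to_scale'].to_numpy(), df['y_to_scale'].to_numpy()] < your_threshold

# Keep only the rows whose grid cell is at or above the cutoff.
df = df.loc[~low_density]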

This removes every point whose estimated density falls below the threshold.
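As a variant of the idea above (not the code from this answer), the density can also be evaluated directly at each data point, which avoids the grid and the index scaling altogether. The following is a minimal sketch under stated assumptions: the data is synthetic (two dense bands with a few stray points between them, standing in for the plot in the question), and the 2% quantile cutoff is a hypothetical choice that would need tuning on real data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the real data (assumption): two dense bands
# plus a handful of sparse points in the gap between them.
rng = np.random.default_rng(0)
band_a = rng.normal(loc=[0.0, -3.0], scale=[2.0, 0.3], size=(500, 2))
band_b = rng.normal(loc=[0.0, 3.0], scale=[2.0, 0.3], size=(500, 2))
stray = rng.uniform(low=[-5.0, -1.0], high=[5.0, 1.0], size=(20, 2))
df = pd.DataFrame(np.vstack([band_a, band_b, stray]),
                  columns=['Data1', 'Data2'])

# gaussian_kde expects an array of shape (n_dims, n_points).
values = df[['Data1', 'Data2']].to_numpy().T
kernel = stats.gaussian_kde(values)

# Estimated density at each data point itself, with no grid involved.
density = kernel(values)

# Hypothetical cutoff: treat the lowest-density 2% of points as outliers.
threshold = np.quantile(density, 0.02)
filtered = df.loc[density >= threshold]
```

Evaluating the kernel at every point costs roughly n squared operations, so for very large datasets the grid lookup may still be the cheaper option. A quantile-based cutoff also always removes a fixed share of points, including low-density points at the outer edges of the bands, so it is worth plotting `filtered` before trusting the result.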

Solution 1
0 2021-11-06 09:27:11
