
How to remove outliers from XY scatter plot

I'm working on a project that has X and Y values, and I'm trying to explore the area where no data should exist. As can be seen from the figures, most of the data is gathered on either side of the red line, and there is some data inside the red line. I just want to remove those outliers but couldn't achieve it. I tried Reverse KNN algorithms and distance calculations, but they didn't work on my data or I couldn't get them to work. Is there any possible solution for this?

My Python code for the scatter plot is below.

import pyodbc
import matplotlib.pyplot as plt
from astroML.plotting import scatter_contour
import numpy as np
import pandas as pd

conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=test;'
                      'Database=test;'
                      'Trusted_Connection=yes;')

sqlquery = "SELECT test FROM test"

SQL_Query = pd.read_sql_query(sqlquery, conn)

df = pd.DataFrame(SQL_Query, columns=['Data1', 'Data2'])

x = df['Data1']
y = df['Data2']
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
scatter_contour(x, y, threshold=20, log_counts=True, ax=ax,
                histogram2d_args=dict(bins=45),
                plot_args=dict(marker='.', linestyle='none', color='black',
                               markersize=1),
                contour_args=dict(cmap='summer'),
                filled_contour=False)

Figure 1

Figure 2

The easiest way would be to hand-pick and delete the values you want gone. A more complicated approach would be to calculate a kernel density estimate and filter out the points whose density falls below a certain threshold.

from scipy import stats

# .... (x and y as defined in the question)

xmin = x.min()
xmax = x.max()
ymin = y.min()
ymax = y.max()

# Perform a kernel density estimate on the data:
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([x, y])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)

This gives you a 2D 100x100 approximation of your data's density. If you want a more detailed kernel density estimate, you can change this 100 to a higher value. If you scale your x and y data onto the same 100-step grid, then the Z values below the threshold you select mark the points you want to delete.

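# Map each point to a grid index in 0..99 so it matches the 100x100 shape of Z
df['x_to_scale'] = (99 * (x - np.min(x)) / np.ptp(x)).astype(int)
df['y_to_scale'] = (99 * (y - np.min(y)) / np.ptp(y)).astype(int)

# Look up the estimated density of the grid cell each point falls into
density = Z[df['x_to_scale'].to_numpy(), df['y_to_scale'].to_numpy()]

# Keep only the points whose cell density reaches the threshold you choose
df = df[density >= your_threshold]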

This drops all the points that fall below a certain density threshold.
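If the grid lookup feels indirect, the same idea can be sketched by evaluating the kernel at the data points themselves and taking the threshold as a low quantile of those densities. This is only a minimal sketch under assumptions: the data is synthetic stand-in data (not the asker's table), and the helper name `drop_low_density` and the 2% quantile are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (assumption): two dense clusters plus a few sparse points between them
rng = np.random.default_rng(0)
left = rng.normal(loc=[-3.0, 0.0], scale=0.5, size=(500, 2))
right = rng.normal(loc=[3.0, 0.0], scale=0.5, size=(500, 2))
sparse = rng.uniform(low=[-1.0, -1.0], high=[1.0, 1.0], size=(20, 2))
df = pd.DataFrame(np.vstack([left, right, sparse]), columns=['Data1', 'Data2'])


def drop_low_density(frame, x_col, y_col, quantile=0.02):
    """Hypothetical helper: keep rows whose KDE density is at or above the given quantile."""
    values = np.vstack([frame[x_col].to_numpy(), frame[y_col].to_numpy()])
    kernel = stats.gaussian_kde(values)
    density = kernel(values)  # density evaluated at every data point, no grid needed
    threshold = np.quantile(density, quantile)
    return frame[density >= threshold]


filtered = drop_low_density(df, 'Data1', 'Data2', quantile=0.02)
print(len(df), len(filtered))
```

Which points end up below the threshold depends heavily on the kernel bandwidth (the `bw_method` argument of `gaussian_kde`) and on the quantile, so it is worth plotting the filtered result and adjusting both before trusting it.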

