
How to remove outliers from XY scatter plot

I'm working on a project that has X and Y values, and I'm trying to identify the area where no data should exist. As the figures show, most of the data gathers on either side of the red line, but some points fall inside it. I want to remove those outliers but haven't managed to. I tried Reverse KNN algorithms and distance calculations, but either they don't work on my data or I couldn't implement them correctly. Is there a possible solution for this?

My Python code for the scatter plot is below.

import pyodbc
import matplotlib.pyplot as plt
from astroML.plotting import scatter_contour
import numpy as np
import pandas as pd

conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=test;'
                      'Database=test;'
                      'Trusted_Connection=yes;')

sqlquery = "SELECT test FROM test"

# Run the query and keep the two columns to plot
SQL_Query = pd.read_sql_query(sqlquery, conn)
df = pd.DataFrame(SQL_Query, columns=['Data1', 'Data2'])

x = df['Data1']
y = df['Data2']

# Scatter plot with density contours where the points are crowded
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
scatter_contour(x, y, threshold=20, log_counts=True, ax=ax,
                histogram2d_args=dict(bins=45),
                plot_args=dict(marker='.', linestyle='none', color='black',
                               markersize=1),
                contour_args=dict(cmap='summer'),
                filled_contour=False)

Figure 1

Figure 2

The easiest way would be to hand-pick the values you want gone and delete them. A more involved option is to compute a kernel density estimate (KDE) and filter out the points whose density falls below a chosen threshold.
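For the hand-picking route, one possible sketch (not part of the original answer) is to trace the empty region by hand as a polygon and drop the points that fall inside it. It relies on matplotlib's Path.contains_points; the helper name drop_points_inside, the demo data, and the square's vertices are all made up for illustration, and in practice you would read the vertices off your own plot.

```python
import pandas as pd
from matplotlib.path import Path


def drop_points_inside(df, polygon, xcol="Data1", ycol="Data2"):
    """Return a copy of df without the rows whose (x, y) lies inside polygon.

    polygon is a list of (x, y) vertices; matplotlib treats the path as
    closed, so the first vertex does not need to be repeated at the end.
    """
    region = Path(polygon)
    inside = region.contains_points(df[[xcol, ycol]].to_numpy())
    return df[~inside]


# Hypothetical data and a hypothetical hand-drawn region (a square)
demo = pd.DataFrame({"Data1": [0.0, 1.0, 2.0, 5.0],
                     "Data2": [0.0, 1.0, 2.0, 5.0]})
square = [(0.5, 0.5), (2.5, 0.5), (2.5, 2.5), (0.5, 2.5)]

# (1, 1) and (2, 2) fall inside the square and are dropped;
# (0, 0) and (5, 5) are kept
cleaned = drop_points_inside(demo, square)
print(cleaned)
```

This only works well when the forbidden area is easy to outline by eye. The kernel density route continues below.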

from scipy import stats

....

xmin = x.min()
xmax = x.max()
ymin = y.min()
ymax = y.max()

# Perform a kernel density estimate on the data:
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([x, y])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)

This gives you a 2D approximation of the density of your data on a 100x100 grid. If you want a more detailed kernel density estimate, raise the 100 to a higher value. If you then scale your x and y data to integer grid indices (0 to 99 for a 100-point grid), each point maps to a cell of Z, and the points whose cell value is below the threshold you select are the ones to delete.

# Map each point to the index of its grid cell (0..99, matching the 100j grid above)
df['x_to_scale'] = (99 * (x - x.min()) / (x.max() - x.min())).astype(int)
df['y_to_scale'] = (99 * (y - y.min()) / (y.max() - y.min())).astype(int)

# Density of the grid cell each point falls in; your_threshold is the cutoff you choose
point_density = Z[df['x_to_scale'].to_numpy(), df['y_to_scale'].to_numpy()]

# drop() returns a new DataFrame, so assign the result
df = df.drop(df.index[point_density < your_threshold])

This drops every point whose grid cell is below the chosen density threshold.
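A variation on the above, as a rough sketch: instead of going through a grid, gaussian_kde can be evaluated directly at every data point, which avoids the index scaling step. The helper name drop_low_density, the quantile-based cutoff, and the synthetic data are assumptions made for illustration; they are not from the question. Note that evaluating the KDE at every point scales quadratically with the number of points, so it can be slow on large data.

```python
import numpy as np
import pandas as pd
from scipy import stats


def drop_low_density(df, xcol="Data1", ycol="Data2", quantile=0.05):
    """Drop the rows whose estimated density is in the lowest `quantile` share.

    The density is a Gaussian KDE fitted on all points and evaluated at
    each point itself, so no grid or index scaling is involved.
    """
    values = np.vstack([df[xcol].to_numpy(), df[ycol].to_numpy()])
    kernel = stats.gaussian_kde(values)
    density = kernel(values)                 # one density value per row
    cutoff = np.quantile(density, quantile)  # assumed way to pick a threshold
    return df[density >= cutoff]


# Synthetic demo data: one dense cluster plus two far-away stray points
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(500, 2))
stray = np.array([[8.0, 8.0], [-9.0, 7.0]])
demo = pd.DataFrame(np.vstack([cluster, stray]), columns=["Data1", "Data2"])

cleaned = drop_low_density(demo, quantile=0.01)
print(len(demo), len(cleaned))
```

A quantile is only one way to choose the cutoff. A fixed density value, as in the grid version, works the same way: compare density against it and keep the rows above it.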
