
How to find outliers in a given dataset using Python

coordinates = [(259, 168), (62, 133), (143, 163), (174, 270), (321, 385)]

slope = 0.76083799
intercept = 77.87127406

[Image: plot of the coordinates, not reproduced here]

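To me, the coordinate marked in brown is a potential outlier and therefore needs to be removed. So far I have been trying to use studentized residuals and jackknife residuals to eliminate such outliers. However, I have not been able to compute these residuals for the dataset I have.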

It would be very helpful if someone could show me how to compute the residuals for the dataset above.
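For reference, the two residuals named above are usually defined as follows for a straight-line fit through n points. These are the standard textbook definitions, not something taken from the original post, and the symbols (a, b, e_i, h_ii, r_i, t_i) are chosen here only for illustration.

```latex
% Fitted values and raw residuals for the line y = a + b x
\hat{y}_i = a + b\,x_i, \qquad e_i = y_i - \hat{y}_i

% Leverage of point i in simple linear regression
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}

% Internally studentized residual (variance estimated from all n points)
r_i = \frac{e_i}{\hat{\sigma}\,\sqrt{1 - h_{ii}}},
\qquad \hat{\sigma}^2 = \frac{1}{n - 2} \sum_{j=1}^{n} e_j^2

% Externally studentized (jackknife) residual (variance estimated without point i)
t_i = \frac{e_i}{\hat{\sigma}_{(i)}\,\sqrt{1 - h_{ii}}},
\qquad \hat{\sigma}_{(i)}^2 = \frac{1}{n - 3}\left( \sum_{j=1}^{n} e_j^2 - \frac{e_i^2}{1 - h_{ii}} \right)
```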

import numpy as np
import matplotlib.pyplot as plt

coordinates = [(259, 168), (62, 133), (143, 163), (174, 270), (321, 385)]

x=[x1[0] for x1 in coordinates]
y=[x1[1] for x1 in coordinates]

for x1,y1 in coordinates:
   plt.plot(x1,y1,marker="o",color="brown")
plt.show()

# using numpy polyfit method to find regression line slope and intercept 
z = np.polyfit(x,y,1)
print(z)
slope = z[0]
intercept =z[1]

newx = np.linspace(62,321,200)
newy = np.poly1d(z)
plt.plot(x,y, 'o', newx, newy(newx),color="black")
# plt.plot()
plt.plot(259,168,marker="o",color="brown")
plt.show()

#TODO
#remove the outliers and then display

Start by placing x and y into np.ndarrays, so that the residuals can be computed with elementwise arithmetic.

Input:

import numpy as np
import matplotlib.pyplot as plt

coordinates = [(259, 168), (62, 133), (143, 163), (174, 270), (321, 385)]

x=np.array([x1[0] for x1 in coordinates]) #Placed into array
y=np.array([x1[1] for x1 in coordinates]) #Placed into array

for x1,y1 in coordinates:
   plt.plot(x1,y1,marker="o",color="brown")
plt.show()

# using numpy polyfit method to find regression line slope and intercept 
z = np.polyfit(x,y,1)
print(z)
slope = z[0]
intercept =z[1]

newx = np.linspace(62,321,200)
newy = np.poly1d(z)
plt.plot(x,y, 'o', newx, newy(newx),color="black")
# plt.plot()
plt.plot(259,168,marker="o",color="brown")
plt.show()

Additional code:

print("old y: " + repr(y)) #Display original array of y values
print("old x: " + repr(x)) 
residual_array = abs(y - (intercept + slope * x)) #Create an array of residuals
max_accept_deviation = 100 #An arbitrary value of "acceptable deviation"
mask = residual_array >= max_accept_deviation #Create an array of TRUE/FALSE values. TRUE where residual array is larger than deviation
rows_to_del = tuple(te for te in np.where(mask)[0]) #np.where converts the mask to a list of row numbers which is converted to a tuple
cleaned_y = np.delete(y,rows_to_del) #np.delete deletes all row numbers in the earlier tuple
cleaned_x = np.delete(x,rows_to_del)
print("new y: " + repr(cleaned_y)) #Print the cleaned values
print("new x: " + repr(cleaned_x))

Output:

[  0.76083799  77.87127406]
old y: array([168, 133, 163, 270, 385])
old x: array([259,  62, 143, 174, 321])
new y: array([133, 163, 270, 385])
new x: array([ 62, 143, 174, 321])
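The threshold above is applied to raw residuals, whereas the question asked about studentized and jackknife residuals. The following is a minimal sketch of how those could be computed with NumPy alone, assuming the usual simple-linear-regression formulas for leverage and for the leave-one-out variance. The variable names (leverage, internal, external, suspect) are illustrative and do not come from the original answer.

```python
import numpy as np

# Data from the question
coordinates = [(259, 168), (62, 133), (143, 163), (174, 270), (321, 385)]
x = np.array([pt[0] for pt in coordinates], dtype=float)
y = np.array([pt[1] for pt in coordinates], dtype=float)

n = len(x)
p = 2  # number of fitted parameters: slope and intercept

# Ordinary least-squares line and raw residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Leverage of each point (diagonal of the hat matrix for a straight-line fit)
x_centered = x - x.mean()
leverage = 1.0 / n + x_centered ** 2 / np.sum(x_centered ** 2)

# Internally studentized residuals: variance estimated from all points
sse = np.sum(residuals ** 2)
sigma2 = sse / (n - p)
internal = residuals / np.sqrt(sigma2 * (1.0 - leverage))

# Externally studentized (jackknife) residuals: variance estimated
# with point i left out, using the closed-form leave-one-out identity
sigma2_loo = (sse - residuals ** 2 / (1.0 - leverage)) / (n - p - 1)
external = residuals / np.sqrt(sigma2_loo * (1.0 - leverage))

# Index of the point with the largest jackknife residual in absolute value
suspect = int(np.argmax(np.abs(external)))

print("internally studentized:", np.round(internal, 3))
print("externally studentized:", np.round(external, 3))
print("most suspicious point:", coordinates[suspect])
```

On this dataset the point with the largest absolute jackknife residual is (259, 168), the same one the raw-residual threshold removes. A cutoff such as |t| > 2 or 3 is a common rule of thumb, but with only five points any such cutoff is rough and should be treated as a judgment call.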

