Handling NaNs in linear regression - scipy?
I have a dataset with NaNs scattered through it. I am using pandas to extract the data from a file and numpy to process it. Here is my code for reading the data:
import pandas as pd
import numpy as np
def makeArray(band):
    """
    Takes as argument a string as the name of a wavelength band.
    Converts the list of magnitudes in that band into a numpy array,
    replacing invalid values (where invalid == -999) with NaNs.
    Returns the array.
    """
    array_name = band + '_mag'
    array = np.array(df[array_name])
    array[array == -999] = np.nan
    return array
# Read data file
fields = ['no', 'NED', 'z', 'obj_type', 'S_21', 'power', 'SI_flag',
          'U_mag', 'B_mag', 'V_mag', 'R_mag', 'K_mag', 'W1_mag',
          'W2_mag', 'W3_mag', 'W4_mag', 'L_UV', 'Q', 'flag_uv']
magnitudes = ['U_mag', 'B_mag', 'V_mag', 'R_mag', 'K_mag', 'W1_mag',
              'W2_mag', 'W3_mag', 'W4_mag']
df = pd.read_csv('todo.dat', sep=' ',
                 names=fields, index_col=False)
# Define axes for processing
redshifts = np.array(df['z'])
y = np.log(makeArray('K'))
mask = np.isnan(y)
I think a minimal working example (MWE) is:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
randomNumberGenerator = np.random.RandomState(1000)
x = 4 * randomNumberGenerator.rand(100)
y = 4 * x - 1 + randomNumberGenerator.randn(100)
y[50] = np.nan
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
fit = slope*x + intercept
plt.scatter(x, y)
plt.plot(x, fit)
plt.show()
Commenting out the line y[50] = np.nan in the MWE produces a nice plot, but including it produces the same error messages as with my actual data:
C:\Users\Jeremy\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
return (self.a < x) & (x < self.b)
C:\Users\Jeremy\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
return (self.a < x) & (x < self.b)
C:\Users\Jeremy\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1818: RuntimeWarning: invalid value encountered in less_equal
cond2 = cond0 & (x <= self.a)
A sample of the actual dataframe:
no NED z obj_type S_21 power SI_flag U_mag B_mag V_mag R_mag K_mag W1_mag W2_mag W3_mag W4_mag L_UV Q flag_uv
1 SDSSJ000005.95+145310.1 2.499 * 0.0 0.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 0.0 0.0 NONE
4 SDSSJ000009.27+020621.9 1.432 UvS 0.0 0.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 0.0 0.0 NONE
5 SDSSJ000009.38+135618.4 2.239 QSO 0.0 0.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 0.0 0.0 NONE
6 SDSSJ000011.37+150335.7 2.18 * 0.0 0.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 0.0 0.0 NONE
11 SDSSJ000030.64-064100.0 2.606 QSO 0.0 0.0 -999.0 -999.0 -999.0 -999.0 15.46 -999.0 -999.0 -999.0 -999.0 23.342 56.211000000000006 UV
15 SDSSJ000033.05+114049.6 0.73 UvS 0.0 0.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 -999.0 0.0 0.0 NONE
27 LBQS2358+0038 0.95 QSO 0.0 0.0 17.342 18.483 18.203 17.825 -999.0 -999.0 -999.0 -999.0 -999.0 23.301 56.571999999999996 UV
I am plotting each _mag column against z, and I am trying to calculate and plot a linear regression that excludes the NaNs.
I have tried numpy.linalg, numpy.poly, scipy.stats.linregress, and statsmodels.api, but none of them seems to handle NaNs easily. The other questions I have found on SE are leading me around in circles.
How can I plot an OLS regression fit over my data, as shown in the MWE?
You can use df.dropna(). See the pandas documentation for pandas.DataFrame.dropna.
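As a small illustration of what dropna() does (the toy column values below are made up for the example): by default it removes every row that contains a NaN, and the subset argument restricts the NaN check to the columns you name, which is useful when each magnitude column is missing different rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'z' is complete, each magnitude column has one NaN.
df = pd.DataFrame({
    'z':     [0.5, 1.0, 1.5, 2.0],
    'K_mag': [15.1, np.nan, 15.9, 16.3],
    'U_mag': [np.nan, 17.2, 17.8, 18.1],
})

# Default behaviour: drop every row that has a NaN in any column.
all_valid = df.dropna()

# Only require 'z' and 'K_mag' to be present; NaNs in 'U_mag' are ignored.
k_valid = df.dropna(subset=['z', 'K_mag'])

print(len(all_valid), len(k_valid))
```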
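Put your data in a DataFrame so that dropna() can remove every row that contains at least one NaN value. That way, you will not get the warnings you saw before. Try this: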
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
randomNumberGenerator = np.random.RandomState(1000)
x = 4 * randomNumberGenerator.rand(100)
y = 4 * x - 1 + randomNumberGenerator.randn(100)
y[50] = np.nan
df1 = pd.DataFrame({'x': x})
df1['y'] = y
df1 = df1.dropna()
x = df1.x
y = df1.y
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
fit = slope*x + intercept
plt.scatter(x, y)
plt.plot(x, fit)
plt.show()
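A DataFrame is not strictly required for the filtering step. As a sketch, a boolean mask built with np.isfinite can select the valid points directly, and the same idea extends to fitting each _mag column against z. The helper name fit_without_nans and the toy table below are made up for illustration; only the -999 sentinel and the column naming follow the question.

```python
import numpy as np
import pandas as pd
from scipy import stats


def fit_without_nans(x, y):
    """Run linregress on the points where both x and y are finite.

    Hypothetical helper for this example, not part of scipy.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = np.isfinite(x) & np.isfinite(y)
    return stats.linregress(x[valid], y[valid])


# Toy stand-in for the real table; -999 marks a missing magnitude.
df = pd.DataFrame({
    'z':     [0.5, 1.0, 1.5, 2.0, 2.5],
    'K_mag': [15.0, -999.0, 16.0, 16.5, 17.0],
    'U_mag': [17.0, 17.5, -999.0, 18.5, -999.0],
})

# Turn the sentinel into NaN once, for all magnitude columns.
mag_columns = [c for c in df.columns if c.endswith('_mag')]
df[mag_columns] = df[mag_columns].replace(-999.0, np.nan)

# Fit each magnitude column against z, skipping the missing points.
fits = {}
for col in mag_columns:
    fits[col] = fit_without_nans(df['z'], df[col])
    print(col, fits[col].slope, fits[col].intercept)
```

To draw a fit, evaluate slope * z + intercept for the column in question and pass it to plt.plot on top of the scatter plot, as in the code above.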