Handling NaN in linear regression with scipy?

Question

I have a dataset with NaNs scattered through it. I'm using pandas to pull the data from a file and numpy to process it. This is my code for reading the data:

import pandas as pd
import numpy as np

def makeArray(band):
    """
    Takes as argument a string as the name of a wavelength band.
    Converts the list of magnitudes in that band into a numpy array,
    replacing invalid values (where invalid == -999) with NaNs.
    Returns the array.
    """
    array_name = band + '_mag'
    array = np.array(df[array_name])
    array[array==-999]=np.nan
    return array

#   Read data file
fields = ['no', 'NED', 'z', 'obj_type','S_21', 'power', 'SI_flag', 
          'U_mag', 'B_mag', 'V_mag', 'R_mag', 'K_mag', 'W1_mag',
          'W2_mag', 'W3_mag', 'W4_mag', 'L_UV', 'Q', 'flag_uv']

magnitudes = ['U_mag', 'B_mag', 'V_mag', 'R_mag', 'K_mag', 'W1_mag',
          'W2_mag', 'W3_mag', 'W4_mag']

df = pd.read_csv('todo.dat', sep = ' ',
                   names = fields, index_col = False)

#   Define axes for processing
redshifts = np.array(df['z'])
y = np.log(makeArray('K'))
mask = np.isnan(y)

I think a minimal working example is:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

randomNumberGenerator = np.random.RandomState(1000)
x = 4 * randomNumberGenerator.rand(100)
y = 4 * x - 1+ randomNumberGenerator.randn(100)
y[50] = np.nan

slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
fit = slope*x + intercept

plt.scatter(x, y)
plt.plot(x, fit)
plt.show()

Commenting out the y[50] = np.nan line in the MWE produces a nice plot, but including it produces the same error messages as my real data:

C:\Users\Jeremy\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
C:\Users\Jeremy\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
C:\Users\Jeremy\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1818: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)

A summary of the actual data frame:

no  NED z   obj_type    S_21    power   SI_flag U_mag   B_mag   V_mag   R_mag   K_mag   W1_mag  W2_mag  W3_mag  W4_mag  L_UV    Q   flag_uv
1   SDSSJ000005.95+145310.1 2.499   *   0.0 0.0     -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  0.0 0.0 NONE
4   SDSSJ000009.27+020621.9 1.432   UvS 0.0 0.0     -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  0.0 0.0 NONE
5   SDSSJ000009.38+135618.4 2.239   QSO 0.0 0.0     -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  0.0 0.0 NONE
6   SDSSJ000011.37+150335.7 2.18    *   0.0 0.0     -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  0.0 0.0 NONE
11  SDSSJ000030.64-064100.0 2.606   QSO 0.0 0.0     -999.0  -999.0  -999.0  -999.0  15.46   -999.0  -999.0  -999.0  -999.0  23.342  56.211000000000006  UV
15  SDSSJ000033.05+114049.6 0.73    UvS 0.0 0.0     -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  -999.0  0.0 0.0 NONE
27  LBQS2358+0038   0.95    QSO 0.0 0.0     17.342  18.483  18.203  17.825  -999.0  -999.0  -999.0  -999.0  -999.0  23.301  56.571999999999996  UV

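I'm plotting each of the _mag columns against z, and I'm trying to calculate and plot a linear regression that excludes the NaNs.

I have tried numpy.linalg, numpy.poly, scipy.stats.linregress and statsmodels.api, but none of them seems to handle NaNs easily. The other questions I've found on SE are leading me in circles.

How can I plot an OLS regression fit over my data, as in the MWE?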

Answer 1

You can use df.dropna(); see the following link: pandas.DataFrame.dropna
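As a minimal sketch of what that could look like for the data frame in the question: the small frame below is invented for illustration (only the z, K_mag and U_mag column names and the -999 sentinel come from the question). dropna is restricted with subset so that a missing value in an unrelated band does not discard a row that is usable for the K-band fit.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the real file: same column names, invented values.
df = pd.DataFrame({
    "z":     [0.5, 1.0, 1.5, 2.0, 2.5, 3.0],
    "K_mag": [14.1, -999.0, 15.2, 15.4, -999.0, 16.3],
    "U_mag": [-999.0, 17.3, -999.0, 18.0, 18.2, -999.0],
})

# Turn the -999 sentinel into NaN in every column.
df = df.replace(-999.0, np.nan)

# Keep only rows where both z and K_mag are present; U_mag is ignored here.
clean = df.dropna(subset=["z", "K_mag"])

# Fit K_mag against z on the rows that survived.
result = stats.linregress(clean["z"], clean["K_mag"])
print(len(clean), result.slope, result.intercept)
```

On this toy frame a plain df.dropna() would keep only the one row that has no NaN in any column, which is why the subset argument matters when each band is fitted separately.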

Answer 2

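You can put the data into a DataFrame and drop every row that contains at least one NaN value (dropna() removes rows, not columns, by default). That way you will not get the warnings you were getting before. Try this: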

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd

randomNumberGenerator = np.random.RandomState(1000)
x = 4 * randomNumberGenerator.rand(100)
y = 4 * x - 1+ randomNumberGenerator.randn(100)
y[50] = np.nan

df1 = pd.DataFrame({'x': x})
df1['y'] = y
df1 = df1.dropna()
x = df1.x
y = df1.y

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
fit = slope*x + intercept

plt.scatter(x, y)
plt.plot(x, fit)
plt.show()
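The same filtering can also be done without a DataFrame, using a boolean mask in numpy. The sketch below reuses the MWE data from the question; the names rng, valid, x_valid and y_valid are introduced here for illustration, and plotting is left out to keep it short.

```python
import numpy as np
from scipy import stats

# Same synthetic data as the MWE in the question.
rng = np.random.RandomState(1000)
x = 4 * rng.rand(100)
y = 4 * x - 1 + rng.randn(100)
y[50] = np.nan

# True where both coordinates are finite (neither NaN nor inf).
valid = np.isfinite(x) & np.isfinite(y)
x_valid, y_valid = x[valid], y[valid]

# Fit only on the valid points.
slope, intercept, r_value, p_value, std_err = stats.linregress(x_valid, y_valid)

# Evaluate the fitted line on the full x array, ready for plt.plot(x, fit).
fit = slope * x + intercept
print(slope, intercept)
```

Because the mask is built from both arrays, x and y stay aligned after filtering, which is the main thing to get right when dropping NaNs by hand.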

Answer 1
2 2018-07-23 06:33:43

Answer 2
1 (accepted) 2018-07-23 06:47:34
