Problems with calculating linear regression for individual cells in a column in pandas

I'm trying to implement a linear regression calculation that fills individual empty cells based on the previous data in a column. Since I do not understand how to use the Python libraries for this, I wrote the whole calculation out in steps.

This is my dataframe:

index   value    delta
-52       0      42517
-51       0      42524
-50      216     42531
-49      345     42538
-48      237     42545
...
 -2      367     42862
 -1      310     42869
  0      226     42876
  1      NaN     42883
  2      NaN     42890
...
 49      NaN     43213
 50      NaN     43220
 51      NaN     43227
 52      NaN     43234

The known values (index = 0 and lower) always cover 52 rows. The number of rows above may differ, but I know it beforehand; in this example it is also 52. Unknown values always start at index = 1.

For a single value I calculate it like this (here for the row where the delta column is dd = 42883):

x = dftest['delta']
y = dftest['value']
x_mean= np.mean(x)
y_mean = np.mean(y)
x_std = np.std(x)
y_std = np.std(y)
corr = np.corrcoef(y, x)[1,0]
slope = corr * y_std / x_std
intercept = y_mean - slope * x_mean
n_vl = intercept + slope * dd
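
For reference, the formulas in these steps (slope = corr * std(y) / std(x), intercept = mean(y) - slope * mean(x)) are ordinary least squares for a single predictor, so they can be cross-checked against `np.polyfit`. A minimal sketch, assuming only the eight known rows visible in the table (the elided rows are left out, so the numbers will not match the full data); `fit_line` is a hypothetical helper name:

```python
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    corr = np.corrcoef(y, x)[1, 0]            # Pearson correlation of y and x
    slope = corr * np.std(y) / np.std(x)      # same formula as in the steps above
    intercept = np.mean(y) - slope * np.mean(x)
    return slope, intercept

# Only the known rows that are visible in the table (value is not NaN)
delta = [42517, 42524, 42531, 42538, 42545, 42862, 42869, 42876]
value = [0, 0, 216, 345, 237, 367, 310, 226]

slope, intercept = fit_line(delta, value)

# Cross-check: a degree-1 polynomial fit returns (slope, intercept)
ref_slope, ref_intercept = np.polyfit(delta, value, deg=1)

# Prediction for the first unknown row, dd = 42883
n_vl = intercept + slope * 42883
print(slope, ref_slope)
print(n_vl)
```

Both approaches should agree up to floating-point rounding, which suggests the manual steps can be replaced by a single library call inside a loop.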

This works for one value, but I do not understand how to write a loop that does this for all empty cells (beginning with index = 1) while taking the previously calculated values into account.

I tried to use the code from the first response here and adapt it, but it does not work.

Part of the code is below:

vl = dftest['value'].values
delta =  dftest['delta'].values
for index in range(0, vl.shape[0]):
    if np.isnan(vl[index]):
        x = delta.take(range(index-52,index+1),mode='wrap')
        y = vl.take(range(index-52,index+1),mode='wrap')
        y1 = np.nanmean(vl.take(range(index-52,index+1),mode='wrap'))
        y2 = np.nanstd(vl.take(range(index-52,index+1),mode='wrap'))
        x1 = np.nanmean(delta.take(range(index-52,index+1),mode='wrap'))
        x2 = np.nanstd(delta.take(range(index-52,index+1),mode='wrap'))
        corr = np.corrcoef(y, x)[1,0] 
        slope = corr * y2 / x2
        intercept = y1 - slope * x1
        n_vl = intercept + slope * dd
print (y)
print (x)        
print (y1)
print (y2)
print (x1)
print (x2)
print (corr)
print (slope)
print (intercept)
print (n_vl)

But it takes values below index = 0, not above. I do not know how to change this, or how to write it so that it calculates a value for every empty cell.

This is what I get as output for one value (from my code with the loop):

[ 226.   nan   nan   nan   nan   nan   nan   nan   nan   nan   nan   nan
nan   nan   nan   nan   nan   nan   nan   nan   nan   nan   nan   nan
nan   nan   nan   nan   nan   nan   nan   nan   nan   nan   nan   nan
nan   nan   nan   nan   nan   nan   nan   nan   nan   nan   nan   nan
nan   nan   nan   nan   nan]
[42876 42883 42890 42897 42904 42911 42918 42925 42932 42939 42946 42953
 42960 42967 42974 42981 42988 42995 43002 43009 43016 43023 43030 43037
 43044 43051 43058 43065 43072 43079 43086 43093 43100 43107 43108 43115
 43122 43129 43136 43143 43150 43157 43164 43171 43178 43185 43192 43199
 43206 43213 43220 43227 43234]
226.0
0.0
43055.8490566
104.701263481
nan
nan
nan
nan

I have been stuck on this for a long time and cannot move forward. I really need help.
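
One way the loop could be arranged, offered as a sketch rather than a verified solution: fit each missing row on the `window` rows immediately before it (no `mode='wrap'`, and excluding the row being filled), predict at that row's own `delta` instead of a fixed `dd`, and write the prediction back into the array so that later rows can use it. The printed output above contains only one real value (226) because nothing was written back, which is why `corr` comes out as `nan`. The helper name `fill_forward`, the `window` parameter and the toy data are assumptions for illustration, and `np.polyfit` is swapped in for the hand-written slope and intercept.

```python
import numpy as np
import pandas as pd

def fill_forward(df, window=53):
    """Fill NaNs in 'value' from top to bottom, fitting a line on the previous `window` rows."""
    vl = df['value'].to_numpy(dtype=float).copy()
    delta = df['delta'].to_numpy(dtype=float)
    for i in range(len(vl)):
        if np.isnan(vl[i]):
            start = max(0, i - window)          # plain slice, no wrap-around
            x, y = delta[start:i], vl[start:i]  # rows strictly before row i
            ok = ~np.isnan(y)
            if ok.sum() < 2:
                continue                        # a line needs at least two points
            slope, intercept = np.polyfit(x[ok], y[ok], deg=1)
            vl[i] = intercept + slope * delta[i]  # written back, so later rows see it
    out = df.copy()
    out['value'] = vl
    return out

# Toy data (made up): five known rows followed by three unknown rows
dftest = pd.DataFrame({
    'value': [10.0, 12.0, 15.0, 14.0, 18.0, np.nan, np.nan, np.nan],
    'delta': [0, 7, 14, 21, 28, 35, 42, 49],
})
filled = fill_forward(dftest, window=5)
print(filled)
```

Because the window slides, each later row is fitted on a mix of known and previously predicted values. With the data from the question, `window=53` would correspond to the rows from index -52 to 0 for the first unknown row.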

Just do

dftest['value'] = dftest['value'].fillna(52)  # fillna returns a new Series, so assign it back

which will fill all the NaNs in the value column with the number 52. If you need to be extra sure to fill NaNs only where index <= 0 (in other words, you expect NaNs in the value column for index > 0), then do:

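rows = dftest.index <= 0  # assumes "index" is the DataFrame index, not a column
dftest.loc[rows, 'value'] = dftest.loc[rows, 'value'].fillna(52)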

Remember that if you feel you need to use loops in pandas, you are most likely doing it wrong.

So, I decided to fill the empty cells in the column using a linear regression fitted on the known data.

import statsmodels.formula.api as smf
# Fit the model on the known data (the first 53 rows)
smresults = smf.ols('value ~ delta', df.iloc[:53]).fit()
print(smresults.summary())
# Fill the empty cells with the model's predictions
# (.loc avoids the chained assignment in df.value[53:] = ...)
df.loc[df.index[53:], 'value'] = smresults.predict(df.iloc[53:])

This is the best solution I managed to get.
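
A note on the requirement to take previously calculated values into account: a predicted point lies exactly on the fitted line, so adding it to the data and refitting by least squares returns the same line. With an expanding window (all earlier rows kept), filling one row at a time should therefore match a single fit on the known rows, up to rounding. A fixed-size rolling window that drops the oldest rows would generally give different results. A small check of this, with `np.polyfit` standing in for statsmodels and made-up data:

```python
import numpy as np

# Made-up data: six known rows followed by four unknown rows
delta = np.arange(0, 70, 7, dtype=float)
value = np.array([3.0, 5.0, 4.0, 8.0, 9.0, 7.0] + [np.nan] * 4)
n_known = 6

# One-shot: fit once on the known rows, predict every unknown row
slope, intercept = np.polyfit(delta[:n_known], value[:n_known], deg=1)
one_shot = value.copy()
one_shot[n_known:] = intercept + slope * delta[n_known:]

# Sequential: refit on all earlier rows (known and already predicted) for each unknown row
sequential = value.copy()
for i in range(n_known, len(value)):
    s, b = np.polyfit(delta[:i], sequential[:i], deg=1)
    sequential[i] = b + s * delta[i]

print(np.allclose(one_shot, sequential))
```

If this equivalence holds for your setup, the loop is unnecessary and the single fit is enough.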
