在 Pandas DataFrame 中外推值

Question

在 Pandas DataFrame 中插入 NaN 單元非常容易：

In [98]: df
Out[98]:
            neg       neu       pos       avg
250    0.508475  0.527027  0.641292  0.558931
500         NaN       NaN       NaN       NaN
1000   0.650000  0.571429  0.653983  0.625137
2000        NaN       NaN       NaN       NaN
3000   0.619718  0.663158  0.665468  0.649448
4000        NaN       NaN       NaN       NaN
6000        NaN       NaN       NaN       NaN
8000        NaN       NaN       NaN       NaN
10000       NaN       NaN       NaN       NaN
20000       NaN       NaN       NaN       NaN
30000       NaN       NaN       NaN       NaN
50000       NaN       NaN       NaN       NaN

[12 rows x 4 columns]

In [99]: df.interpolate(method='nearest', axis=0)
Out[99]:
            neg       neu       pos       avg
250    0.508475  0.527027  0.641292  0.558931
500    0.508475  0.527027  0.641292  0.558931
1000   0.650000  0.571429  0.653983  0.625137
2000   0.650000  0.571429  0.653983  0.625137
3000   0.619718  0.663158  0.665468  0.649448
4000        NaN       NaN       NaN       NaN
6000        NaN       NaN       NaN       NaN
8000        NaN       NaN       NaN       NaN
10000       NaN       NaN       NaN       NaN
20000       NaN       NaN       NaN       NaN
30000       NaN       NaN       NaN       NaN
50000       NaN       NaN       NaN       NaN

[12 rows x 4 columns]

我還希望它使用給定的方法推斷插值范圍之外的 NaN 值。 我怎樣才能最好地做到這一點？

Answer 1

外推 Pandas `DataFrame`

DataFrame可能是外推的，但是，pandas 中沒有簡單的方法調用，需要另一個庫（例如scipy.optimize ）。

外推

一般來說，外推需要對被外推的數據做出某些假設。 一種方法是通過曲線擬合數據的一些通用參數化方程，以找到最能描述現有數據的參數值，然后用於計算超出此數據范圍的值。 這種方法的困難和限制問題是在選擇參數化方程時必須對趨勢做出一些假設。 這可以通過對不同方程的反復試驗來找到，以獲得所需的結果，或者有時可以從數據源中推斷出來。 問題中提供的數據確實不夠大，無法獲得擬合良好的曲線； 然而，它足以說明。

下面是外推的一個例子DataFrame具有³階多項式

f ( x ) = a x ³ + b x ² + c x + d （等式 1）

此通用函數 ( func() ) 是曲線擬合到每一列以獲得唯一的列特定參數（即a 、 b 、 c 、 d ）。 然后，這些參數化方程用於為所有具有NaN的索引外推每列中的數據。

import pandas as pd
from cStringIO import StringIO
from scipy.optimize import curve_fit

df = pd.read_table(StringIO('''
                neg       neu       pos       avg
    0           NaN       NaN       NaN       NaN
    250    0.508475  0.527027  0.641292  0.558931
    500         NaN       NaN       NaN       NaN
    1000   0.650000  0.571429  0.653983  0.625137
    2000        NaN       NaN       NaN       NaN
    3000   0.619718  0.663158  0.665468  0.649448
    4000        NaN       NaN       NaN       NaN
    6000        NaN       NaN       NaN       NaN
    8000        NaN       NaN       NaN       NaN
    10000       NaN       NaN       NaN       NaN
    20000       NaN       NaN       NaN       NaN
    30000       NaN       NaN       NaN       NaN
    50000       NaN       NaN       NaN       NaN'''), sep='\s+')

# Do the original interpolation
df.interpolate(method='nearest', xis=0, inplace=True)

# Display result
print ('Interpolated data:')
print (df)
print ()

# Function to curve fit to the data
def func(x, a, b, c, d):
    return a * (x ** 3) + b * (x ** 2) + c * x + d

# Initial parameter guess, just to kick off the optimization
guess = (0.5, 0.5, 0.5, 0.5)

# Create copy of data to remove NaNs for curve fitting
fit_df = df.dropna()

# Place to store function parameters for each column
col_params = {}

# Curve fit each column
for col in fit_df.columns:
    # Get x & y
    x = fit_df.index.astype(float).values
    y = fit_df[col].values
    # Curve fit column and get curve parameters
    params = curve_fit(func, x, y, guess)
    # Store optimized parameters
    col_params[col] = params[0]

# Extrapolate each column
for col in df.columns:
    # Get the index values for NaNs in the column
    x = df[pd.isnull(df[col])].index.astype(float).values
    # Extrapolate those points with the fitted function
    df[col][x] = func(x, *col_params[col])

# Display result
print ('Extrapolated data:')
print (df)
print ()

print ('Data was extrapolated with these column functions:')
for col in col_params:
    print ('f_{}(x) = {:0.3e} x^3 + {:0.3e} x^2 + {:0.4f} x + {:0.4f}'.format(col, *col_params[col]))

推斷結果

Interpolated data:
            neg       neu       pos       avg
0           NaN       NaN       NaN       NaN
250    0.508475  0.527027  0.641292  0.558931
500    0.508475  0.527027  0.641292  0.558931
1000   0.650000  0.571429  0.653983  0.625137
2000   0.650000  0.571429  0.653983  0.625137
3000   0.619718  0.663158  0.665468  0.649448
4000        NaN       NaN       NaN       NaN
6000        NaN       NaN       NaN       NaN
8000        NaN       NaN       NaN       NaN
10000       NaN       NaN       NaN       NaN
20000       NaN       NaN       NaN       NaN
30000       NaN       NaN       NaN       NaN
50000       NaN       NaN       NaN       NaN

Extrapolated data:
               neg          neu         pos          avg
0         0.411206     0.486983    0.631233     0.509807
250       0.508475     0.527027    0.641292     0.558931
500       0.508475     0.527027    0.641292     0.558931
1000      0.650000     0.571429    0.653983     0.625137
2000      0.650000     0.571429    0.653983     0.625137
3000      0.619718     0.663158    0.665468     0.649448
4000      0.621036     0.969232    0.708464     0.766245
6000      1.197762     2.799529    0.991552     1.662954
8000      3.281869     7.191776    1.702860     4.058855
10000     7.767992    15.272849    3.041316     8.694096
20000    97.540944   150.451269   26.103320    91.365599
30000   381.559069   546.881749   94.683310   341.042883
50000  1979.646859  2686.936912  467.861511  1711.489069

Data was extrapolated with these column functions:
f_neg(x) = 1.864e-11 x^3 + -1.471e-07 x^2 + 0.0003 x + 0.4112
f_neu(x) = 2.348e-11 x^3 + -1.023e-07 x^2 + 0.0002 x + 0.4870
f_avg(x) = 1.542e-11 x^3 + -9.016e-08 x^2 + 0.0002 x + 0.5098
f_pos(x) = 4.144e-12 x^3 + -2.107e-08 x^2 + 0.0000 x + 0.6312

繪制`avg`列

如果沒有更大的數據集或不知道數據的來源，這個結果可能完全錯誤，但應該舉例說明外推DataFrame的過程。 在假定方程func()可能會需要與以獲得正確的推斷播放。 此外，也沒有嘗試使代碼高效。

更新：

如果您的索引是非數字的，例如DatetimeIndex ，請參閱此答案以了解如何推斷它們。

Answer 2

import pandas as pd
try:
    # for Python2
    from cStringIO import StringIO 
except ImportError:
    # for Python3
    from io import StringIO

df = pd.read_table(StringIO('''
                neg       neu       pos       avg
    0           NaN       NaN       NaN       NaN
    250    0.508475  0.527027  0.641292  0.558931
    999         NaN       NaN       NaN       NaN
    1000   0.650000  0.571429  0.653983  0.625137
    2000        NaN       NaN       NaN       NaN
    3000   0.619718  0.663158  0.665468  0.649448
    4000        NaN       NaN       NaN       NaN
    6000        NaN       NaN       NaN       NaN
    8000        NaN       NaN       NaN       NaN
    10000       NaN       NaN       NaN       NaN
    20000       NaN       NaN       NaN       NaN
    30000       NaN       NaN       NaN       NaN
    50000       NaN       NaN       NaN       NaN'''), sep='\s+')

print(df.interpolate(method='nearest', axis=0).ffill().bfill())

產量

            neg       neu       pos       avg
0      0.508475  0.527027  0.641292  0.558931
250    0.508475  0.527027  0.641292  0.558931
999    0.650000  0.571429  0.653983  0.625137
1000   0.650000  0.571429  0.653983  0.625137
2000   0.650000  0.571429  0.653983  0.625137
3000   0.619718  0.663158  0.665468  0.649448
4000   0.619718  0.663158  0.665468  0.649448
6000   0.619718  0.663158  0.665468  0.649448
8000   0.619718  0.663158  0.665468  0.649448
10000  0.619718  0.663158  0.665468  0.649448
20000  0.619718  0.663158  0.665468  0.649448
30000  0.619718  0.663158  0.665468  0.649448
50000  0.619718  0.663158  0.665468  0.649448

注意：我稍微更改了您的df以顯示與nearest插值與執行df.fillna不同。 （請參閱索引為 999 的行。）

我還添加了一行索引為 0 的 NaN，以表明bfill()也可能是必要的。

Answer 3

我遇到了同樣的問題，但我找不到任何特定於熊貓的簡單而有用的（沒有定義新函數）。 但是，我發現InterpolatedUnivariateSpline （來自 scipy）對於外推非常有用。 它可以為您提供更改訂單的靈活性，而不是給您一個常數。

這是相關的例子：

import matplotlib.pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline
x = np.linspace(-3, 3, 50)
y = np.exp(-x**2) + 0.1 * np.random.randn(50)
spl = InterpolatedUnivariateSpline(x, y)
plt.plot(x, y, 'ro', ms=5)
xs = np.linspace(-3, 3, 1000)
plt.plot(xs, spl(xs), 'g', lw=3, alpha=0.7)
plt.show()

在 Pandas DataFrame 中外推值

問題描述

3 個解決方案

解決方案1
29 已采納 2016-03-12 15:57:55

外推 Pandas `DataFrame`

外推

推斷結果

繪制`avg`列

解決方案2
6 2014-03-18 21:57:39

解決方案3
1 2021-03-08 20:22:12

在 Pandas DataFrame 中外推值

問題描述

3 個解決方案

解決方案1 29 已采納 2016-03-12 15:57:55

外推 Pandas DataFrame

外推

推斷結果

繪制avg列

解決方案2 6 2014-03-18 21:57:39

解決方案3 1 2021-03-08 20:22:12

解決方案1
29 已采納 2016-03-12 15:57:55

外推 Pandas `DataFrame`

繪制`avg`列

解決方案2
6 2014-03-18 21:57:39

解決方案3
1 2021-03-08 20:22:12