Find the closest pair of values and compute their mean in Python

Question

I have a DataFrame like this:

import pandas as pd
import numpy as np
import random

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 3)), 
                  columns=list('ABC'), 
                  index=['{}'.format(i) for i in range(100)])

ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
for row, col in random.sample(ix, int(round(.1*len(ix)))):
    df.iat[row, col] = np.nan

df = df.mask(np.random.random(df.shape) < .05)  #insert 5% of NaNs  

df.head()

    A   B   C
0  99  78  61
1  16  73   8
2  62  27  30
3  80   7  76
4  15  53  80

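I want to find the closest pair of values among columns A, B and C in each row and store the mean of that pair in column D. How can I do this in pandas? Thanks.

My real data contains some NaNs, so if a row has only two values, their mean should go into column D; if a row has only one value, that value should go into column D.

I tried computing the absolute difference of each pair, finding the minimum among columns diffAB, diffAC and diffBC, and then taking the mean of the pair with the smallest difference, but I suspect there is a better way.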

cols = ['A', 'B', 'C']
df[cols]=df[cols].fillna(0)

df['diffAB'] = (df['A'] - df['B']).abs()
df['diffAC'] = (df['A'] - df['C']).abs()
df['diffBC'] = (df['B'] - df['C']).abs()
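
The difference-column idea above can be carried through to column D. The following is a minimal sketch for rows with no missing values only, so it skips the `fillna(0)` step (replacing NaN with 0 would make 0 take part in the comparison as if it were a real value). The names `df_demo`, `pairs`, `diffs` and `means` are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

# Small complete sample taken from the first five rows shown in the question.
df_demo = pd.DataFrame({"A": [99, 16, 62, 80, 15],
                        "B": [78, 73, 27, 7, 53],
                        "C": [61, 8, 30, 76, 80]})

# Every possible pair of columns, keyed by a short label.
pairs = {"AB": ("A", "B"), "AC": ("A", "C"), "BC": ("B", "C")}

# Absolute difference and mean of each pair, one column per pair.
diffs = pd.DataFrame({k: (df_demo[a] - df_demo[b]).abs() for k, (a, b) in pairs.items()})
means = pd.DataFrame({k: (df_demo[a] + df_demo[b]) / 2 for k, (a, b) in pairs.items()})

# Label of the pair with the smallest difference in each row.
closest = diffs.idxmin(axis=1)

# Pick, for each row, the mean that belongs to that pair.
col_pos = means.columns.get_indexer(closest)
df_demo["D"] = means.to_numpy()[np.arange(len(means)), col_pos]

print(df_demo)
```

On this sample, column D matches the expected result given in the question (69.5, 12, 28.5, 78, 66.5).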

Update:

df['Count'] = df[['A', 'B', 'C']].apply(lambda x: sum(x.notnull()), axis=1)

if df['Count'] == 3:
    def meanFunc(row):
        minDiffPairIndex = np.argmin( [abs(row['A']-row['B']), abs(row['B']-row['C']), abs(row['C']-row['A']) ])      
        meanDict = {0: np.mean([row['A'], row['B']]), 1: np.mean([row['B'], row['C']]), 2: np.mean([row['C'], row['A']])}
        return meanDict[minDiffPairIndex]
if df['Count'] == 2:
    ...

Expected result:

    A   B   C   D
0  99  78  61  69.5
1  16  73   8   12
2  62  27  30  28.5
3  80   7  76   78
4  15  53  80  66.5

Answer 1

I would use numpy here:

In [11]: x = df.values

In [12]: x.sort()

In [13]: (x[:, 1:] + x[:, :-1])/2
Out[13]:
array([[69.5, 88.5],
       [12. , 44.5],
       [28.5, 46. ],
       [41.5, 78. ],
       [34. , 66.5]])

In [14]: np.diff(x)
Out[14]:
array([[17, 21],
       [ 8, 57],
       [ 3, 32],
       [69,  4],
       [38, 27]])

In [15]: np.diff(x).argmin(axis=1)
Out[15]: array([0, 0, 0, 1, 1])

In [16]: ((x[:, 1:] + x[:, :-1])/2)[np.arange(len(x)), np.diff(x).argmin(axis=1)]
Out[16]: array([69.5, 12. , 28.5, 78. , 66.5])

In [17]: df["D"] = ((x[:, 1:] + x[:, :-1])/2)[np.arange(len(x)), np.diff(x).argmin(axis=1)]
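
The session above runs on the five complete rows. On rows containing NaN, `np.diff` yields NaN and `argmin` then selects the NaN entry, so the result is NaN. The following is a hedged sketch of a NaN-tolerant variant of the same sort-and-diff idea. It relies on `np.sort` placing NaNs at the end of each row and sorts a copy rather than the DataFrame's own values. The function name `closest_pair_mean` is hypothetical, and returning NaN for an all-NaN row is an assumption, since the question does not specify that case.

```python
import numpy as np
import pandas as pd

def closest_pair_mean(frame):
    """Mean of the closest pair per row; NaN-tolerant (hypothetical helper)."""
    # Sort a copy of each row; np.sort puts NaNs last.
    x = np.sort(frame.to_numpy(dtype=float), axis=1)
    gaps = np.diff(x, axis=1)             # gaps between sorted neighbours
    mids = (x[:, 1:] + x[:, :-1]) / 2     # mean of each neighbouring pair
    # Any gap that involves a NaN is NaN; treat it as infinitely large.
    safe_gaps = np.where(np.isnan(gaps), np.inf, gaps)
    best = safe_gaps.argmin(axis=1)
    result = mids[np.arange(len(x)), best]
    # Fewer than two values: use the single value (or NaN if the row is empty).
    n_valid = (~np.isnan(x)).sum(axis=1)
    result = np.where(n_valid < 2, x[:, 0], result)
    return pd.Series(result, index=frame.index)

sample = pd.DataFrame({"A": [99, 80, np.nan, 5],
                       "B": [78, 7, 10, np.nan],
                       "C": [61, 76, 20, np.nan]})
sample["D"] = closest_pair_mean(sample[["A", "B", "C"]])
print(sample)
```

Sorting works because the two closest values in a row are always adjacent once the row is sorted, so only neighbouring gaps need to be compared.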

Answer 2

This may not be the fastest approach, but it is very simple.

def func(x):
    a,b,c = x
    diffs = np.abs(np.array([a-b,a-c,b-c]))
    means = np.array([(a+b)/2,(a+c)/2,(b+c)/2])
    return means[diffs.argmin()]

df["D"] = df.apply(func,axis=1)
df.head()
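
As written, `func` returns NaN for any row that contains a NaN, because the differences involving NaN are themselves NaN and `argmin` selects them. The following is a minimal sketch of a row-wise variant that drops missing values first and also works for any number of columns. The name `closest_pair_mean_row` is hypothetical, and returning NaN for a row with no values is an assumption rather than something the question specifies.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def closest_pair_mean_row(row):
    """Row-wise closest-pair mean that ignores NaNs (hypothetical helper)."""
    values = row.dropna().to_numpy(dtype=float)
    if len(values) == 0:
        return np.nan          # assumption: no data means no result
    if len(values) == 1:
        return values[0]       # a single value is taken as-is
    # Pick the pair with the smallest absolute difference.
    a, b = min(combinations(values, 2), key=lambda p: abs(p[0] - p[1]))
    return (a + b) / 2

demo = pd.DataFrame({"A": [99, np.nan, 5, np.nan],
                     "B": [78, 10, np.nan, np.nan],
                     "C": [61, 20, np.nan, np.nan]})
demo["D"] = demo[["A", "B", "C"]].apply(closest_pair_mean_row, axis=1)
print(demo)
```

With exactly two values there is only one pair, so the same code path covers the "mean of the two remaining values" rule from the question.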

Answer 3

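Assuming you need an additional column D holding the mean of the pair of values with the smallest difference among the three possible pairs (colA, colB), (colB, colC) and (colC, colA), the following code should work:

Update: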

def meanFunc(row):
    # Keep only the non-NaN values of the row (requires pandas imported as pd)
    nonNanValues = [x for x in row if pd.notna(x)]
    numOfNonNaN = len(nonNanValues)
    if numOfNonNaN == 0: return 0
    if numOfNonNaN == 1: return nonNanValues[0]
    if numOfNonNaN == 2: return np.mean(nonNanValues)
    if numOfNonNaN == 3:
        # All three values present: pick the pair with the smallest absolute difference
        minDiffPairIndex = np.argmin([abs(row['A']-row['B']), abs(row['B']-row['C']), abs(row['C']-row['A'])])
        meanDict = {0: np.mean([row['A'], row['B']]), 1: np.mean([row['B'], row['C']]), 2: np.mean([row['C'], row['A']])}
        return meanDict[minDiffPairIndex]

df['D'] = df.apply(meanFunc, axis=1)

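The code above handles NaN values in a row as follows: if all three values are NaN, column D is set to 0; if two values are NaN, the non-NaN value is assigned to column D; if exactly one value is NaN, the mean of the remaining two values is assigned to column D.

Previously: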

def meanFunc(row):
    minDiffPairIndex = np.argmin( [abs(row['A']-row['B']), abs(row['B']-row['C']), abs(row['C']-row['A']) ])      
    meanDict = {0: np.mean([row['A'], row['B']]), 1: np.mean([row['B'], row['C']]), 2: np.mean([row['C'], row['A']])}
    return meanDict[minDiffPairIndex]

df['D'] = df.apply(meanFunc, axis=1)

I hope I have understood your question correctly.

Answer 1: score 3, posted 2019-02-25 07:45:10

Answer 2: score 1, posted 2019-02-25 08:07:47

Answer 3: score 1, accepted, posted 2019-02-25 08:10:53
