Get the column names of the non-NaN values in each row of a DataFrame (Python)

Question

I have a DataFrame with several features, and any feature can contain NaN values. For example:

feature1    feature2    feature3   feature4
  10           NaN          5          2
  2            1            3          1
  NaN          2            4          NaN

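Note: the columns can also contain strings.

How can I get, for each row, a list/array of the names of the columns that hold non-NaN values?

So for my example the resulting array would be: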

res = array([feature1, feature3, feature4], [feature1, feature2, feature3, feature4], 
[feature2, feature3])
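
To make the example reproducible, the table above can be built as follows. This is only a sketch: the question does not show how the DataFrame was created, so the constructor call and the use of `np.nan` for the missing cells are assumptions.

```python
import numpy as np
import pandas as pd

# Rebuild the example table from the question; np.nan marks the missing cells.
df = pd.DataFrame({
    "feature1": [10, 2, np.nan],
    "feature2": [np.nan, 1, 2],
    "feature3": [5, 3, 4],
    "feature4": [2, 1, np.nan],
})

# Boolean frame: True where a cell holds a value, False where it is NaN.
print(df.notna())
```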

Answer 1

You can use stack to keep only the non-NaN values, then aggregate the column names into lists with groupby.agg:

out = df.stack().reset_index().groupby('level_0')['level_1'].agg(list)

Output as a Series:

level_0
0              [feature1, feature3, feature4]
1    [feature1, feature2, feature3, feature4]
2                        [feature2, feature3]
Name: level_1, dtype: object

As a list:

out = (df.stack().reset_index().groupby('level_0')['level_1']
         .agg(list).to_numpy().tolist()
       )

Output:

[['feature1', 'feature3', 'feature4'],
 ['feature1', 'feature2', 'feature3', 'feature4'],
 ['feature2', 'feature3']]
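
For reference, here is a self-contained sketch of the stack approach on the question's data. The DataFrame construction is an assumption (the question only shows the table), and the explicit `.dropna()` is my addition: as far as I know, newer pandas releases change the `stack` implementation so that NaN entries may be kept, and the extra call is a safeguard that does nothing when they were already dropped.

```python
import numpy as np
import pandas as pd

# Example data from the question (construction is assumed).
df = pd.DataFrame({
    "feature1": [10, 2, np.nan],
    "feature2": [np.nan, 1, 2],
    "feature3": [5, 3, 4],
    "feature4": [2, 1, np.nan],
})

# Long format: one entry per (row label, column name) pair.
# dropna() is a safeguard in case stack() keeps NaN entries.
stacked = df.stack().dropna()

# After reset_index, 'level_0' is the row label and 'level_1' the column name.
out = (stacked.reset_index()
              .groupby("level_0")["level_1"]
              .agg(list)
              .tolist())
print(out)
```

Note that a row consisting only of NaN leaves no entries after stacking, so it gets no group and is missing from the result instead of appearing as an empty list.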

Answer 2

For better performance, use a list comprehension after converting the values to NumPy arrays:

c = df.columns.to_numpy()
res = [c[x].tolist() for x in df.notna().to_numpy()]
print(res)

Output:

[['feature1', 'feature3', 'feature4'],
 ['feature1', 'feature2', 'feature3', 'feature4'],
 ['feature2', 'feature3']]
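
The list comprehension works by boolean masking: each row of `df.notna()` is a True/False array that selects entries from the array of column names. Since the question mentions that columns can hold strings, the sketch below uses a small made-up frame with a hypothetical `label` column (not from the question) to show that a missing string stored as `None` is handled the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two numeric columns plus a string column with a missing value.
df = pd.DataFrame({
    "feature1": [10, 2, np.nan],
    "feature2": [np.nan, 1, 2],
    "label": ["a", None, "c"],
})

cols = df.columns.to_numpy()    # array of column names
mask = df.notna().to_numpy()    # 2-D boolean array, True where a value is present

# Index the column names with each row's boolean mask.
res = [cols[row].tolist() for row in mask]
print(res)
```

Unlike the stack approach, a row with no values at all would still appear here, as an empty list.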

Timings on a larger DataFrame (the 3 example rows repeated 1000 times, so 3000 rows):

df = pd.concat([df] * 1000, ignore_index=True)

In [28]: %%timeit
    ...: out = (df.stack().reset_index().groupby('level_0')['level_1']
    ...:          .agg(list).to_numpy().tolist()
    ...:        )
    ...:        
    ...: 
96.5 ms ± 8.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [29]: %%timeit
    ...: c = df.columns.to_numpy()
    ...: res = [c[x].tolist() for x in df.notna().to_numpy()]
    ...: 
3.36 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
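
The `%%timeit` cells above need IPython. A rough equivalent for a plain Python script, using the standard `timeit` module, might look like this. The function names `via_stack` and `via_mask` are mine, the DataFrame construction is assumed, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Example data from the question (construction is assumed), repeated 1000 times.
small = pd.DataFrame({
    "feature1": [10, 2, np.nan],
    "feature2": [np.nan, 1, 2],
    "feature3": [5, 3, 4],
    "feature4": [2, 1, np.nan],
})
df = pd.concat([small] * 1000, ignore_index=True)

def via_stack(frame):
    # Answer 1: stack, then group the column names by row label.
    return (frame.stack().dropna().reset_index()
                 .groupby("level_0")["level_1"].agg(list).tolist())

def via_mask(frame):
    # Answer 2: boolean-mask the column names row by row.
    cols = frame.columns.to_numpy()
    return [cols[row].tolist() for row in frame.notna().to_numpy()]

# Both approaches should agree on this data (no all-NaN rows).
assert via_stack(df) == via_mask(df)

runs = 10
t_stack = timeit.timeit(lambda: via_stack(df), number=runs) / runs
t_mask = timeit.timeit(lambda: via_mask(df), number=runs) / runs
print(f"stack/groupby: {t_stack * 1e3:.2f} ms per call")
print(f"boolean mask:  {t_mask * 1e3:.2f} ms per call")
```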

2 answers

Answer 1
1 2022-12-05 08:10:29

Answer 2
1 Accepted 2022-12-05 08:10:34