
Fetch the column names per row in a dataframe that are not NaN-values (Python)

I have a dataframe with several features, and any feature can contain NaN values. For example:

feature1    feature2    feature3   feature4
  10           NaN          5          2
  2            1            3          1
  NaN          2            4          NaN
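
For anyone who wants to reproduce this, the sample frame can be built roughly as follows. This is a minimal sketch that assumes the usual `np`/`pd` aliases and numeric values only; the name `df` matches the one used in the answers.

```python
import numpy as np
import pandas as pd

# Sample frame from the question; NaN marks the missing entries.
df = pd.DataFrame({
    "feature1": [10, 2, np.nan],
    "feature2": [np.nan, 1, 2],
    "feature3": [5, 3, 4],
    "feature4": [2, 1, np.nan],
})
print(df)
```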

Note: the columns can also contain strings.

How could we get a list/array per row that contains the column names of the non-NaN values?

Thus the result array for my example would be:

res = array([feature1, feature3, feature4], [feature1, feature2, feature3, feature4], 
[feature2, feature3])

You can use stack to keep only the non-NaN values, then aggregate the column names into lists with groupby.agg:

out = df.stack().reset_index().groupby('level_0')['level_1'].agg(list)

Output as Series:

0              [feature1, feature3, feature4]
1    [feature1, feature2, feature3, feature4]
2                        [feature2, feature3]
Name: level_1, dtype: object

As lists:

out = (df.stack().reset_index().groupby('level_0')['level_1']
         .agg(list).to_numpy().tolist())

Output:

[['feature1', 'feature3', 'feature4'],
 ['feature1', 'feature2', 'feature3', 'feature4'],
 ['feature2', 'feature3']]
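
The approach above relies on stack dropping NaN entries. With the newer stack implementation (the `future_stack=True` behaviour introduced in pandas 2.1), NaN entries are kept, so a defensive variant can add an explicit `dropna()`. The sketch below is an assumption about how to make the same idea version-independent, not part of the original answer; the sample frame is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feature1": [10, 2, np.nan],
    "feature2": [np.nan, 1, 2],
    "feature3": [5, 3, 4],
    "feature4": [2, 1, np.nan],
})

# stack() yields one entry per (row, column) pair; dropna() is a no-op when
# stack has already discarded the NaN entries and removes them otherwise.
stacked = df.stack().dropna()

# After reset_index() the unnamed index levels become 'level_0' (row label)
# and 'level_1' (column name); collect the column names per row.
out = (stacked.reset_index()
              .groupby("level_0")["level_1"]
              .agg(list)
              .tolist())
print(out)
```

Note that a row consisting only of NaN values has no entries left after stacking, so it would be missing from the result rather than appearing as an empty list.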

To improve performance, use a list comprehension over the values converted to a NumPy array:

c = df.columns.to_numpy()
res = [c[x].tolist() for x in df.notna().to_numpy()]
print(res)
[['feature1', 'feature3', 'feature4'], 
 ['feature1', 'feature2', 'feature3', 'feature4'], 
 ['feature2', 'feature3']]
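
Since the question mentions that columns may also hold strings, here is a small sketch of the same mask-based idea wrapped in a function and applied to mixed data. The function name `non_nan_columns` and the frame `mixed` are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def non_nan_columns(frame):
    """Return, for each row, the list of column names whose value is not NaN."""
    cols = frame.columns.to_numpy()
    mask = frame.notna().to_numpy()   # boolean array, one row per frame row
    # Boolean indexing picks the column names where the mask is True.
    return [cols[row].tolist() for row in mask]


# Hypothetical frame with a string column and one row that is entirely NaN.
mixed = pd.DataFrame({
    "feature1": [10, np.nan, np.nan],
    "feature2": ["a", "b", np.nan],
    "feature3": [5.0, 3.0, np.nan],
})

result = non_nan_columns(mixed)
print(result)
```

Unlike the stack-based approach, a row that is entirely NaN is kept here and shows up as an empty list.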

Timings on 3,000 rows (the sample frame repeated 1000 times):

df = pd.concat([df] * 1000, ignore_index=True)

In [28]: %%timeit
    ...: out = (df.stack().reset_index().groupby('level_0')['level_1']
    ...:          .agg(list).to_numpy().tolist()
    ...:        )
96.5 ms ± 8.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [29]: %%timeit
    ...: c = df.columns.to_numpy()
    ...: res = [c[x].tolist() for x in df.notna().to_numpy()]
3.36 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
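
Outside IPython, a comparable measurement can be made with the standard `timeit` module. The script below is only a sketch: the helper names `via_stack` and `via_mask` are made up for illustration, the `dropna()` call is an added safeguard for pandas versions whose stack keeps NaN entries, and the absolute numbers will differ from the timings above depending on machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame({
    "feature1": [10, 2, np.nan],
    "feature2": [np.nan, 1, 2],
    "feature3": [5, 3, 4],
    "feature4": [2, 1, np.nan],
})
big = pd.concat([small] * 1000, ignore_index=True)  # 3,000 rows


def via_stack(frame):
    # stack + groupby approach (dropna added as a version safeguard)
    return (frame.stack().dropna().reset_index()
                 .groupby("level_0")["level_1"].agg(list).tolist())


def via_mask(frame):
    # boolean-mask list comprehension approach
    cols = frame.columns.to_numpy()
    return [cols[row].tolist() for row in frame.notna().to_numpy()]


# Both give the same result here because no row is entirely NaN.
assert via_stack(big) == via_mask(big)

for fn in (via_stack, via_mask):
    seconds = timeit.timeit(lambda: fn(big), number=10) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```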

