
Pandas get highest non-null value in each row, in dataframe with variable number of columns

I have a dataframe with the following sample data, where the number of columns in Col.x format is unknown:

Col.1,Col.2,Col.3
Val1, 
Val2,Val3
Val3,
Val4,Val2,Val3

I need a separate column that takes, for each row, the value from the Col.x column with the highest x that is not null. For example:

Col.1,Col.2,Col.3,Latest
Val1,,,Val1
Val2,Val3,,Val3
Val3,,,Val3
Val4,Val2,Val3,Val3

I was able to solve the problem with the code below, but this solution a) depends on knowing the exact column names and b) doesn't handle a variable number of columns in a scalable way:

df["Latest"] = np.where(df["Col.3"].isnull(),np.where(df["Col.2"].isnull(),df["Col.1"],df["Col.2"]),df["Col.3"])

Part a) I can solve...

cols = [col for col in df.columns if 'Col' in col]

... I need help with part b).

We can use filter to select certain columns. Its like option matches a substring of the column name and its regex option matches a regular expression, so either can pick out the Col.x columns.
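As a side note on the two options, here is a small sketch of how like and regex can select different columns. The frame and the column name "Collected" are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical frame: "Collected" contains the substring "Col"
# but is not one of the numbered Col.x columns.
demo = pd.DataFrame({
    "Col.1": [1.0, 2.0],
    "Col.2": [None, 3.0],
    "Collected": [9.0, 9.0],
})

# like= keeps every label that contains the substring.
like_cols = demo.filter(like="Col").columns.tolist()

# regex= keeps labels matched by re.search; anchoring the pattern
# restricts the selection to "Col." followed by digits only.
regex_cols = demo.filter(regex=r"^Col\.\d+$").columns.tolist()

print(like_cols)
print(regex_cols)
```

If other column names might contain the same substring, the anchored regex is probably the safer choice.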

Given:

    Col1  Col2  Col3  Ignore_me
0   18.0   NaN  40.0       82.0
1    6.0   NaN   NaN       92.0
2  100.0   NaN  19.0       43.0
3   38.0  98.0   NaN        8.0

Doing:

df['Latest'] = (df[df.filter(like='Col') # Using filter to select certain columns.
                     .columns
                     .sort_values(ascending=False)] # Sort them descending.
                  .bfill(axis=1) # backfill values
                  .iloc[:,0]) # take the first column, 
                              # This has the first non-nan value.

Output (note that Ignore_me wasn't used):

    Col1  Col2  Col3  Ignore_me  Latest
0   18.0   NaN  40.0       82.0    40.0
1    6.0   NaN   NaN       92.0     6.0
2  100.0   NaN  19.0       43.0    19.0
3   38.0  98.0   NaN        8.0    98.0
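One caveat with sorting the column names as strings: once there are ten or more columns, "Col.10" sorts before "Col.2". A possible variation, sketched below, sorts on the numeric suffix instead. The helper name latest_non_null, its pattern argument, and the sample frame are assumptions for illustration only.

```python
import re

import numpy as np
import pandas as pd


def latest_non_null(frame, pattern=r"^Col\.?(\d+)$"):
    """Return, per row, the value from the highest-numbered non-null column.

    Hypothetical helper: `pattern` must capture the numeric suffix in group 1.
    Raises IndexError if no column matches the pattern.
    """
    rx = re.compile(pattern)
    # Map each matching column name to its numeric suffix.
    suffix = {c: int(m.group(1)) for c in frame.columns if (m := rx.match(str(c)))}
    # Highest suffix first, compared as integers rather than strings.
    cols = sorted(suffix, key=suffix.get, reverse=True)
    # Backfill across the row, then take the first column.
    return frame[cols].bfill(axis=1).iloc[:, 0]


# Made-up data with a two-digit suffix.
df10 = pd.DataFrame({
    "Col.1": [18.0, 6.0, 100.0],
    "Col.2": [5.0, 7.0, np.nan],
    "Col.10": [40.0, np.nan, np.nan],
    "Ignore_me": [82.0, 92.0, 43.0],
})
df10["Latest"] = latest_non_null(df10)
print(df10)
```

In the first row a plain string sort would pick Col.2, while the numeric sort picks Col.10.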

Use fillna with functools.reduce:

# sort column names by suffix in reverse order
cols = sorted(
   (col for col in df.columns if col.startswith('Col')), 
   key=lambda col: -int(col.split('.')[1])
)
cols
# ['Col.3', 'Col.2', 'Col.1']

from functools import reduce
df['Latest'] = reduce(lambda x, y: x.fillna(y), [df[col] for col in cols])

df
#  Col.1 Col.2 Col.3 Latest
#0  Val1   NaN   NaN   Val1
#1  Val2   NaN  Val3   Val3
#2  Val3   NaN   NaN   Val3
#3  Val4  Val2  Val3   Val3
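For reference, here is a self-contained sketch of the same reduce/fillna idea, run against the sample rows as they are laid out in the question's expected output. The frame is rebuilt by hand here, so treat the construction as an assumption about how the CSV was meant to parse.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Sample data as in the question's expected output; blank cells become NaN.
df = pd.DataFrame({
    "Col.1": ["Val1", "Val2", "Val3", "Val4"],
    "Col.2": [np.nan, "Val3", np.nan, "Val2"],
    "Col.3": [np.nan, np.nan, np.nan, "Val3"],
})

# Column names ordered by numeric suffix, highest first.
cols = sorted(
    (c for c in df.columns if c.startswith("Col.")),
    key=lambda c: int(c.split(".")[1]),
    reverse=True,
)

# Start from the highest-numbered column and fill its gaps
# from each lower-numbered column in turn.
df["Latest"] = reduce(lambda acc, nxt: acc.fillna(nxt), (df[c] for c in cols))
print(df)
```

Because the fold starts at the highest-numbered column, a value is only taken from a lower column when every higher one is null in that row.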
