Drop only NaN values from a row in a dataframe
I have a dataframe which looks something like this:
Df
lev1 lev2 lev3 lev4 lev5 description
RD21 Nan Nan Nan Nan Oil
Nan RD32 Nan Nan Nan Oil/Canola
Nan Nan RD33 Nan Nan Oil/Canola/Wheat
Nan Nan RD34 Nan Nan Oil/Canola/Flour
Nan Nan Nan RD55 Nan Oil/Canola/Flour/Thick
ED54 Nan Nan Nan Nan Rice
Nan ED66 Nan Nan Nan Rice/White
Nan Nan ED88 Nan Nan Rice/White/Jasmine
Nan Nan ED89 Nan Nan Rice/White/Basmati
Nan ED68 Nan Nan Nan Rice/Brown
I want to remove all the NaN values and keep only the non-NaN values, something like this:
DF2
code description
RD21 Oil
RD32 Oil/Canola
RD33 Oil/Canola/Wheat
RD34 Oil/Canola/Flour
RD55 Oil/Canola/Flour/Thick
.
.
.
How do I do this? I tried using the notna() method, but it returns a boolean mask of the dataframe rather than the values themselves. Any help would be appreciated.
You can use stack and groupby like this to find the first non-null value in each row:
df['code'] = df[['lev1', 'lev2', 'lev3', 'lev4', 'lev5']].stack().groupby(level=0).first().reindex(df.index)
Now you can select the code column and the description column:
df[['code', 'description']]
code description
0 RD21 Oil
1 RD32 Oil/Canola
2 RD33 Oil/Canola/Wheat
3 RD34 Oil/Canola/Flour
4 RD55 Oil/Canola/Flour/Thick
5 ED54 Rice
6 ED66 Rice/White
7 ED88 Rice/White/Jasmine
8 ED89 Rice/White/Basmati
9 ED68 Rice/Brown
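For a self-contained check, here is a minimal sketch of the same idea on a three-row frame. The sample data and the `cols` name are assumptions for illustration; the approach relies on `GroupBy.first()` returning the first non-null entry in each group.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (assumed data, modelled on the question).
df = pd.DataFrame(
    {
        "lev1": ["RD21", np.nan, np.nan],
        "lev2": [np.nan, "RD32", np.nan],
        "lev3": [np.nan, np.nan, "RD33"],
        "description": ["Oil", "Oil/Canola", "Oil/Canola/Wheat"],
    }
)
cols = ["lev1", "lev2", "lev3"]

# stack() turns the wide block into one long Series indexed by (row, column);
# grouping on the row level and taking first() picks the first non-null code.
stacked = df[cols].stack()
df["code"] = stacked.groupby(level=0).first().reindex(df.index)

result = df[["code", "description"]]
print(result)
```

The final `reindex(df.index)` keeps the result aligned with the original rows, so a row with no code at all would simply get NaN.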
We can mask by notna():
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
{
'l1': [np.nan, 5],
'l2': [6, np.nan],
'd': ['a', 'b']
}
)
notna = df1[['l1', 'l2']].notna().values
notna_values = df1[['l1', 'l2']].values[notna]
print(notna_values)
df2 = pd.DataFrame(df1['d'])
df2['code'] = notna_values
print(df2)
Output:
d code
0 a 6.0
1 b 5.0
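The boolean-mask trick flattens the selected values row by row, so it only lines up with the rows if every row holds exactly one non-null value. A minimal sketch of that sanity check, reusing the `l1`/`l2`/`d` column names from the example (the `levels` and `counts` names are just illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"l1": [np.nan, 5], "l2": [6, np.nan], "d": ["a", "b"]})
levels = df1[["l1", "l2"]]

# Number of non-null values per row; the mask approach needs exactly one each.
counts = levels.notna().sum(axis=1)
if not (counts == 1).all():
    raise ValueError("expected exactly one non-null value per row")

# Boolean indexing on the 2-D array flattens in row order, one value per row.
df2 = df1[["d"]].copy()
df2["code"] = levels.to_numpy()[levels.notna().to_numpy()]
print(df2)
```

Without the check, a row with zero or two non-null values would either raise a length mismatch on assignment or silently shift codes onto the wrong rows.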
You can apply a function over every row in df[cols] (the subview over the problematic columns), dropping every NaN and taking the only value remaining.
>>> cols = "lev1 lev2 lev3 lev4 lev5".split()
>>> df["code"] = df[cols].apply(lambda row: row.dropna().iloc[0], axis=1)  # axis=1 applies the function to each row
You can also drop the original columns with drop if you don't need them anymore.
>>> df.drop(columns=cols, inplace=True)
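A slightly more defensive sketch of the row-wise version, assuming a row could occasionally have no code at all. The helper name `first_code` and the sample data are hypothetical; note `axis=1`, which makes `apply` pass each row (rather than each column) to the function.

```python
import numpy as np
import pandas as pd

# Assumed sample data; the last row deliberately has no code.
df = pd.DataFrame(
    {
        "lev1": ["RD21", np.nan, np.nan],
        "lev2": [np.nan, "RD32", np.nan],
        "description": ["Oil", "Oil/Canola", "Unknown"],
    }
)
cols = ["lev1", "lev2"]


def first_code(row):
    """Return the first non-null value in the row, or NaN if there is none."""
    values = row.dropna()
    return values.iloc[0] if not values.empty else np.nan


df["code"] = df[cols].apply(first_code, axis=1)
df = df.drop(columns=cols)  # the level columns are no longer needed
print(df)
```

A plain `row.dropna().iloc[0]` would raise an IndexError on the empty row, which may actually be preferable if such rows indicate bad data.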
Select the columns whose names contain lev and replace NaN with a back fill along the columns. Keep the first level, lev1, and concat it to your other column(s).
>>> pd.concat([df.filter(like='lev').bfill(axis='columns')['lev1'].rename('code'),
df['description']], axis="columns")
code description
0 RD21 Oil
1 RD32 Oil/Canola
2 RD33 Oil/Canola/Wheat
3 RD34 Oil/Canola/Flour
4 RD55 Oil/Canola/Flour/Thick
5 ED54 Rice
6 ED66 Rice/White
7 ED88 Rice/White/Jasmine
8 ED89 Rice/White/Basmati
9 ED68 Rice/Brown
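To see why back-filling works, here is a small sketch on assumed sample data: after `bfill` along the columns, each row's single code has been copied leftwards into every earlier column, so `lev1` always ends up holding it.

```python
import numpy as np
import pandas as pd

# Assumed sample data, modelled on the question.
df = pd.DataFrame(
    {
        "lev1": ["RD21", np.nan, np.nan],
        "lev2": [np.nan, "RD32", np.nan],
        "lev3": [np.nan, np.nan, "RD33"],
        "description": ["Oil", "Oil/Canola", "Oil/Canola/Wheat"],
    }
)

# Back-fill across each row: values move leftwards into the NaN cells.
filled = df.filter(like="lev").bfill(axis="columns")
print(filled)

# The first column now contains the code for every row.
out = pd.concat([filled["lev1"].rename("code"), df["description"]], axis="columns")
print(out)
```

If a row held more than one code, this would keep the leftmost one.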
or using melt:
>>> df.melt('description', value_name='code') \
.dropna().drop(columns='variable') \
.reset_index(drop=True) \
[['code', 'description']]
code description
0 RD21 Oil
1 ED54 Rice
2 RD32 Oil/Canola
3 ED66 Rice/White
4 ED68 Rice/Brown
5 RD33 Oil/Canola/Wheat
6 RD34 Oil/Canola/Flour
7 ED88 Rice/White/Jasmine
8 ED89 Rice/White/Basmati
9 RD55 Oil/Canola/Flour/Thick
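The melted result comes out ordered by level column rather than by original row. If the original order matters, one option (a sketch on assumed sample data, using `melt`'s `ignore_index=False`, available in pandas 1.1 and later) is to keep the original index and sort on it:

```python
import numpy as np
import pandas as pd

# Assumed sample data; rows 0 and 2 both have their code in lev1.
df = pd.DataFrame(
    {
        "lev1": ["RD21", np.nan, "ED54"],
        "lev2": [np.nan, "RD32", np.nan],
        "description": ["Oil", "Oil/Canola", "Rice"],
    }
)

out = (
    df.melt("description", value_name="code", ignore_index=False)
    .dropna(subset=["code"])       # drop the NaN placeholders
    .drop(columns="variable")      # the source column name is not needed
    .sort_index()                  # restore the original row order
    [["code", "description"]]
)
print(out)
```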
With your dataframe, I'd first make sure that Nan is an actual np.nan and not a string saying 'Nan'. Then I'd want to make sure that the missing values are imputed as empty strings. Thus,
df.replace('Nan', np.nan, inplace=True)
df.fillna('', inplace=True)
Afterwards,
df['code'] = df['lev1'] + df['lev2'] + df['lev3'] + df['lev4'] + df['lev5']
And then df.drop(columns=[s for s in df.columns if s.startswith('lev')], inplace=True) to dispose of the old columns.
Note that this works only under the assumption given in OP's comment that each row has exactly one code across the five columns and the others are all NaN.
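The same concatenation can be sketched end to end as below. The sample data is assumed, and a row-wise `"".join` is swapped in for the chain of `+` so the column list does not have to be spelled out; the result is only meaningful under the one-code-per-row assumption.

```python
import numpy as np
import pandas as pd

# Assumed sample where missing values arrived as the literal string 'Nan'.
df = pd.DataFrame(
    {
        "lev1": ["RD21", "Nan", "Nan"],
        "lev2": ["Nan", "RD32", "Nan"],
        "lev3": ["Nan", "Nan", "RD33"],
        "description": ["Oil", "Oil/Canola", "Oil/Canola/Wheat"],
    }
)
cols = [c for c in df.columns if c.startswith("lev")]

# Turn the 'Nan' strings into real missing values, then into empty strings.
levels = df[cols].replace("Nan", np.nan).fillna("")

# Joining each row's strings leaves just the single code.
df["code"] = levels.apply(lambda row: "".join(row), axis=1)
df = df.drop(columns=cols)
print(df)
```

If a row did contain two codes, they would be glued together (for example `RD21RD32`), so this approach is worth pairing with a check on the data first.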