
Drop only NaN values from a row in a dataframe

I have a dataframe which looks something like this:

Df
lev1    lev2   lev3    lev4   lev5   description
RD21    Nan    Nan     Nan    Nan    Oil
Nan     RD32   Nan     Nan    Nan    Oil/Canola
Nan     Nan    RD33    Nan    Nan    Oil/Canola/Wheat
Nan     Nan    RD34    Nan    Nan    Oil/Canola/Flour
Nan     Nan    Nan     RD55   Nan    Oil/Canola/Flour/Thick
ED54    Nan    Nan     Nan    Nan    Rice
Nan     ED66   Nan     Nan    Nan    Rice/White
Nan     Nan    ED88    Nan    Nan    Rice/White/Jasmine
Nan     Nan    ED89    Nan    Nan    Rice/White/Basmati
Nan     ED68   Nan     Nan    Nan    Rice/Brown

I want to remove all the NaN values and keep only the non-NaN values, something like this:

DF2
code     description
RD21     Oil
RD32     Oil/Canola
RD33     Oil/Canola/Wheat
RD34     Oil/Canola/Flour
RD55     Oil/Canola/Flour/Thick
.
.
.

How do I do this? I tried using the notna() method, but it returns a boolean dataframe rather than the values themselves. Any help would be appreciated.

You can use stack and groupby like this to find the first non-null value in each row:

df['code'] = df[['lev1', 'lev2', 'lev3', 'lev4', 'lev5']].stack().groupby(level=0).first().reindex(df.index)

Now you can select the code column and the description column:

df[['code', 'description']]


   code             description
0  RD21                     Oil
1  RD32              Oil/Canola
2  RD33        Oil/Canola/Wheat
3  RD34        Oil/Canola/Flour
4  RD55  Oil/Canola/Flour/Thick
5  ED54                    Rice
6  ED66              Rice/White
7  ED88      Rice/White/Jasmine
8  ED89      Rice/White/Basmati
9  ED68              Rice/Brown
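To make the stack/groupby step easier to follow, here is a minimal self-contained sketch. The four-row frame is a made-up subset of the question's data (the last row, with no code at all, is my own addition), and level_cols is just an illustrative name. It shows why the final reindex is there: a row without any code still comes back, with NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the question's frame, using real NaN values.
# The last row has no code at all (an assumption added for illustration).
df = pd.DataFrame(
    {
        "lev1": ["RD21", np.nan, np.nan, np.nan],
        "lev2": [np.nan, "RD32", np.nan, np.nan],
        "lev3": [np.nan, np.nan, "RD33", np.nan],
        "description": ["Oil", "Oil/Canola", "Oil/Canola/Wheat", "Unknown"],
    }
)

level_cols = ["lev1", "lev2", "lev3"]

# stack() reshapes the level columns into one long Series indexed by
# (row label, column name); groupby(level=0).first() then picks the first
# non-null entry per original row; reindex() restores any row that ended
# up without an entry.
df["code"] = (
    df[level_cols]
    .stack()
    .groupby(level=0)
    .first()
    .reindex(df.index)
)

print(df[["code", "description"]])
```

If a row could hold more than one code, first() would silently keep only the leftmost one, so it may be worth checking that assumption on real data.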

We can mask by notna(). Note that this relies on every row having exactly one non-NaN value among the level columns:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        'l1': [np.nan, 5],
        'l2': [6, np.nan],
        'd': ['a', 'b']
     }
)

notna = df1[['l1', 'l2']].notna().values
notna_values = df1[['l1', 'l2']].values[notna]
print(notna_values)

df2 = pd.DataFrame(df1['d'])
df2['code'] = notna_values

print(df2)

Output:

   d  code
0  a   6.0
1  b   5.0
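Boolean indexing on the underlying array flattens the selected values row by row, so the result only lines up with the rows when each row contributes exactly one value. A hedged variant of the snippet above that checks this first might look like the following (level_cols and the error message are my own choices, not part of the original answer):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "l1": [np.nan, 5],
        "l2": [6, np.nan],
        "d": ["a", "b"],
    }
)

level_cols = ["l1", "l2"]
mask = df1[level_cols].notna()

# Guard: the flattened selection below only matches the rows one-to-one
# if every row has exactly one non-NaN level value.
if not (mask.sum(axis=1) == 1).all():
    raise ValueError("expected exactly one non-NaN value per row")

# Select the non-NaN cells; the result is a 1-D array in row order.
codes = df1[level_cols].to_numpy()[mask.to_numpy()]

df2 = df1[["d"]].copy()
df2["code"] = codes
print(df2)
```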

You can apply a function over every row of df[cols] (the sub-frame of problematic columns), dropping every NaN and taking the one value that remains. Pass axis=1 so the function receives rows rather than columns.

>>> cols = "lev1    lev2   lev3    lev4   lev5".split()
>>> df["code"] = df[cols].apply(lambda row: row.dropna().iloc[0], axis=1)

You can also drop the original columns with drop if you no longer need them.

>>> df.drop(columns=cols, inplace=True)
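As a runnable sketch of the row-wise approach: the frame below is a made-up subset of the question's data, and first_non_null is a hypothetical helper name. Unlike the one-line lambda, it returns NaN for a row with no code instead of raising IndexError.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row subset of the question's frame.
df = pd.DataFrame(
    {
        "lev1": ["RD21", np.nan, np.nan],
        "lev2": [np.nan, "RD32", np.nan],
        "lev3": [np.nan, np.nan, "RD33"],
        "description": ["Oil", "Oil/Canola", "Oil/Canola/Wheat"],
    }
)

cols = ["lev1", "lev2", "lev3"]


def first_non_null(row):
    """Return the first non-NaN value in the row, or NaN if there is none."""
    valid = row.dropna()
    return valid.iloc[0] if not valid.empty else np.nan


# axis=1 makes apply pass each row (not each column) to the function.
df["code"] = df[cols].apply(first_non_null, axis=1)

# Drop the level columns; reassigning avoids inplace=True.
df = df.drop(columns=cols)
print(df)
```

Row-wise apply is easy to read but tends to be slower than the vectorised alternatives on large frames.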

Select the columns whose names contain lev and back-fill the NaN values along each row. Then keep the first level column, lev1, and concatenate it with your other column(s).

>>> pd.concat([df.filter(like='lev').bfill(axis='columns')['lev1'].rename('code'),
               df['description']], axis="columns")

   code             description
0  RD21                     Oil
1  RD32              Oil/Canola
2  RD33        Oil/Canola/Wheat
3  RD34        Oil/Canola/Flour
4  RD55  Oil/Canola/Flour/Thick
5  ED54                    Rice
6  ED66              Rice/White
7  ED88      Rice/White/Jasmine
8  ED89      Rice/White/Basmati
9  ED68              Rice/Brown
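To see what the back fill is doing, this small sketch prints the intermediate frame before picking the first column. The data is a made-up subset of the question's frame, and filled and result are illustrative names only.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row subset of the question's frame.
df = pd.DataFrame(
    {
        "lev1": ["RD21", np.nan, np.nan],
        "lev2": [np.nan, "RD32", np.nan],
        "lev3": [np.nan, np.nan, "RD33"],
        "description": ["Oil", "Oil/Canola", "Oil/Canola/Wheat"],
    }
)

# Back-fill along each row: every NaN takes the next non-NaN value to its
# right, so the leftmost column ends up holding each row's first code.
filled = df.filter(like="lev").bfill(axis="columns")
print(filled)

result = pd.concat(
    [filled.iloc[:, 0].rename("code"), df["description"]],
    axis="columns",
)
print(result)
```

Note that filter(like="lev") matches any column whose name contains "lev", so an explicit column list may be safer if other columns share that substring.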

or using melt (the rows come out grouped by level column rather than in their original order):

>>> df.melt('description', value_name='code') \
      .dropna().drop(columns='variable') \
      .reset_index(drop=True) \
      [['code', 'description']]

   code             description
0  RD21                     Oil
1  ED54                    Rice
2  RD32              Oil/Canola
3  ED66              Rice/White
4  ED68              Rice/Brown
5  RD33        Oil/Canola/Wheat
6  RD34        Oil/Canola/Flour
7  ED88      Rice/White/Jasmine
8  ED89      Rice/White/Basmati
9  RD55  Oil/Canola/Flour/Thick
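If the original row order matters, one possible variation is to keep the original index through the melt and sort by it afterwards. This is a sketch under the assumption that pandas 1.1 or later is available (that is where the ignore_index parameter of melt was added); the data is a made-up subset and tidy is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row subset of the question's frame.
df = pd.DataFrame(
    {
        "lev1": ["RD21", np.nan, np.nan],
        "lev2": [np.nan, "RD32", np.nan],
        "lev3": [np.nan, np.nan, "RD33"],
        "description": ["Oil", "Oil/Canola", "Oil/Canola/Wheat"],
    }
)

tidy = (
    # ignore_index=False keeps each row's original label on its melted rows.
    df.melt(id_vars=["description"], value_name="code", ignore_index=False)
    .dropna(subset=["code"])      # discard the (row, level) pairs with no code
    .drop(columns="variable")     # the level column name is no longer needed
    .sort_index()                 # restore the original row order
    [["code", "description"]]
)
print(tidy)
```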

With your dataframe, I'd first make sure that Nan is an actual np.nan and not the string 'Nan'. Then I'd fill the missing values with empty strings. Thus,

df.replace('Nan', np.nan, inplace=True)
df.fillna('', inplace=True)

Afterwards,

df['code'] = df['lev1'] + df['lev2'] + df['lev3'] + df['lev4'] + df['lev5']

And then df.drop(columns=[s for s in df.columns if s.startswith('lev')], inplace=True) to dispose of the old columns.

Note that this works only under the assumption given in OP's comment that each row has exactly one code across the five columns and the others are all NaN. If a row held two codes, they would be concatenated into one string.
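Putting the steps of this answer together, a minimal sketch could look like this. It assumes the missing cells arrive as the literal string 'Nan', as in the question's printout; the frame is a made-up subset, and level_cols and levels are illustrative names. The check before the join makes the one-code-per-row assumption explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical subset where missing cells are the literal string 'Nan'.
df = pd.DataFrame(
    {
        "lev1": ["RD21", "Nan", "Nan"],
        "lev2": ["Nan", "RD32", "Nan"],
        "lev3": ["Nan", "Nan", "RD33"],
        "description": ["Oil", "Oil/Canola", "Oil/Canola/Wheat"],
    }
)

level_cols = [c for c in df.columns if c.startswith("lev")]

# Turn the placeholder strings into real missing values.
levels = df[level_cols].replace("Nan", np.nan)

# String concatenation is only meaningful if each row holds exactly one code.
if not (levels.notna().sum(axis=1) == 1).all():
    raise ValueError("expected exactly one code per row")

# With NaN replaced by '', joining a row's cells leaves just its one code.
df["code"] = levels.fillna("").apply("".join, axis=1)

df = df.drop(columns=level_cols)
print(df)
```

Working on a copy of the level columns keeps fillna('') from touching description or any other column.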
