How to groupby df according to two column values and handling missing values in pandas?

I want to split a dataset with ~1 million rows based on an ID that occurs multiple times and a mode type (active, inactive). When splitting, missing values in numeric columns should be interpolated and categorical values should be filled with ffill. Finally, rows with remaining null (NaN) values are to be dropped. For this I have written the following function:

objectList = list(df_sorted.select_dtypes(include=["O", "datetime64[ns]"]).columns)
floatList = list(df_sorted.select_dtypes(include=["float64"]).columns)

def fill_missing_values(df_group):
    # Forward-fill categorical/datetime columns
    df_group[objectList] = df_group[objectList].ffill()
    # Linearly interpolate numeric columns
    df_group[floatList] = df_group[floatList].interpolate(
        method="linear", limit_direction="forward"
    )
    # dropna() returns a new DataFrame, so the result must be reassigned
    df_group = df_group.dropna()
    return df_group

The function is now to be applied as follows:

df_nn = df_sorted.groupby(["ID", "Mode"]).apply(fill_missing_values)

The cell is executed without errors, but the output takes way too long. So my question is: Is this approach overall correct or am I missing something? And how can this code get more performance?

Input Data

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["0A", "0A", "0A", "0A", "0A", "1C", "1C", "1C", "1C"],
        "MODE": [
            "active", "active", "active", "inactive", "inactive",
            "active", "active", "active", "inactive",
        ],
        "Signal1": [13, np.nan, 4, 11, np.nan, 22, 25, np.nan, 19],
        # note: some entries are the string "NaN", not np.nan, on purpose
        "Signal2": [np.nan, 0.1, 0.3, "NaN", 4.5, "NaN", 2.0, 3.0, np.nan],
        "Signal3": ["on", np.nan, np.nan, "off", np.nan, "on", np.nan, "on", np.nan],
    }
)

df

    ID  MODE     Signal1  Signal2  Signal3
0   0A  active   13       NaN      on
1   0A  active   NaN      0.1      NaN
2   0A  active   4        0.3      NaN
3   0A  inactive 11       NaN      off
4   0A  inactive NaN      4.5      NaN
5   1C  active   22       NaN      on
6   1C  active   25       2.0      NaN
7   1C  active   NaN      3.0      on
8   1C  inactive 19       NaN      NaN

Desired Output after ffill and interpolation of ID "0A":

    ID  MODE      Signal1     Signal2   Signal3
0   0A  active    13.0        NaN       on
1   0A  active    8.5         0.1       on
2   0A  active    4.0         0.3       on
3   0A  inactive  11.0        NaN       off
4   0A  inactive  11.0        4.5       off

Desired Output after dropna of ID "0A":

    ID  MODE    Signal1  Signal2    Signal3
0   0A  active  8.5      0.1        on
1   0A  active  4.0      0.3        on
    ID  MODE      Signal1    Signal2    Signal3
0   0A  inactive  11         4.5        off

IIUC, you want:

  1. groupby the ID and MODE columns and interpolate all numeric columns
  2. groupby the ID and MODE columns and ffill all non-numeric columns

import numpy as np

#replace string "NaN" with numpy.nan
df = df.replace("NaN", np.nan)

numeric = df.filter(like="Signal").select_dtypes(np.number).columns
others = df.filter(like="Signal").select_dtypes(exclude=np.number).columns

df[numeric] = df.groupby(["ID", "MODE"])[numeric].transform(pd.Series.interpolate, limit_direction="forward")
df[others] = df.groupby(["ID", "MODE"])[others].transform("ffill")

>>> df
   ID      MODE  Signal1  Signal2 Signal3
0  0A    active     13.0      NaN      on
1  0A    active      8.5      0.1      on
2  0A    active      4.0      0.3      on
3  0A  inactive     11.0      NaN     off
4  0A  inactive     11.0      4.5     off
5  1C    active     22.0      NaN      on
6  1C    active     25.0      2.0      on
7  1C    active     25.0      3.0      on
8  1C  inactive     19.0      NaN     NaN

>>> df.dropna()
   ID      MODE  Signal1  Signal2 Signal3
1  0A    active      8.5      0.1      on
2  0A    active      4.0      0.3      on
4  0A  inactive     11.0      4.5     off
6  1C    active     25.0      2.0      on
7  1C    active     25.0      3.0      on
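As a side note, for the non-numeric columns `GroupBy.ffill()` is an equivalent shorthand for the `transform("ffill")` call above. A minimal self-contained sketch (the tiny frame here is illustrative, not the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["0A", "0A", "1C"],
    "Signal3": ["on", np.nan, np.nan],
})

# Forward-fill within each ID group; the leading NaN of group "1C"
# has nothing to fill from and stays NaN
df["Signal3"] = df.groupby("ID")["Signal3"].ffill()
```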

First, fill the missing values in Signal1 with the per-group mean:

df["Signal1"] = df.groupby(["ID", "MODE"])["Signal1"].apply(lambda x: x.fillna(x.mean()))
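A typically faster equivalent avoids the Python-level lambda by computing the group means with `transform` and filling in one vectorized step. A hedged sketch on a tiny illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["0A", "0A", "0A"],
    "MODE": ["active", "active", "active"],
    "Signal1": [13.0, np.nan, 4.0],
})

# transform("mean") broadcasts each group's mean back to the original
# index, so fillna can be applied in a single vectorized pass
group_mean = df.groupby(["ID", "MODE"])["Signal1"].transform("mean")
df["Signal1"] = df["Signal1"].fillna(group_mean)
```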

Next, build a lookup of Signal3 per (ID, MODE) group and merge it back:

signal3 = df[["ID", "MODE", "Signal3"]].dropna().drop_duplicates()
signal3 = signal3.rename(columns={"Signal3": "Signal3_new"})
df2 = pd.merge(df, signal3, how="left", on=["ID", "MODE"])

Then fillna in Signal3 with Signal3_new, or simply use Signal3_new directly.
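That last step is not spelled out above; a minimal sketch of what it could look like, assuming Signal3 takes a single value per (ID, MODE) group (otherwise the dropna/drop_duplicates lookup can yield several rows per group and the merge duplicates rows):

```python
import numpy as np
import pandas as pd

# Illustrative frame in the shape produced by the merge above
df2 = pd.DataFrame({
    "ID": ["0A", "0A"],
    "MODE": ["active", "active"],
    "Signal3": ["on", np.nan],
    "Signal3_new": ["on", "on"],
})

# Fill the gaps in Signal3 from the merged lookup column,
# then drop the helper column
df2["Signal3"] = df2["Signal3"].fillna(df2["Signal3_new"])
df2 = df2.drop(columns="Signal3_new")
```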
