Pandas dataframe replace NaN with a nearest minimum value in column

I have a pandas DataFrame with a column named 'A_col', and I would like to create a new column called 'A_col_fill' that replaces each NaN in 'A_col' with the minimum of the run of values just before it, if there is one. The desired output looks like this:

            A_col           A_col_fill
0            NaN                 NaN
1            NaN                 NaN
2            NaN                 NaN
3            NaN                 NaN
4            NaN                 NaN
5            NaN                 NaN
6            NaN                 NaN
7           -0.3400             -0.3400
8            NaN                -0.3400
9            NaN                -0.3400
10          -0.1900             -0.1900
11            NaN               -0.1900
12          -0.3700             -0.3700
13          -0.4100             -0.4100
14          -0.3300             -0.3300
15            NaN               -0.4100
16            NaN               -0.4100
17            NaN               -0.4100
18            NaN               -0.4100
19            NaN               -0.4100
20          -1.6500             -1.6500
21          -1.8000             -1.8000
22          -1.5300             -1.5300
23          -1.3500             -1.3500
24            NaN               -1.8000
25          -0.1900             -0.1900
26          -0.1400             -0.1400
28          -0.2100             -0.2100

It looks like the DataFrame 'fillna' function doesn't handle this case. How can I implement this? Any code snippets would be highly appreciated!
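To make the question reproducible, the sample data can be rebuilt as follows. This is only a sketch: the values are copied from the table above, and the variable name `df` is an assumption rather than something stated in the question.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Index copied from the table above; note that it skips 27.
index = list(range(27)) + [28]

# Input column, copied from the table.
a_col = (
    [nan] * 7
    + [-0.34, nan, nan, -0.19, nan, -0.37, -0.41, -0.33]
    + [nan] * 5
    + [-1.65, -1.80, -1.53, -1.35, nan, -0.19, -0.14, -0.21]
)

# Desired result, also copied from the table.
a_col_fill = (
    [nan] * 7
    + [-0.34, -0.34, -0.34, -0.19, -0.19, -0.37, -0.41, -0.33]
    + [-0.41] * 5
    + [-1.65, -1.80, -1.53, -1.35, -1.80, -0.19, -0.14, -0.21]
)

df = pd.DataFrame({"A_col": a_col, "A_col_fill": a_col_fill}, index=index)
```

Any candidate solution can then be compared against `df["A_col_fill"]`.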

p['A_col'].fillna(np.inf).replace(np.inf,p['A_col'].ffill().cummin())

output:

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7    -0.34
8    -0.34
9    -0.34
10   -0.19
11   -0.34
12   -0.37
13   -0.41
14   -0.33
15   -0.41
16   -0.41
17   -0.41
18   -0.41
19   -0.41
20   -1.65
21   -1.80
22   -1.53
23   -1.35
24   -1.80
25   -0.19
26   -0.14
28   -0.21

This solution fills each NaN with the minimum value of the last "island" of contiguous rows that contain values. It should be more accurate and performant than the other suggested solutions (at the expense of some added complexity):

  • create a column with a group number for each "island" of contiguous values or nans
  • get min value for each group; forward fill nan rows with previous min
  • fillna of the original column with the new min-per-group column

code:

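# Number each "island": the id increases whenever the NaN/non-NaN state flips
df["group_col"] = np.cumsum(df["A_col"].isna() != df["A_col"].isna().shift())
# Minimum per island (NaN for all-NaN islands), carried forward into the gap that follows
df["group_min"] = df.groupby("group_col").A_col.transform("min").ffill()
# Fill only the NaN rows of the original column
df["output"] = df["A_col"].fillna(df.group_min)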

result:

    A_col  A_col_fill  group_col  group_min  output
0     NaN         NaN          1        NaN     NaN
1     NaN         NaN          1        NaN     NaN
2     NaN         NaN          1        NaN     NaN
3     NaN         NaN          1        NaN     NaN
4     NaN         NaN          1        NaN     NaN
5     NaN         NaN          1        NaN     NaN
6     NaN         NaN          1        NaN     NaN
7   -0.34       -0.34          2      -0.34   -0.34
8     NaN       -0.34          3      -0.34   -0.34
9     NaN       -0.34          3      -0.34   -0.34
10  -0.19       -0.19          4      -0.19   -0.19
11    NaN       -0.19          5      -0.19   -0.19
12  -0.37       -0.37          6      -0.41   -0.37
13  -0.41       -0.41          6      -0.41   -0.41
14  -0.33       -0.33          6      -0.41   -0.33
15    NaN       -0.41          7      -0.41   -0.41
16    NaN       -0.41          7      -0.41   -0.41
17    NaN       -0.41          7      -0.41   -0.41
18    NaN       -0.41          7      -0.41   -0.41
19    NaN       -0.41          7      -0.41   -0.41
20  -1.65       -1.65          8      -1.80   -1.65
21  -1.80       -1.80          8      -1.80   -1.80
22  -1.53       -1.53          8      -1.80   -1.53
23  -1.35       -1.35          8      -1.80   -1.35
24    NaN       -1.80          9      -1.80   -1.80
25  -0.19       -0.19         10      -0.21   -0.19
26  -0.14       -0.14         10      -0.21   -0.14
28  -0.21       -0.21         10      -0.21   -0.21
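If the helper columns are not wanted in the DataFrame, the same steps can be wrapped in a function that works on a Series. This is a minimal sketch of the island approach described above; the function name `fill_with_island_min` is made up for illustration.

```python
import numpy as np
import pandas as pd


def fill_with_island_min(s: pd.Series) -> pd.Series:
    """Fill each NaN with the minimum of the preceding run of non-NaN values."""
    is_na = s.isna()
    # New group id each time the NaN/non-NaN state flips.
    group = (is_na != is_na.shift()).cumsum()
    # Minimum per group (NaN for all-NaN groups), carried forward into the next gap.
    group_min = s.groupby(group).transform("min").ffill()
    # Only the NaN rows of the original Series are replaced.
    return s.fillna(group_min)


s = pd.Series([np.nan, -0.34, np.nan, -0.19, np.nan,
               -0.37, -0.41, -0.33, np.nan, np.nan])
filled = fill_with_island_min(s)
```

Leading NaNs stay NaN because there is no earlier island to take a minimum from.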

The solution takes milliseconds for a 100,000-row df on my machine:

df = pd.DataFrame(np.random.random(size=100000), columns=["A_col"])
df.loc[df.sample(frac=0.6).index, "A_col"] = np.nan
# code from above
df["group_col"] = np.cumsum(df["A_col"].isna() != df["A_col"].isna().shift())
df["group_min"] = df.groupby("group_col").A_col.transform("min").ffill()
df["output"] = df["A_col"].fillna(df.group_min)
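To check the timing on your own machine, something like the following could be used. It is a rough sketch, not the measurement the answer was based on: the fixed seed, the `time.perf_counter` timer and the way NaNs are scattered are all assumptions, and the result will vary with hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (assumption) so runs are repeatable
n_rows = 100_000

df = pd.DataFrame({"A_col": rng.random(n_rows)})
# Blank out roughly 60% of the rows.
df.loc[rng.random(n_rows) < 0.6, "A_col"] = np.nan

start = time.perf_counter()
is_na = df["A_col"].isna()
group = (is_na != is_na.shift()).cumsum()
group_min = df.groupby(group)["A_col"].transform("min").ffill()
df["output"] = df["A_col"].fillna(group_min)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{n_rows} rows filled in {elapsed_ms:.1f} ms")
```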

A simple solution: iterate over the column, keep track of the minimum seen so far, and use it to fill each NaN value.

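import numpy as np

def fill_min(df):
  minx = np.inf  # minimum seen so far; inf means no value has been seen yet
  ans = []
  for val in df['A_col']:
    if np.isnan(val):
      # keep NaN until a first real value has been seen
      ans.append(val if np.isinf(minx) else minx)
    else:
      minx = min(minx, val)
      ans.append(val)
  return ans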

Usage:

df['A_col_fill'] = fill_min(df)
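The same running-minimum fill can probably be written without an explicit loop. The sketch below relies on `Series.cummin` skipping NaN by default; the function name `fill_running_min` is hypothetical. Note that a running minimum covers everything seen so far, so it can differ from the per-island minimum shown in the question (row 11 of the sample would get -0.34 rather than -0.19).

```python
import numpy as np
import pandas as pd


def fill_running_min(s: pd.Series) -> pd.Series:
    # cummin() leaves NaN rows as NaN while the running minimum carries on;
    # ffill() then copies the latest running minimum into those gaps.
    return s.fillna(s.cummin().ffill())


s = pd.Series([np.nan, -0.34, np.nan, -0.19, np.nan, -0.37])
filled = fill_running_min(s)
# Row 4 is filled with -0.34 (the minimum of all earlier values), not -0.19.
```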
