简体   繁体   中英

Fill NaN values from previous column with data

I have a dataframe in pandas, and I am trying to take data from the same row and different columns and fill NaN values in my data. How would I do this in pandas?

For example,

      1     2   3     4     5   6   7  8  9  10  11    12    13  14    15    16
83  27.0  29.0 NaN  29.0  30.0 NaN NaN  15.0 16.0  17.0 NaN  28.0  30.0 NaN  28.0  18.0

The goal is for the data to look like this:

      1     2   3     4     5   6   7  ...    10  11    12    13  14    15    16
83  NaN  NaN NaN  27.0  29.0 29.0 30.0  ...  15.0 16.0  17.0  28.0 30.0  28.0  18.0

The goal is to be able to take the mean of the last five columns that have data. If there are not >= 5 data-filled cells, then take the average of however many cells there are.

Assuming you need to move all NaN to the first columns I would define a function that takes all NaN and places them first and leave the rest as it is:

def fun(row):
    index_order = row.index[row.isnull()].append(row.index[~row.isnull()])
    row.iloc[:] = row[index_order].values
    return row

df_fix = df.loc[:,df.columns[1:]].apply(fun, axis=1)

If you need to overwrite the results in the same dataframe then:

df.loc[:,df.columns[1:]] = df_fix.copy()

Use function justify for improve performance with filter all columns without first by DataFrame.iloc :

print (df)
   name     1     2   3     4     5   6   7     8     9    10  11    12    13  \
80  bob  27.0  29.0 NaN  29.0  30.0 NaN NaN  15.0  16.0  17.0 NaN  28.0  30.0   

    14    15    16  
80 NaN  28.0  18.0  


df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan,  side='right')
print (df)
   name   1   2   3   4   5     6     7     8     9    10    11    12    13  \
80  bob NaN NaN NaN NaN NaN  27.0  29.0  29.0  30.0  15.0  16.0  17.0  28.0   

      14    15    16  
80  30.0  28.0  18.0  

Function:

#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):    
    """
    Justifies a 2D array

    Parameters
    ----------
    A : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val) 
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

Performance :

#100 rows
df = pd.concat([df] * 100, ignore_index=True)

#41 times slowier
In [39]: %timeit df.loc[:,df.columns[1:]] =  df.loc[:,df.columns[1:]].apply(fun, axis=1)
145 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [41]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan,  side='right')
3.54 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#1000 rows
df = pd.concat([df] * 1000, ignore_index=True)

#198 times slowier
In [43]: %timeit df.loc[:,df.columns[1:]] =  df.loc[:,df.columns[1:]].apply(fun, axis=1)
1.13 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [45]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan,  side='right')
5.7 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM