How to fill NaN values based on previous columns

I have an initial column (A) with no missing data but with repeated values. How do I fill the missing data in the next column (B) so that each value in A is always paired with the same value in B? I would also like any other columns (C) to remain unchanged.

For example, this is what I have:

    A    B     C
1   1    20    4
2   2    NaN   8
3   3    NaN   2
4   2    30    9
5   3    40    1
6   1    NaN   3

And this is what I want:

    A    B     C
1   1    20    4
2   2    30*   8
3   3    40*   2
4   2    30    9
5   3    40    1
6   1    20*   3

Asterisks mark the filled values.

This needs to scale to a very large dataframe.

Additionally, if a value in the left column had more than one value in the right column across separate observations, how would I fill with the mean?

You can use groupby on 'A' and then first to find the first corresponding value in 'B' (first skips NaN values).

import pandas as pd

df = pd.DataFrame({'A':[1,2,3,2,3,1], 
                   'B':[20, None, None, 30, 40, None], 
                   'C': [4,8,2,9,1,3]})

# find first 'B' value for each 'A'
lookup = df[['A', 'B']].groupby('A').first()['B']

# only use rows where 'B' is NaN
nan_mask = df['B'].isnull()

# replace NaN values in 'B' with lookup values; a single .loc call
# avoids chained assignment, which may fail to modify df
df.loc[nan_mask, 'B'] = df.loc[nan_mask, 'A'].map(lookup)

print(df)

Which outputs:

   A     B  C
0  1  20.0  4
1  2  30.0  8
2  3  40.0  2
3  2  30.0  9
4  3  40.0  1
5  1  20.0  3
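
For a very large dataframe, the row-wise apply can be avoided entirely. The following is a minimal sketch, not part of the original answer, that assumes the same sample data and uses groupby with transform('first') to broadcast each group's first non-NaN 'B' back to every row; the name first_b is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 2, 3, 1],
                   'B': [20, None, None, 30, 40, None],
                   'C': [4, 8, 2, 9, 1, 3]})

# For every row, the first non-NaN 'B' found in that row's 'A' group
first_b = df.groupby('A')['B'].transform('first')

# Keep existing 'B' values and fill only the gaps
df['B'] = df['B'].fillna(first_b)

print(df)
```

Because transform returns a Series aligned with the original index, no lookup table or mask is needed, and column 'C' is left untouched.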

If there are many NaN values in 'B', you might want to exclude them before you use groupby.

import pandas as pd

df = pd.DataFrame({'A':[1,2,3,2,3,1], 
                   'B':[20, None, None, 30, 40, None], 
                   'C': [4,8,2,9,1,3]})

# Mark rows where 'B' is NaN
nan_mask = df['B'].isnull()

# Find first 'B' value for each 'A', using only rows where 'B' is present
lookup = df[~nan_mask][['A', 'B']].groupby('A').first()['B']

# Fill the NaN rows in a single .loc call (avoids chained assignment)
df.loc[nan_mask, 'B'] = df.loc[nan_mask, 'A'].map(lookup)

print(df)
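
For the follow-up about filling with the mean when one 'A' value has several different 'B' values, a possible sketch is shown here. The extra row (A=2, B=50, C=7) is hypothetical and was added only so that one group has more than one observed value; transform('mean') ignores NaN when averaging.

```python
import pandas as pd

# Hypothetical data: A == 2 is observed with both B == 30 and B == 50
df = pd.DataFrame({'A': [1, 2, 3, 2, 3, 1, 2],
                   'B': [20, None, None, 30, 40, None, 50],
                   'C': [4, 8, 2, 9, 1, 3, 7]})

# Mean of the non-NaN 'B' values in each 'A' group, broadcast to every row
group_mean = df.groupby('A')['B'].transform('mean')

# Fill only the missing entries; observed values are left as they are
df['B'] = df['B'].fillna(group_mean)

print(df)
```

If an 'A' value never appears with a 'B' value, its group mean is NaN and those rows stay unfilled.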

You could call sort_values first and then forward fill column B within the ordering given by column A. One way to implement this is:

import pandas as pd
import numpy as np

x = {'A':[1,2,3,2,3,1],
     'B':[20,np.nan,np.nan,30,40,np.nan],
     'C':[4,8,2,9,1,3]}

df = pd.DataFrame(x)

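# Sort by 'A' and then 'B': within each 'A' value, NaN rows come last
# (sort_values puts NaN at the end by default), so a forward fill copies
# the last known 'B' of that 'A' into them.
# Assigning back aligns on the index, so the original row order is kept.
# Caveat: an 'A' value with no known 'B' would inherit the previous group's value.
df['B'] = df.sort_values(by=['A', 'B'])['B'].ffill()
print(df)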

Output will be:

Original data:

   A     B  C
0  1  20.0  4
1  2   NaN  8
2  3   NaN  2
3  2  30.0  9
4  3  40.0  1
5  1   NaN  3

Updated data:

   A     B  C
0  1  20.0  4
1  2  30.0  8
2  3  40.0  2
3  2  30.0  9
4  3  40.0  1
5  1  20.0  3
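
One thing to watch with the sort-then-ffill approach: a plain forward fill does not know where one 'A' group ends and the next begins. The sketch below is not from the original answer and uses modified sample data in which A == 3 never has a 'B' value (an assumption made for illustration). It compares the plain fill with a per-group fill using groupby('A')['B'].ffill().

```python
import pandas as pd
import numpy as np

# Modified sample data: A == 3 has no known 'B' value at all
df = pd.DataFrame({'A': [1, 2, 3, 2, 3, 1],
                   'B': [20, np.nan, np.nan, 30, np.nan, np.nan],
                   'C': [4, 8, 2, 9, 1, 3]})

ordered = df.sort_values(by=['A', 'B'])

# Plain forward fill: rows with A == 3 inherit 30.0 from the A == 2 group
leaked = ordered['B'].ffill().sort_index()

# Forward fill within each 'A' group: rows with A == 3 stay NaN
grouped = ordered.groupby('A')['B'].ffill().sort_index()

print(pd.DataFrame({'A': df['A'],
                    'plain_ffill': leaked,
                    'grouped_ffill': grouped}))
```

Leaving those rows as NaN is usually the safer outcome, since it makes the missing information visible instead of silently borrowing a value from an unrelated group.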
