Backfill values in Pandas series when value matches another column
I have a DataFrame like this:
import numpy as np
import pandas as pd
raw_data = {'surface': [np.nan, np.nan, 'round', 'square'],
            'city': ['San Francisco', 'Miami', 'San Francisco', 'Miami']}
df = pd.DataFrame(raw_data, columns=['surface', 'city'])
It looks like this:
surface city
0 NaN San Francisco
1 NaN Miami
2 round San Francisco
3 square Miami
I need the earliest San Francisco row to be filled with 'round', and the earlier Miami row to be filled with 'square'. Using .fillna(method='bfill') doesn't take the other column's values into account; it just fills all earlier rows with 'round'.
The desired result is:
surface city
0 round San Francisco
1 square Miami
2 round San Francisco
3 square Miami
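For reference, the problem described above can be reproduced with a short sketch. It assumes the four-row frame from the question and uses Series.bfill(), treated here as the equivalent of .fillna(method='bfill') (the method= form is deprecated in recent pandas releases). The name plain is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'surface': [np.nan, np.nan, 'round', 'square'],
    'city': ['San Francisco', 'Miami', 'San Francisco', 'Miami'],
})

# A plain back fill only looks at row order, not at the city column,
# so both leading NaNs pick up the next value below them: 'round'.
plain = df['surface'].bfill()
print(plain.tolist())  # ['round', 'round', 'round', 'square']
```

Row 1 (Miami) ends up as 'round' instead of 'square', which is the behaviour the question wants to avoid.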
You can use groupby.bfill: group the data frame by the city column and then apply bfill:
df.groupby('city').bfill()
# surface city
#0 round San Francisco
#1 square Miami
#2 round San Francisco
#3 square Miami
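Depending on the pandas version, the frame returned by df.groupby('city').bfill() may not include the city column. A variant that should be less sensitive to this, sketched under the assumption that only surface needs filling, selects that column and assigns the result back:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'surface': [np.nan, np.nan, 'round', 'square'],
    'city': ['San Francisco', 'Miami', 'San Francisco', 'Miami'],
})

# Back fill 'surface' separately within each city.
# The result keeps the original index, so it can be assigned straight back.
df['surface'] = df.groupby('city')['surface'].bfill()
print(df['surface'].tolist())  # ['round', 'square', 'round', 'square']
```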
[Modified based on the admirable answer from PSidom]
Using groupby() is indeed the key point, but it may be confusing not to mention what bfill() does, because it may not do what you expect.
Let's take a quick look at the documentation. Rather than filling every gap, as the OP seems to want, bfill() only replaces a missing value with the next non-missing value below it (the next valid row, not the next column). It works well with groupby() in this case, but if your data are more complicated you also need groupby('*your group*').ffill() to forward fill.
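The difference between the two directions is easiest to see on a single column. This is a minimal sketch on a made-up Series (s, back and forward are illustrative names, not part of the original data):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 'a', np.nan, 'b', np.nan])

back = s.bfill()     # each NaN takes the next valid value below it
forward = s.ffill()  # each NaN takes the last valid value above it

# back:    a, a, b, b, NaN  (nothing follows the last row, so it stays missing)
# forward: NaN, a, a, b, b  (nothing precedes the first row, so it stays missing)
print(back.isna().tolist())     # [False, False, False, False, True]
print(forward.isna().tolist())  # [True, False, False, False, False]
```

Neither direction alone covers both a leading and a trailing gap, which is why the two are combined further down.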
For further illustration, let's modify your data like this:
import numpy as np
import pandas as pd
raw_data = {'surface': [np.nan, np.nan, 'round', 'square', np.nan, np.nan, np.nan, np.nan],
'city': ['San Francisco', 'Miami', 'San Francisco', 'Miami', 'Miami', 'Miami', 'San Francisco', 'Miami']}
df = pd.DataFrame(raw_data, columns = ['surface', 'city'])
df
# surface city
#0 NaN San Francisco
#1 NaN Miami
#2 round San Francisco
#3 square Miami
#4 NaN Miami
#5 NaN Miami
#6 NaN San Francisco
#7 NaN Miami
With only df.groupby('city').bfill(), you'll get:
df2 = df.groupby('city').bfill()
df2
# surface city
#0 round San Francisco
#1 square Miami
#2 round San Francisco
#3 square Miami
#4 NaN Miami
#5 NaN Miami
#6 NaN San Francisco
#7 NaN Miami
See what is going on there? bfill() did the job in rows 0 and 1, but left rows 4 to 7 unchanged, because no non-missing value follows them within their group. You should use both bfill() and ffill() instead. Maybe something like this:
df3 = df2.groupby('city').ffill()
df3
# surface city
#0 round San Francisco
#1 square Miami
#2 round San Francisco
#3 square Miami
#4 square Miami
#5 square Miami
#6 round San Francisco
#7 square Miami
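The two steps can also be combined in one pass. The following is a sketch rather than the answer's own code: it assumes the eight-row frame from this answer and uses transform with a small lambda so that both fills stay inside each city group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'surface': [np.nan, np.nan, 'round', 'square',
                np.nan, np.nan, np.nan, np.nan],
    'city': ['San Francisco', 'Miami', 'San Francisco', 'Miami',
             'Miami', 'Miami', 'San Francisco', 'Miami'],
})

# For each city, back fill first and then forward fill what is still missing.
# transform returns a Series aligned with the original index.
df['surface'] = (
    df.groupby('city')['surface']
      .transform(lambda s: s.bfill().ffill())
)
print(df['surface'].tolist())
# ['round', 'square', 'round', 'square', 'square', 'square', 'round', 'square']
```

Selecting the surface column explicitly also avoids relying on whether the grouping column survives in the intermediate result.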
Note that you shouldn't use something like df.groupby('city').bfill().ffill(). The trailing ffill() is applied to the whole frame rather than per city, so it can fill values across groups incorrectly.
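To see what goes wrong with the chained form, here is a small sketch on the eight-row frame from this answer. It selects the surface column explicitly (an assumption made so the example does not depend on whether city is kept in the grouped result); grouped_back and chained are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'surface': [np.nan, np.nan, 'round', 'square',
                np.nan, np.nan, np.nan, np.nan],
    'city': ['San Francisco', 'Miami', 'San Francisco', 'Miami',
             'Miami', 'Miami', 'San Francisco', 'Miami'],
})

# Back fill per city: rows 4 to 7 are still missing after this step.
grouped_back = df.groupby('city')['surface'].bfill()

# This ffill() runs on the whole Series and ignores city boundaries.
chained = grouped_back.ffill()
print(chained.tolist())
# ['round', 'square', 'round', 'square', 'square', 'square', 'square', 'square']
print(df.loc[6, 'city'], chained.loc[6])  # San Francisco square
```

Row 6 belongs to San Francisco but receives 'square' from the Miami row above it, which is the kind of incorrect fill the warning refers to.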