pandas: remove rows if condition is met
I have a dataframe like:
date airport_id plane_type runway
2020-01-01 11 333 3
2020-01-01 11 222 3
2020-01-02 11 333 3
2020-01-02 11 222 3
2020-01-03 11 333 3
2020-01-04 11 222 3
2020-01-01 12 222 3
2020-01-01 12 345 4
On a given date, no two plane types (plane_type) may be present if they share the same runway; in that case, the row with the larger plane_type should be removed.
Expected output:
date airport_id plane_type runway
2020-01-01 11 222 3
2020-01-02 11 222 3
2020-01-03 11 333 3
2020-01-04 11 222 3
2020-01-01 12 222 3
2020-01-01 12 345 4
Any help would be very much appreciated! Thank you
It appears that you want to keep the smallest plane_type for each combination of date, airport_id and runway. You could do that via a groupby statement as follows:
result = (
df.groupby(['date', 'airport_id', 'runway'], as_index=False)['plane_type'].min()
.sort_values(['airport_id', 'runway'])
)
>>> result
date airport_id runway plane_type
0 2020-01-01 11 3 222
3 2020-01-02 11 3 222
4 2020-01-03 11 3 333
5 2020-01-04 11 3 222
1 2020-01-01 12 3 222
2 2020-01-01 12 4 345
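For anyone who wants to reproduce this, here is a minimal self-contained sketch. The frame is rebuilt by hand from the sample in the question; keeping the dates as plain strings is my assumption, since the question does not show the dtypes.

```python
import pandas as pd

# Sample data copied from the question; dates kept as strings (assumption).
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02",
             "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-01"],
    "airport_id": [11, 11, 11, 11, 11, 11, 12, 12],
    "plane_type": [333, 222, 333, 222, 333, 222, 222, 345],
    "runway": [3, 3, 3, 3, 3, 3, 3, 4],
})

# Smallest plane_type per (date, airport_id, runway) group.
result = (
    df.groupby(["date", "airport_id", "runway"], as_index=False)["plane_type"].min()
    .sort_values(["airport_id", "runway"])
)
print(result)
```

Note that this aggregates rather than filters, so the result only contains the grouping columns and plane_type, and the original row index is lost.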
You can then merge additional columns (e.g. city and country) back to this result, assuming that the values are unique for the given merge key.
result.merge(df, on=['date', 'airport_id', 'runway', 'plane_type'])
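As a rough illustration of that merge, here is a sketch with an invented city column (the column and its values "A" and "B" are hypothetical, not from the question). Passing validate="one_to_one" makes pandas raise an error if the merge key turns out not to be unique, which checks the uniqueness assumption explicitly.

```python
import pandas as pd

# Sample data from the question plus a hypothetical "city" column.
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02",
             "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-01"],
    "airport_id": [11, 11, 11, 11, 11, 11, 12, 12],
    "plane_type": [333, 222, 333, 222, 333, 222, 222, 345],
    "runway": [3, 3, 3, 3, 3, 3, 3, 4],
    "city": ["A"] * 6 + ["B"] * 2,  # invented values for illustration
})

keys = ["date", "airport_id", "runway"]
result = df.groupby(keys, as_index=False)["plane_type"].min()

# Inner merge on the group keys plus plane_type brings the extra column back.
# validate="one_to_one" raises MergeError if the key is duplicated on either side.
merged = result.merge(df, on=keys + ["plane_type"], validate="one_to_one")
print(merged)
```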
From your expected output, I see that a requirement concerning airport_id should be added: rows are compared only within the same date, airport_id and runway, and in each such group only the rows with the smallest plane_type are kept.
To generate this result, run:
result = df.groupby(['date', 'airport_id', 'runway'], as_index=False,
sort=False).apply(lambda grp: grp[grp.plane_type == grp.plane_type.min()])\
.reset_index(level=0, drop=True)
The result is:
date airport_id plane_type runway
1 2020-01-01 11 222 3
3 2020-01-02 11 222 3
4 2020-01-03 11 333 3
5 2020-01-04 11 222 3
6 2020-01-01 12 222 3
7 2020-01-01 12 345 4
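If apply is a concern, the same filter can also be written as a boolean mask built with transform("min"). This is a different technique from the apply-based code above, offered only as a sketch; the frame is rebuilt from the question's sample with dates as plain strings (my assumption).

```python
import pandas as pd

# Sample data copied from the question; dates kept as strings (assumption).
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02",
             "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-01"],
    "airport_id": [11, 11, 11, 11, 11, 11, 12, 12],
    "plane_type": [333, 222, 333, 222, 333, 222, 222, 345],
    "runway": [3, 3, 3, 3, 3, 3, 3, 4],
})

# Broadcast each group's minimum plane_type back onto every row of the group.
group_min = df.groupby(["date", "airport_id", "runway"])["plane_type"].transform("min")

# Keep only the rows whose plane_type equals the minimum of their group.
result = df[df["plane_type"] == group_min]
print(result)
```

Because this only filters the original frame, all columns and the original row labels are preserved, and ties on the minimum are kept just as in the apply version.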
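You can also try another approach: set date, airport_id and runway as the index, group by those index levels, keep the rows with the smallest plane_type in each group, and then move the index levels back into regular columns.
The code to do it is: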
result = df.set_index(['date', 'airport_id', 'runway'])\
.groupby(['date', 'airport_id', 'runway'], as_index=False)\
.apply(lambda grp: grp[grp.plane_type == grp.plane_type.min()])\
.reset_index(level=[1,2,3])
Grouping on the index may be faster than grouping on columns, so if execution speed is a problem, this approach is worth timing on your own data.
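I have not benchmarked this, so here is a rough way to check it yourself. The sketch below uses synthetic random data (the sizes and value ranges are arbitrary assumptions, and an integer day number stands in for the date), and it compares column grouping with index-level grouping using a transform("min") mask rather than the apply-based code above. Timings will vary with the pandas version and the data.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; sizes and ranges are arbitrary assumptions.
rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "date": rng.integers(0, 365, n),       # day number stands in for a date
    "airport_id": rng.integers(0, 50, n),
    "runway": rng.integers(0, 5, n),
    "plane_type": rng.integers(100, 999, n),
})
keys = ["date", "airport_id", "runway"]
indexed = df.set_index(keys)  # built once, outside the timed functions


def by_columns():
    # Group on ordinary columns.
    mins = df.groupby(keys)["plane_type"].transform("min")
    return df[df["plane_type"] == mins]


def by_index():
    # Group on index levels, then move them back into columns.
    mins = indexed.groupby(level=keys)["plane_type"].transform("min")
    return indexed[indexed["plane_type"] == mins].reset_index()


print("columns:", timeit.timeit(by_columns, number=5))
print("index:  ", timeit.timeit(by_index, number=5))
```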