pandas: remove rows if condition is met

I have a dataframe like:

date                airport_id  plane_type    runway
2020-01-01            11        333           3
2020-01-01            11        222           3
2020-01-02            11        333           3
2020-01-02            11        222           3
2020-01-03            11        333           3
2020-01-04            11        222           3
2020-01-01            12        222           3
2020-01-01            12        345           4

On a given date, no two types of plane (plane_type) may be present on the same runway. When that happens, remove the row with the larger plane_type.

Expected output:

date                airport_id  plane_type    runway
2020-01-01            11        222           3
2020-01-02            11        222           3
2020-01-03            11        333           3
2020-01-04            11        222           3
2020-01-01            12        222           3
2020-01-01            12        345           4

Any help would be very much appreciated! Thank you.

It appears that you want to take the smallest plane_type for a given date, airport_id and runway. You could do that via a groupby statement as follows:

result = (
    df.groupby(['date', 'airport_id', 'runway'], as_index=False)['plane_type'].min()
    .sort_values(['airport_id', 'runway'])
)
>>> result
         date  airport_id  runway  plane_type
0  2020-01-01          11       3         222
3  2020-01-02          11       3         222
4  2020-01-03          11       3         333
5  2020-01-04          11       3         222
1  2020-01-01          12       3         222
2  2020-01-01          12       4         345

You can then merge additional columns (e.g. city and country) back to this result, assuming that the values are unique for the given merge key.

result.merge(df, on=['date', 'airport_id', 'runway', 'plane_type'])
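As a minimal, self-contained sketch of the merge step: the sample data below is copied from the question, while the `city` column and its values are made up purely for illustration and are not part of the original data.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02",
             "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-01"],
    "airport_id": [11, 11, 11, 11, 11, 11, 12, 12],
    "plane_type": [333, 222, 333, 222, 333, 222, 222, 345],
    "runway": [3, 3, 3, 3, 3, 3, 3, 4],
})
# Hypothetical extra column (assumption, not in the original question).
df["city"] = df["airport_id"].map({11: "A", 12: "B"})

keys = ["date", "airport_id", "runway"]

# Smallest plane_type per (date, airport_id, runway) group.
result = df.groupby(keys, as_index=False)["plane_type"].min()

# Inner merge on the keys plus plane_type brings the other columns back.
merged = result.merge(df, on=keys + ["plane_type"])
print(merged)
```

If the original frame contains several rows with the same key and the same minimal plane_type, the merge returns all of them, so the uniqueness assumption above matters.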

From your expected output I see that a requirement concerning airport_id should be added:

  • no two types of plane...,
  • in any given airport_id (this is the part to add),
  • if they have the same runway.

To generate this result, run:

result = df.groupby(['date', 'airport_id', 'runway'], as_index=False,
    sort=False).apply(lambda grp: grp[grp.plane_type == grp.plane_type.min()])\
    .reset_index(level=0, drop=True)

The result is:

        date  airport_id  plane_type  runway
1 2020-01-01          11         222       3
3 2020-01-02          11         222       3
4 2020-01-03          11         333       3
5 2020-01-04          11         222       3
6 2020-01-01          12         222       3
7 2020-01-01          12         345       4
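A possible alternative, not taken from the answer above, is to avoid apply and compare each row against its group minimum with transform. This is only a sketch under the assumption that the data looks like the sample in the question. It keeps the original row index and column order.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02",
             "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-01"],
    "airport_id": [11, 11, 11, 11, 11, 11, 12, 12],
    "plane_type": [333, 222, 333, 222, 333, 222, 222, 345],
    "runway": [3, 3, 3, 3, 3, 3, 3, 4],
})

keys = ["date", "airport_id", "runway"]

# transform("min") returns a Series aligned with df, holding each row's group minimum.
group_min = df.groupby(keys)["plane_type"].transform("min")

# Keep only the rows whose plane_type equals the minimum of their group.
result = df[df["plane_type"] == group_min]
print(result)
```

Like the apply version, this keeps every row that ties for the minimum within a group.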

You can also try another approach, i.e.:

  • first set the grouping columns as the index,
  • then groupby,
  • and finally change the index columns back to "normal" columns.

The code to do it is:

result = df.set_index(['date', 'airport_id', 'runway'])\
    .groupby(['date', 'airport_id', 'runway'], as_index=False)\
    .apply(lambda grp: grp[grp.plane_type == grp.plane_type.min()])\
    .reset_index(level=[1,2,3])

Access by the index should be considerably faster, so if execution speed is a concern, this may be a better approach.
