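Python: filter rows with conditions on multiple columns
I have a CSV dataset that I need to filter by a condition, but the condition can hold over several days. What I want is to keep only the last row for which the condition is true.
My dataset looks like this: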
Date City Summary No.
2-18-2019 NY Airplane land 23
2-18-2019 London Cargo handling 4
2-18-2019 Dubai Airplane land 92
2-19-2019 Dubai Airplane stay 92
2-19-2019 Paris Flight canceled 78
2-19-2019 LA Airplane Land 7
2-20-2019 Dubai Airplane land 92
2-20-2019 LA Airplane land 29
2-20-2019 NY Airplane left 23
2-21-2019 Paris Airplane reschedule 78
2-21-2019 London Airplane land 4
2-21-2019 LA Airplane from NY land 29
~~~
3-10-2019 London Airplane land 5
3-10-2019 Paris Airplane Land 78
3-10-2019 LA Reschedule 29
3-11-2019 NY Cargo handled 23
3-11-2019 Dubai Arrived be4 2 days 34
~~~
3-21-2019 Dubai Airplane land 92
3-21-2019 New Delhi Reschedule 9
3-21-2019 London Cargo handling 5
3-22-2019 New Delhi Airplane Land 9
3-22-2019 NY Reschedule 23
3-22-2019 Dubai Airplane land 35
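So the code should give us the last entry where an airplane landed, matched on City == City and No. == No. As you can see, the condition can persist over several days. What I want is to check whether the condition holds across two days and, if so, keep only the last day.
The desired output should look like the following dataset: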
Date City Summary No.
2-18-2019 NY Airplane land 23
2-19-2019 LA Airplane Land 7
2-20-2019 Dubai Airplane land 92
2-21-2019 London Airplane land 4
2-21-2019 LA Airplane from NY land 29
~~~
3-10-2019 London Airplane land 5
3-10-2019 Paris Airplane Land 78
~~~
3-21-2019 Dubai Airplane land 92
3-22-2019 New Delhi Airplane Land 9
3-22-2019 Dubai Airplane land 35
My code is below, but it does not work:
import pandas as pd
import openpyxl
import numpy as np
import io
from datetime import timedelta
df = pd.read_csv(r"C:\Airplanes.csv")
pd.set_option('display.max_columns', 500)
df = df.astype(str)
count = df.groupby(['City', 'No.'])['No.'].transform('size')
df['Date'] = pd.to_datetime(df['Date'])
df = df[(df.Summary.str.contains('Airplane ') & df.Summary.str.contains('Land'))]
def filter(grp):
    a = grp.Date + timedelta(days=2)
    return grp[~grp.Date.isin(a)]
df.groupby(['City']).apply(filter).reset_index(drop=True)
export_excel = df.to_excel(r'C:\MS.xlsx', index=None, header=True)
Please help me fix it.
I think you need:
# convert the Date column to datetimes
df['Date'] = pd.to_datetime(df['Date'])
# keep landing rows; match 'Land' case-insensitively
df = df[(df.Summary.str.contains('Airplane ') & df.Summary.str.contains('Land', case=False))]
# mask: True where the following day also appears among the remaining dates
m = df['Date'].isin(df['Date'] - pd.Timedelta(days=1))
# for dates with a following day keep only the first row; keep all other rows
df = df[(m & ~df['Date'].duplicated()) | ~m]
print(df)
Date City Summary No.
0 2019-02-18 NY Airplane land 23
5 2019-02-19 LA Airplane Land 7
6 2019-02-20 Dubai Airplane land 92
10 2019-02-21 London Airplane land 4
11 2019-02-21 LA Airplane from NY land 29
12 2019-03-10 London Airplane land 5
13 2019-03-10 Paris Airplane Land 78
17 2019-03-21 Dubai Airplane land 92
20 2019-03-22 New Delhi Airplane Land 9
22 2019-03-22 Dubai Airplane land 92
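The snippet above compares dates across the whole frame and never refers to City or No. A variant that groups on those two columns, as the question describes, might look like the sketch below. It assumes that a "run" means landing rows for the same City and No. that are at most two days apart, which is one reading of the question rather than something stated in the thread. The inline CSV is the question's sample with commas added, and the names `landed`, `gap`, `new_run` and `run_id` are illustrative.

```python
import io
import pandas as pd

# Sample rows from the question, written as comma-separated text (assumed layout).
raw = """Date,City,Summary,No.
2-18-2019,NY,Airplane land,23
2-18-2019,London,Cargo handling,4
2-18-2019,Dubai,Airplane land,92
2-19-2019,Dubai,Airplane stay,92
2-19-2019,Paris,Flight canceled,78
2-19-2019,LA,Airplane Land,7
2-20-2019,Dubai,Airplane land,92
2-20-2019,LA,Airplane land,29
2-20-2019,NY,Airplane left,23
2-21-2019,Paris,Airplane reschedule,78
2-21-2019,London,Airplane land,4
2-21-2019,LA,Airplane from NY land,29
3-10-2019,London,Airplane land,5
3-10-2019,Paris,Airplane Land,78
3-10-2019,LA,Reschedule,29
3-11-2019,NY,Cargo handled,23
3-11-2019,Dubai,Arrived be4 2 days,34
3-21-2019,Dubai,Airplane land,92
3-21-2019,New Delhi,Reschedule,9
3-21-2019,London,Cargo handling,5
3-22-2019,New Delhi,Airplane Land,9
3-22-2019,NY,Reschedule,23
3-22-2019,Dubai,Airplane land,35
"""

df = pd.read_csv(io.StringIO(raw))
df["Date"] = pd.to_datetime(df["Date"], format="%m-%d-%Y")

# Keep only landing rows; "land" is matched case-insensitively.
is_landing = (df["Summary"].str.contains("Airplane")
              & df["Summary"].str.contains("land", case=False))
landed = df[is_landing].sort_values(["City", "No.", "Date"])

# Gap to the previous landing of the same City / No. (NaT for the first one).
gap = landed.groupby(["City", "No."])["Date"].diff()

# A new run starts at the first row of a group or after a gap of more than 2 days.
new_run = gap.isna() | (gap > pd.Timedelta(days=2))
run_id = new_run.cumsum()

# Keep the last row of every run and restore the original row order.
result = landed[~run_id.duplicated(keep="last")].sort_index()
print(result)
# Row labels kept: 0, 5, 6, 10, 11, 12, 13, 17, 20, 22
```

On the sample data this keeps the same rows as the desired output in the question. The two-day threshold is the only tunable part; changing `pd.Timedelta(days=2)` changes how far apart two landings may be while still counting as one run.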