Python filter with multiple conditions for string and date columns
I have a CSV dataset that I need to filter on several conditions, but the conditions can be true on multiple days. What I want is to keep only the last row that satisfies the conditions within any 3-day span.
My dataset looks like this:
Date City Summary Flight No. Company
2-18-2019 NY Airplane land 23 Delta
2-18-2019 London Cargo handling 4 British
2-18-2019 Dubai Airplane land 92 Emirates
2-19-2019 Dubai Airplane stay 92 Emirates
2-19-2019 Paris Flight canceled 78 British
2-19-2019 LA Airplane Land 7 United
2-20-2019 Dubai Airplane land 92 Emirates
2-20-2019 LA Airplane land 29 Delta
2-20-2019 NY Airplane left 23 Delta
2-21-2019 Paris Airplane reschedule 78 British
2-21-2019 London Airplane land 4 British
2-21-2019 LA Airplane from NY land 29 Delta
~~~
3-10-2019 London Airplane land 5 KLM
3-10-2019 Paris Airplane Land 78 Air France
3-10-2019 LA Reschedule 29 United
3-11-2019 NY Cargo handled 23 Delta
3-11-2019 Dubai Arrived be4 2 days 34 Etihad
~~~
3-21-2019 Dubai Airplane land 92 Etihad
3-21-2019 New Delhi Reschedule 9 AirAsia
3-21-2019 London Cargo handling 5 Lufthansa
3-22-2019 New Delhi Airplane Land 9 AirAsia
3-22-2019 NY Reschedule 23 United
3-22-2019 Dubai Airplane land 35 Etihad
The code should check that Summary.str.contains('Airplane ') & df.Summary.str.contains('Land') holds and then, among rows that share the same City, Flight No. and Company, return only the last entry within three days. So if all conditions are true on the 18th and the 20th, the code should return only the 20th, but if they are true on the 18th and the 21st, it should keep both. Please note that the matching rows do not have the same data in every column (they are not duplicated rows).
The desired output should look like the dataset below:
Date City Summary Flight No. Company
2-18-2019 NY Airplane land 23 Delta
2-19-2019 LA Airplane Land 7 United
2-20-2019 Dubai Airplane land 92 Emirates
2-21-2019 London Airplane land 4 British
2-21-2019 LA Airplane from NY land 29 Delta
~~~
3-10-2019 London Airplane land 5 KLM
3-10-2019 Paris Airplane Land 78 Air France
~~~
3-21-2019 Dubai Airplane land 92 Etihad
3-22-2019 New Delhi Airplane Land 9 AirAsia
3-22-2019 Dubai Airplane land 35 Etihad
My code is below, but it doesn't work:
import pandas as pd
import openpyxl
import numpy as np
import io
from datetime import timedelta
df = pd.read_csv(r"C:\Airplanes.csv")
pd.set_option('display.max_columns', 500)
df = df.astype(str)
count = df.groupby(['City', 'Flight No.'])['No.'].transform('size')
df['Date'] = pd.to_datetime(df['Date'])
df = df[(df.Summary.str.contains('Airplane ') & df.Summary.str.contains('Land'))]
def filter(grp):
    a = grp.Date + timedelta(days=2)
    return grp[~grp.Date.isin(a)]
df = np.where((df['City'] == df['City']) & (df['Company'] == df['Company']) & (df['Flight No.'] == df['Flight No.']).apply(filter).reset_index(drop=True))
export_excel = df.to_excel(r'C:\MS.xlsx', index=None, header=True)
It returns the following error:
AttributeError: 'bool' object has no attribute 'Date'
Please help me find a way to apply all the conditions and keep the last True entry within a specific number of days.
First, the condition you use inside np.where will always be True, and it is unclear from the rest of the code what the columns 'Rig' and 'LinerSize' are. Called with only a condition, np.where returns a tuple of index arrays such as (array([0, 1, 2], dtype=int64),), and the conditions inside are always True because df['Rig'] == df['Rig'] holds for every row, and likewise for the other columns. A more common use of np.where also passes two values: one to use where the condition is True and one to use where it is False. Even then, the result is an array of values, not the full DataFrame to which you are trying to apply the filter function. I suggest filtering like this:
city_list = ['NY', 'LA'] # just an example
company_list = ['Delta', 'United']
flight_list = [23, 7, 92]
df_new = df[(df['City'].isin(city_list)) &
            (df['Company'].isin(company_list)) &
            (df['Flight No.'].isin(flight_list))]
That should help you get closer to what you want.
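To make the points about np.where concrete, here is a minimal sketch on a made-up three-row frame (the data and the variable names are illustrative assumptions, not taken from the question's CSV). It shows the one-argument and three-argument forms of np.where, why Series.apply hands the function a single bool, and how boolean indexing keeps whole rows.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are assumptions for the demo only.
df = pd.DataFrame({
    "City": ["NY", "LA", "Dubai"],
    "Company": ["Delta", "United", "Emirates"],
})

# Comparing a column with itself is True for every row here.
always_true = df["City"] == df["City"]

# One-argument form: a tuple holding the positions where the mask is True.
positions = np.where(always_true)

# Three-argument form: picks a value per row and returns a NumPy array.
labels = np.where(df["City"].isin(["NY", "LA"]), "US", "other")

# Series.apply calls the function once per element, so each call receives
# a single boolean value rather than a group of rows.
seen_types = always_true.apply(lambda value: type(value).__name__)

# Boolean indexing keeps whole rows, which is what a row filter needs.
us_rows = df[df["City"].isin(["NY", "LA"])]

print(positions)
print(labels)
print(us_rows)
```

Because apply passes one boolean at a time, an attribute access such as grp.Date inside the applied function fails, which is consistent with the AttributeError reported in the question.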
First, we filter the DataFrame using contains, as you did:
>>> df_clean = df[(df['Summary'].str.lower().str.contains('airplane')) & (df['Summary'].str.lower().str.contains('land'))]
>>> df_clean = df_clean.reset_index(drop=True)
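The lower-casing step matters because the question's filter looks for 'Land' with a capital L, while most rows say 'land'. As a small sketch (using a handful of Summary values copied from the dataset, with variable names that are my own), the case=False option of str.contains should give the same effect as calling str.lower() first:

```python
import pandas as pd

summary = pd.Series([
    "Airplane land",
    "Airplane Land",
    "Airplane from NY land",
    "Cargo handling",
    "Reschedule",
])

# Case-sensitive search for 'Land' misses the lower-case 'land' rows.
strict = summary.str.contains("Airplane ") & summary.str.contains("Land")

# case=False ignores letter case; regex=False treats the pattern as plain text.
relaxed = (
    summary.str.contains("airplane", case=False, regex=False)
    & summary.str.contains("land", case=False, regex=False)
)

print(strict.tolist())
print(relaxed.tolist())
```

Note that dropping the trailing space from 'Airplane ' also lets 'Airplane from NY land' match, which the desired output includes.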
Then we handle the date difference using duplicated and diff to get the expected result:
df_clean['date_dt'] = pd.to_datetime(df_clean['Date'], format="%m-%d-%Y")
c = ['City', 'Flight No.', 'Company']
def f(x):
    return (x[c].duplicated() & x['date_dt'].diff().dt.days.lt(4)).sort_values(ascending=False)
df_clean = df_clean.sort_values(c)
res = df_clean[~df_clean.groupby(c).apply(f).values]
res.sort_values('Date')
Output:
Date City Summary Flight No. Company date_dt
0 2-18-2019 NY Airplane land 23 Delta 2019-02-18
2 2-19-2019 LA Airplane Land 7 United 2019-02-19
3 2-20-2019 Dubai Airplane land 92 Emirates 2019-02-20
6 2-21-2019 LA Airplane from NY land 29 Delta 2019-02-21
5 2-21-2019 London Airplane land 4 British 2019-02-21
7 3-10-2019 London Airplane land 5 KLM 2019-03-10
8 3-10-2019 Paris Airplane Land 78 Air France 2019-03-10
9 3-21-2019 Dubai Airplane land 92 Etihad 2019-03-21
10 3-22-2019 Dubai Airplane land 35 Etihad 2019-03-22
11 3-22-2019 New Delhi Airplane Land 9 AirAsia 2019-03-22
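For comparison, the same "keep the last match in each window" idea can be sketched with groupby and shift instead of duplicated and diff. This is an alternative under stated assumptions, not the method used above: it reads the question as "drop a row when a later match for the same City, Flight No. and Company is fewer than 3 days away", and the two Paris rows are hypothetical, added only to show a gap of exactly 3 days.

```python
import pandas as pd

rows = [
    ("2-18-2019", "NY", "Airplane land", 23, "Delta"),
    ("2-18-2019", "Dubai", "Airplane land", 92, "Emirates"),
    ("2-20-2019", "Dubai", "Airplane land", 92, "Emirates"),
    ("2-20-2019", "LA", "Airplane land", 29, "Delta"),
    ("2-21-2019", "LA", "Airplane from NY land", 29, "Delta"),
    # Hypothetical rows (not in the original dataset): exactly 3 days apart.
    ("2-18-2019", "Paris", "Airplane land", 78, "British"),
    ("2-21-2019", "Paris", "Airplane land", 78, "British"),
]
df = pd.DataFrame(rows, columns=["Date", "City", "Summary", "Flight No.", "Company"])
df["date_dt"] = pd.to_datetime(df["Date"], format="%m-%d-%Y")

keys = ["City", "Flight No.", "Company"]
window = pd.Timedelta(days=3)  # assumed threshold, taken from the question's wording

df = df.sort_values(keys + ["date_dt"])
# Date of the next row in the same group (NaT for the group's last row).
next_date = df.groupby(keys)["date_dt"].shift(-1)
# A row is superseded when a later match in its group falls inside the window.
superseded = (next_date - df["date_dt"]) < window
result = df[~superseded].sort_values(["date_dt", "City"]).reset_index(drop=True)

print(result[["Date", "City", "Flight No.", "Company"]])
```

With this rule a chain of matches that are each less than 3 days apart collapses to its final row, so the window value is worth checking against the intended behavior before relying on it.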