
Python filter with multiple conditions for string and date columns

I have a CSV dataset and I need to filter it with conditions, but the problem is that the conditions can be true for multiple days. What I want is to keep the last true value for these conditions within 3 days.

My dataset looks like this:

Date           City             Summary              Flight No.    Company
2-18-2019       NY            Airplane land              23         Delta 
2-18-2019     London          Cargo handling              4         British
2-18-2019      Dubai          Airplane land              92         Emirates
2-19-2019      Dubai          Airplane stay              92         Emirates
2-19-2019      Paris          Flight canceled            78         British
2-19-2019       LA            Airplane Land              7          United
2-20-2019      Dubai          Airplane land              92         Emirates
2-20-2019       LA            Airplane land              29         Delta
2-20-2019       NY            Airplane left              23         Delta
2-21-2019      Paris          Airplane reschedule        78         British
2-21-2019      London         Airplane land              4          British
2-21-2019       LA            Airplane from NY land      29         Delta
~~~
3-10-2019      London         Airplane land              5          KLM
3-10-2019      Paris          Airplane Land              78       Air France
3-10-2019       LA            Reschedule                 29         United
3-11-2019       NY            Cargo handled              23         Delta
3-11-2019      Dubai          Arrived be4 2 days         34         Etihad
~~~
3-21-2019      Dubai          Airplane land              92         Etihad
3-21-2019     New Delhi       Reschedule                 9          AirAsia
3-21-2019      London         Cargo handling             5         Lufthansa
3-22-2019     New Delhi       Airplane Land              9          AirAsia
3-22-2019       NY            Reschedule                 23         United
3-22-2019      Dubai          Airplane land              35         Etihad

The code should check whether Summary.str.contains('Airplane ') & df.Summary.str.contains('Land') holds and, where City, Flight No. and Company all match, return only the last entry within three days. So if all conditions are true on the 18th and the 20th, the code should return the 20th only, but if they are true on the 18th and the 21st, it should keep both. Please note that the matching rows do not hold the same data in every column (they are not duplicated rows).

The desired output should look like the dataset below:


Date           City             Summary              Flight No.    Company
2-18-2019       NY            Airplane land              23         Delta 
2-19-2019       LA            Airplane Land              7          United
2-20-2019      Dubai          Airplane land              92         Emirates
2-21-2019      London         Airplane land              4          British
2-21-2019       LA            Airplane from NY land      29         Delta
~~~
3-10-2019      London         Airplane land              5          KLM
3-10-2019      Paris          Airplane Land              78       Air France
~~~
3-21-2019      Dubai          Airplane land              92         Etihad
3-22-2019     New Delhi       Airplane Land              9          AirAsia
3-22-2019      Dubai          Airplane land              35         Etihad

My code is below, but it doesn't work:

import pandas as pd
import openpyxl
import numpy as np
import io
from datetime import timedelta

df = pd.read_csv(r"C:\Airplanes.csv")

pd.set_option('display.max_columns', 500)
df = df.astype(str)



count = df.groupby(['City', 'Flight No.'])['No.'].transform('size')



df['Date'] = pd.to_datetime(df['Date'])

df = df[(df.Summary.str.contains('Airplane ') & df.Summary.str.contains('Land'))]


def filter(grp):
    a = grp.Date + timedelta(days=2)
    return grp[~grp.Date.isin(a)]

df = np.where((df['City'] == df['City']) & (df['Company'] == df['Company']) & (df['Flight No.'] == df['Flight No.']).apply(filter).reset_index(drop=True))


export_excel = df.to_excel(r'C:\MS.xlsx', index=None, header=True)

It returns the error below:

AttributeError: 'bool' object has no attribute 'Date'

Please help me find a way to apply all the conditions and keep the last True entry within a specific number of days.

First, the condition you use inside np.where will always be True. Also, it is unclear from the rest of the code provided what the columns 'Rig' and 'LinerSize' are. Your use of np.where returns a tuple, (array([0, 1, 2], dtype=int64),), and the conditions inside are always True, since we'll always have df['Rig'] == df['Rig'], etc. A common use of np.where is to pass two additional values: one to use where the condition is True and the other where it is False. Yet this returns an array, not the full data frame to which you are trying to apply the filter function. I suggest filtering like this:

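city_list = ['NY', 'LA']  # just an example
company_list = ['Delta', 'United']
flight_list = [23, 7, 92]  # use strings here if the column was converted with astype(str)
# Index df with the combined boolean mask to keep only the matching rows
df_new = df[(df['City'].isin(city_list)) &
            (df['Company'].isin(company_list)) &
            (df['Flight No.'].isin(flight_list))]

That should help you get closer to what you want.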
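
To illustrate the point about np.where made earlier in this answer, here is a minimal sketch on a made-up three-row frame (the data and the "US"/"other" labels are assumptions for illustration only). It shows that comparing a column with itself is True on every row, what the one-argument form returns, and what the three-argument form returns.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame, only for demonstration
df = pd.DataFrame({
    "City": ["NY", "LA", "Dubai"],
    "Company": ["Delta", "United", "Emirates"],
})

# Comparing a column with itself is True on every row (as long as there are no NaN values)
always_true = df["City"] == df["City"]

# One-argument form: a tuple holding the positions where the condition holds
positions = np.where(always_true)

# Three-argument form: one value per row, returned as a NumPy array
labels = np.where(df["City"].isin(["NY", "LA"]), "US", "other")

print(positions)
print(labels)
```

Neither form returns a DataFrame, which is why calling .apply(filter) on the result cannot work as intended.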

First, we filter the DataFrame using contains, as you did:

>>> df_clean = df[(df['Summary'].str.lower().str.contains('airplane')) & (df['Summary'].str.lower().str.contains('land'))]
>>> df_clean = df_clean.reset_index(drop=True)
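
As a side note on the case handling above: the question's original filter looks for 'Land' with a capital L, so rows such as "Airplane land" would be missed. Lower-casing first avoids that. An equivalent option, sketched here on a made-up Series, is to pass case=False to str.contains; na=False is added on the assumption that the Summary column may contain missing values.

```python
import pandas as pd

# Hypothetical sample of Summary values, including a missing one
summary = pd.Series([
    "Airplane land",
    "Airplane Land",
    "Cargo handling",
    "Airplane from NY land",
    None,
])

# case=False ignores capitalisation; na=False treats missing values as non-matches
mask = (
    summary.str.contains("airplane", case=False, na=False)
    & summary.str.contains("land", case=False, na=False)
)

print(summary[mask])
```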

Then we handle the date difference using duplicated and diff, like so, to get the expected result:

df_clean['date_dt'] = pd.to_datetime(df_clean['Date'], format="%m-%d-%Y")

c = ['City', 'Flight No.', 'Company']

def f(x):
    return (x[c].duplicated() & x['date_dt'].diff().dt.days.lt(4)).sort_values(ascending=False)

df_clean = df_clean.sort_values(c)
res = df_clean[~df_clean.groupby(c).apply(f).values]
res.sort_values('Date')

Output:

    Date        City        Summary                 Flight No.  Company     date_dt
0   2-18-2019   NY          Airplane land           23          Delta       2019-02-18
2   2-19-2019   LA          Airplane Land           7           United      2019-02-19
3   2-20-2019   Dubai       Airplane land           92          Emirates    2019-02-20
6   2-21-2019   LA          Airplane from NY land   29          Delta       2019-02-21
5   2-21-2019   London      Airplane land           4           British     2019-02-21
7   3-10-2019   London      Airplane land           5           KLM         2019-03-10
8   3-10-2019   Paris       Airplane Land           78          Air France  2019-03-10
9   3-21-2019   Dubai       Airplane land           92          Etihad      2019-03-21
10  3-22-2019   Dubai       Airplane land           35          Etihad      2019-03-22
11  3-22-2019   New Delhi   Airplane Land           9           AirAsia     2019-03-22
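
For comparison, the same idea can be sketched without groupby().apply by looking at the gap to the next row of each group. This is an alternative under stated assumptions, not the method used above: the helper name keep_last_within and its window_days parameter are hypothetical, the demo rows are made up, and a row is dropped only when a later matching row follows in fewer than window_days days. With window_days=3, the 18th and 21st are both kept, as the question asks, whereas lt(4) in the code above would treat a 3-day gap as still inside the window.

```python
import pandas as pd


def keep_last_within(df, keys, date_col, window_days=3):
    """Drop a row when a later row with the same keys follows in under window_days days."""
    out = df.sort_values(keys + [date_col])
    # Gap from each row to the next row of the same group (NaT for a group's last row)
    gap_to_next = out.groupby(keys)[date_col].shift(-1) - out[date_col]
    # NaT compares as False, so the last row of every group is always kept
    superseded = gap_to_next < pd.Timedelta(days=window_days)
    return out[~superseded].sort_values(date_col, kind="stable")


# Made-up rows: Dubai repeats after 2 days, NY repeats after 3 days
demo = pd.DataFrame({
    "Date": ["2-18-2019", "2-20-2019", "2-18-2019", "2-21-2019"],
    "City": ["Dubai", "Dubai", "NY", "NY"],
    "Flight No.": [92, 92, 23, 23],
    "Company": ["Emirates", "Emirates", "Delta", "Delta"],
})
demo["date_dt"] = pd.to_datetime(demo["Date"], format="%m-%d-%Y")

result = keep_last_within(demo, ["City", "Flight No.", "Company"], "date_dt")
print(result)
```

Here only the Dubai row from the 18th is dropped. Adjust window_days if a 3-day gap should also count as "within three days".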
