Filtering DataFrame in pandas based on criteria from another DataFrame

I have two data frames: one with millions of rows of data and another with only a few hundred records. I need to filter the first data frame by three columns from the second.

So basically I need to iterate through each row in df2 and check whether df1 has any rows with the same ticker and a date between the start and end dates. Sadly, I have no clue how to do this in Python.

My data frames are similar to the following:

     Ticker    date
1    AA       2013-12-31 
3    AA       2015-02-28 
4    AA       2016-03-31 
5    AA       2016-04-30 
6    BB       2014-05-31 
7    BB       2014-06-30 
8    BB       2017-07-31 
9    CC       2014-08-31 
10   CC       2017-09-30 
11   CC       2018-10-31 
12   CC       2018-11-30 
13   DD       2018-11-30 
14   DD       2018-12-21

Second one:

     Ticker    StartDate   EndDate
1    AA       2016-01-01   2017-01-01
2    BB       2014-01-01   2015-01-01
3    CC       2018-01-01   2019-01-01
4    AA       2013-01-01   2014-01-01

My expected result is the first data frame filtered down to the records whose ticker appears in df2 and whose date falls between that ticker's start and end dates:

   Ticker     date
1    AA       2013-12-31  
2    AA       2016-03-31 
3    AA       2016-04-30 
4    BB       2014-05-31 
5    BB       2014-06-30  
6    CC       2018-11-30 

UPD

So I've tried the following:

import pandas as pd

matches = []  # filtered pieces, one per row of df2
for index, row in df2.iterrows():
    # Rows of df1 with the same ticker and a date inside this row's range
    df3 = df1.loc[(df1['DATE'] >= row['StartDate'])
                  & (df1['DATE'] <= row['EndDate'])
                  & (df1['Ticker'] == row['Ticker'])]
    matches.append(df3)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df4 = pd.concat(matches)

It works, but it takes ages: filtering 2% of my data took around 25 minutes.

Is there any way to speed it up?

Try this:

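# Inner join on the shared 'Ticker' column: each row is paired with every range for its ticker
df3 = df1.merge(df2, on='Ticker')
# Keep only the rows whose date falls inside the paired range
df3 = df3.loc[(df3['date'] >= df3['StartDate']) & (df3['date'] <= df3['EndDate'])]
# Drop the helper columns so only the original df1 columns remain
df3 = df3.drop(['StartDate', 'EndDate'], axis=1)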
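
For reference, here is a minimal self-contained sketch of the merge-then-filter idea, run on the sample data from the question. The frame construction, the `pd.to_datetime` conversion, and the final `drop_duplicates` (which only matters if two ranges for the same ticker overlap) are my assumptions and not part of the answer above. Note that, by the criteria as stated, CC 2018-10-31 is selected too, although the question's expected output leaves it out.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Ticker": ["AA", "AA", "AA", "AA", "BB", "BB", "BB",
               "CC", "CC", "CC", "CC", "DD", "DD"],
    "date": ["2013-12-31", "2015-02-28", "2016-03-31", "2016-04-30",
             "2014-05-31", "2014-06-30", "2017-07-31", "2014-08-31",
             "2017-09-30", "2018-10-31", "2018-11-30", "2018-11-30",
             "2018-12-21"],
})
df2 = pd.DataFrame({
    "Ticker": ["AA", "BB", "CC", "AA"],
    "StartDate": ["2016-01-01", "2014-01-01", "2018-01-01", "2013-01-01"],
    "EndDate": ["2017-01-01", "2015-01-01", "2019-01-01", "2014-01-01"],
})

# Assumption: dates arrive as strings, so convert them to real datetimes.
df1["date"] = pd.to_datetime(df1["date"])
df2["StartDate"] = pd.to_datetime(df2["StartDate"])
df2["EndDate"] = pd.to_datetime(df2["EndDate"])

# Pair each df1 row with every range defined for the same ticker.
pairs = df1.merge(df2, on="Ticker", how="inner")

# Keep the pairs whose date lies inside the range (both ends inclusive).
in_range = pairs["date"].between(pairs["StartDate"], pairs["EndDate"])

result = (
    pairs.loc[in_range, ["Ticker", "date"]]
    .drop_duplicates()  # a row could match several overlapping ranges
    .sort_values(["Ticker", "date"])
    .reset_index(drop=True)
)
print(result)
```

The merge creates one intermediate row per (df1 row, range for that ticker) pair, so memory use grows with the number of ranges per ticker. With a few hundred ranges in total that is usually acceptable, but it is worth checking on the full data.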

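It looks like you can use groupby to create a single date range for each Ticker. Note that this collapses all ranges for a ticker into one span from its earliest StartDate to its latest EndDate, so dates that fall in a gap between two ranges (AA 2015-02-28 below) are kept as well.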

data = pd.read_clipboard()
flt_df = pd.read_clipboard()

data

   Ticker        date
1      AA  2013-12-31
3      AA  2015-02-28
4      AA  2016-03-31
5      AA  2016-04-30
6      BB  2014-05-31
7      BB  2014-06-30
8      BB  2017-07-31
9      CC  2014-08-31
10     CC  2017-09-30
11     CC  2018-10-31
12     CC  2018-11-30
13     DD  2018-11-30
14     DD  2018-12-21

flt_df

  Ticker   StartDate     EndDate
1     AA  2016-01-01  2017-01-01
2     BB  2014-01-01  2015-01-01
3     CC  2018-01-01  2019-01-01
4     AA  2013-01-01  2014-01-01

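# One range per ticker: earliest StartDate to latest EndDate
grouped_df = flt_df.groupby('Ticker').agg({'StartDate':'min','EndDate':'max'})
# Attach that range to every row of data by joining on the Ticker index
merged = data.set_index('Ticker').join(grouped_df)
# Keep the rows whose date lies inside the ticker's range, then drop the helper columns
merged = merged[(merged.date>=merged.StartDate)&(merged.date<=merged.EndDate)]
merged.drop(['StartDate','EndDate'],axis=1,inplace=True)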

merged

              date
Ticker            
AA      2013-12-31
AA      2015-02-28
AA      2016-03-31
AA      2016-04-30
BB      2014-05-31
BB      2014-06-30
CC      2018-10-31
CC      2018-11-30
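
If the separate ranges for a ticker need to stay separate (so that AA 2015-02-28 is excluded), one option is `pd.merge_asof`, which matches each row to the most recent `StartDate` for the same ticker without building every row/range pair. This is only a sketch, and it rests on assumptions that do not come from the answers above: the date columns are real datetimes, and the ranges for any one ticker do not overlap. The sample frames are rebuilt inline so the block runs on its own.

```python
import pandas as pd

# Sample data from the question, with dates parsed up front (an assumption).
data = pd.DataFrame({
    "Ticker": ["AA", "AA", "AA", "AA", "BB", "BB", "BB",
               "CC", "CC", "CC", "CC", "DD", "DD"],
    "date": pd.to_datetime([
        "2013-12-31", "2015-02-28", "2016-03-31", "2016-04-30",
        "2014-05-31", "2014-06-30", "2017-07-31", "2014-08-31",
        "2017-09-30", "2018-10-31", "2018-11-30", "2018-11-30",
        "2018-12-21"]),
})
flt_df = pd.DataFrame({
    "Ticker": ["AA", "BB", "CC", "AA"],
    "StartDate": pd.to_datetime(
        ["2016-01-01", "2014-01-01", "2018-01-01", "2013-01-01"]),
    "EndDate": pd.to_datetime(
        ["2017-01-01", "2015-01-01", "2019-01-01", "2014-01-01"]),
})

# merge_asof requires both frames to be sorted by their join key.
left = data.sort_values("date")
right = flt_df.sort_values("StartDate")

# For every row, find the latest range of the same ticker that starts
# on or before the row's date.
matched = pd.merge_asof(
    left, right,
    left_on="date", right_on="StartDate",
    by="Ticker", direction="backward",
)

# Rows without a candidate range have NaT in EndDate; comparing against
# NaT yields False, so they are dropped together with rows past EndDate.
exact = (
    matched.loc[matched["date"] <= matched["EndDate"], ["Ticker", "date"]]
    .sort_values(["Ticker", "date"])
    .reset_index(drop=True)
)
print(exact)
```

Because each row is matched to at most one range, the intermediate frame never has more rows than `data`, which may matter with millions of rows. If ranges for a ticker can overlap, this shortcut can miss matches and the plain merge-then-filter approach is the safer choice.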
