Filtering DataFrame in pandas based on criteria from another DataFrame
I have two data frames, one with millions of rows and another with only a few hundred records, and I need to filter the first data frame by three columns from the second.
So basically I need to iterate through each row in df2 and see whether there are any rows in df1 with the same ticker and a date between the start and end dates. Sadly, I have no clue how to do this in Python.
My data frames are similar to the following:
Ticker date
1 AA 2013-12-31
3 AA 2015-02-28
4 AA 2016-03-31
5 AA 2016-04-30
6 BB 2014-05-31
7 BB 2014-06-30
8 BB 2017-07-31
9 CC 2014-08-31
10 CC 2017-09-30
11 CC 2018-10-31
12 CC 2018-11-30
13 DD 2018-11-30
14 DD 2018-12-21
Second one:
Ticker StartDate EndDate
1 AA 2016-01-01 2017-01-01
2 BB 2014-01-01 2015-01-01
3 CC 2018-01-01 2019-01-01
4 AA 2013-01-01 2014-01-01
My expected result is the first data frame filtered down to all records, for every ticker in df2, that fall between the start and end dates:
Ticker date
1 AA 2013-12-31
2 AA 2016-03-31
3 AA 2016-04-30
4 BB 2014-05-31
5 BB 2014-06-30
6 CC 2018-11-30
UPD
So I've tried the following:
# Start with an empty DataFrame to collect the matches
df4 = pd.DataFrame()
# For every row of df2, select the df1 rows that meet its criteria
for index, row in df2.iterrows():
    df3 = df1.loc[(df1['DATE'] >= row['StartDate']) & (df1['DATE'] <= row['EndDate']) & (df1['Ticker'] == row['Ticker'])]
    # Append the rows matched by this df2 row to the result
    df4 = df4.append(df3)
It works, but it takes ages: I tried to filter 2% of my data and it took around 25 minutes.
Is there any way to speed it up?
Try this:
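# Pair each df1 row with every df2 range for the same ticker
df3 = df1.merge(df2, on='Ticker')
# Keep only the rows whose date falls inside the paired range
df3 = df3.loc[(df3['date'] >= df3['StartDate']) & (df3['date'] <= df3['EndDate'])]
# Drop the helper columns and keep the result
df3 = df3.drop(['StartDate', 'EndDate'], axis=1)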
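For reference, here is a minimal self-contained sketch of the merge-then-filter idea, run on the sample data from the question. It assumes the date columns are parsed with pd.to_datetime, and it adds one step that is not part of the answer above: carrying the original df1 index through the merge so that a row matching several overlapping ranges is kept only once. The names paired, in_range, keep and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Ticker": ["AA", "AA", "AA", "AA", "BB", "BB", "BB",
               "CC", "CC", "CC", "CC", "DD", "DD"],
    "date": pd.to_datetime([
        "2013-12-31", "2015-02-28", "2016-03-31", "2016-04-30",
        "2014-05-31", "2014-06-30", "2017-07-31",
        "2014-08-31", "2017-09-30", "2018-10-31", "2018-11-30",
        "2018-11-30", "2018-12-21",
    ]),
})
df2 = pd.DataFrame({
    "Ticker": ["AA", "BB", "CC", "AA"],
    "StartDate": pd.to_datetime(
        ["2016-01-01", "2014-01-01", "2018-01-01", "2013-01-01"]),
    "EndDate": pd.to_datetime(
        ["2017-01-01", "2015-01-01", "2019-01-01", "2014-01-01"]),
})

# Pair every df1 row with every range for the same ticker.
# reset_index() keeps the original df1 row label in a column named "index".
paired = df1.reset_index().merge(df2, on="Ticker", how="inner")

# True where the date lies inside the paired range (both ends inclusive).
in_range = paired["date"].between(paired["StartDate"], paired["EndDate"])

# A row may match several overlapping ranges; keep each original row once.
keep = paired.loc[in_range, "index"].unique()
result = df1.loc[keep].sort_index()

print(result)
```

On the sample data this keeps seven rows, including CC 2018-10-31, which lies inside the CC range 2018-01-01 to 2019-01-01. Note that the intermediate frame has one row per (df1 row, matching-ticker range) pair, so memory use grows with the number of ranges per ticker.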
It looks like you can use groupby to create the date range for each Ticker:
data = pd.read_clipboard()
flt_df = pd.read_clipboard()
data
Ticker date
1 AA 2013-12-31
3 AA 2015-02-28
4 AA 2016-03-31
5 AA 2016-04-30
6 BB 2014-05-31
7 BB 2014-06-30
8 BB 2017-07-31
9 CC 2014-08-31
10 CC 2017-09-30
11 CC 2018-10-31
12 CC 2018-11-30
13 DD 2018-11-30
14 DD 2018-12-21
flt_df
Ticker StartDate EndDate
1 AA 2016-01-01 2017-01-01
2 BB 2014-01-01 2015-01-01
3 CC 2018-01-01 2019-01-01
4 AA 2013-01-01 2014-01-01
grouped_df = flt_df.groupby('Ticker').agg({'StartDate':'min','EndDate':'max'})
merged = data.set_index('Ticker').join(grouped_df)
merged = merged[(merged.date>=merged.StartDate)&(merged.date<=merged.EndDate)]
merged.drop(['StartDate','EndDate'],axis=1,inplace=True)
merged
date
Ticker
AA 2013-12-31
AA 2015-02-28
AA 2016-03-31
AA 2016-04-30
BB 2014-05-31
BB 2014-06-30
CC 2018-10-31
CC 2018-11-30
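Note that taking the min of StartDate and the max of EndDate collapses the two AA ranges into a single span from 2013-01-01 to 2017-01-01, which is why AA 2015-02-28 appears in the output above even though it lies in neither original range. If the ranges for each ticker do not overlap, one possible alternative is pd.merge_asof, which matches each date to the most recent StartDate for the same ticker without multiplying rows. The following is only a sketch under that non-overlap assumption, with dates parsed by pd.to_datetime; the names left, right, matched and filtered are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "Ticker": ["AA", "AA", "AA", "AA", "BB", "BB", "BB",
               "CC", "CC", "CC", "CC", "DD", "DD"],
    "date": pd.to_datetime([
        "2013-12-31", "2015-02-28", "2016-03-31", "2016-04-30",
        "2014-05-31", "2014-06-30", "2017-07-31",
        "2014-08-31", "2017-09-30", "2018-10-31", "2018-11-30",
        "2018-11-30", "2018-12-21",
    ]),
})
flt_df = pd.DataFrame({
    "Ticker": ["AA", "BB", "CC", "AA"],
    "StartDate": pd.to_datetime(
        ["2016-01-01", "2014-01-01", "2018-01-01", "2013-01-01"]),
    "EndDate": pd.to_datetime(
        ["2017-01-01", "2015-01-01", "2019-01-01", "2014-01-01"]),
})

# merge_asof requires both frames to be sorted by their join keys.
left = data.sort_values("date")
right = flt_df.sort_values("StartDate")

# For each row, find the latest range of the same ticker that starts
# on or before the date (direction="backward" is the default).
matched = pd.merge_asof(left, right, left_on="date",
                        right_on="StartDate", by="Ticker")

# Keep rows that also end inside that range; rows with no matching
# range have a missing EndDate, so the comparison is False for them.
filtered = matched.loc[matched["date"] <= matched["EndDate"],
                       ["Ticker", "date"]]
filtered = filtered.sort_values(["Ticker", "date"]).reset_index(drop=True)

print(filtered)
```

On the sample data this drops AA 2015-02-28 and keeps seven rows. With overlapping ranges for the same ticker, a date could sit inside an earlier, longer range while being compared only against the latest one, so this sketch would not be safe in that case.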