Pandas: for each row, count occurrences in another df within specific dates
I have the following two DataFrames:
df1
|company|company_id| date | max_date |
| A21 | 5 |2021-02-04| 2021-02-11|
| A21 | 10 |2020-10-04| 2020-10-11|
| B52 | 8 |2021-03-04| 2021-04-11|
| B52 | 6 |2020-04-04| 2020-04-11|
-------------------------------------------
and df2:
|company|company_id| date_df2 |
| A21 | 5 |2021-02-05|
| A21 | 5 |2021-02-08|
| A21 | 5 |2021-02-12|
| A21 | 5 |2021-02-11|
| A21 | 10 |2020-10-07|
| B52 | 8 |2021-03-07|
| B52 | 6 |2020-04-08|
| B52 | 6 |2020-04-12|
| B52 | 6 |2020-04-05|
-------------------------------
Logic: for each company and company_id in df1, I want to count the rows in df2 whose date_df2 falls between date and max_date from df1 (both bounds inclusive).
Expected result:
|company|company_id| date | max_date |count|
| A21 | 5 |2021-02-04| 2021-02-11| 3 |
| A21 | 10 |2020-10-04| 2020-10-11| 1 |
| B52 | 8 |2021-03-04| 2021-04-11| 1 |
| B52 | 6 |2020-04-04| 2020-04-11| 2 |
------------------------------------------------
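How can this be achieved in pandas?
Code to reproduce the DataFrames (note that the date column in df2 is named date here, not date_df2):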
import pandas as pd

#df1
list_columns = ['company','company_id','date','max_date']
list_data = [
['A21',5,'2021-02-04','2021-02-11'],
['A21',10,'2020-10-04','2020-10-11'],
['B52',8,'2021-03-04','2021-04-11'],
['B52',6,'2020-04-04','2020-04-11']
]
df1 = pd.DataFrame(columns=list_columns, data=list_data)
#df2
list_columns = ['company','company_id','date']
list_data = [
['A21',5,'2021-02-05'],
['A21',5,'2021-02-08'],
['A21',5,'2021-02-12'],
['A21',5,'2021-02-11'],
['A21',10,'2020-10-07'],
['B52',8,'2021-03-07'],
['B52',6,'2020-04-08'],
['B52',6,'2020-04-12'],
['B52',6,'2020-04-05']
]
df2 = pd.DataFrame(columns=list_columns, data=list_data)
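Use DataFrame.merge with the default inner join, filter the matching rows with Series.between, aggregate the counts with GroupBy.size, then join the result back to df1 as a new column, replacing missing values if necessary: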
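# Convert the string columns to datetimes so they compare as dates
df1['date'] = pd.to_datetime(df1['date'])
df1['max_date'] = pd.to_datetime(df1['max_date'])
df2['date'] = pd.to_datetime(df2['date'])
# Inner join on the keys; the date column from df2 becomes 'date_'
df = df1.merge(df2, on=['company','company_id'], suffixes=('','_'))
# Keep rows whose df2 date lies within [date, max_date], then count per key
s = (df[df['date_'].between(df['date'], df['max_date'])]
.groupby(['company','company_id'])
.size())
# Attach the counts to df1; keys with no match get 0
df1 = df1.join(s.rename('count'), on=['company','company_id']).fillna({'count':0})
print(df1)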
  company  company_id       date   max_date  count
0     A21           5 2021-02-04 2021-02-11      3
1     A21          10 2020-10-04 2020-10-11      1
2     B52           8 2021-03-04 2021-04-11      1
3     B52           6 2020-04-04 2020-04-11      2
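The approach above groups by company and company_id, so it relies on each pair appearing only once in df1, as it does in the sample data. If a pair could repeat with different date windows, one possible variation is to group by the df1 row instead. The following is a minimal sketch under that assumption; the helper name count_in_window and the temporary row_id column are made up for illustration, and the sketch assumes df1 has a unique index.

```python
import pandas as pd

df1 = pd.DataFrame(
    [
        ["A21", 5, "2021-02-04", "2021-02-11"],
        ["A21", 10, "2020-10-04", "2020-10-11"],
        ["B52", 8, "2021-03-04", "2021-04-11"],
        ["B52", 6, "2020-04-04", "2020-04-11"],
    ],
    columns=["company", "company_id", "date", "max_date"],
)
df2 = pd.DataFrame(
    [
        ["A21", 5, "2021-02-05"],
        ["A21", 5, "2021-02-08"],
        ["A21", 5, "2021-02-12"],
        ["A21", 5, "2021-02-11"],
        ["A21", 10, "2020-10-07"],
        ["B52", 8, "2021-03-07"],
        ["B52", 6, "2020-04-08"],
        ["B52", 6, "2020-04-12"],
        ["B52", 6, "2020-04-05"],
    ],
    columns=["company", "company_id", "date"],
)

for col in ("date", "max_date"):
    df1[col] = pd.to_datetime(df1[col])
df2["date"] = pd.to_datetime(df2["date"])


def count_in_window(left, right):
    """Hypothetical helper: per row of `left`, count rows of `right` with the
    same keys whose date lies within [date, max_date] (inclusive)."""
    # Carry the row label of `left` through the merge as a 'row_id' column,
    # so each row of `left` is counted on its own (assumes a unique index).
    merged = left.rename_axis("row_id").reset_index().merge(
        right, on=["company", "company_id"], suffixes=("", "_")
    )
    in_window = merged["date_"].between(merged["date"], merged["max_date"])
    counts = merged.loc[in_window].groupby("row_id").size()
    # Rows without any match are missing from `counts`; fill them with 0.
    return left.assign(count=counts.reindex(left.index, fill_value=0))


result = count_in_window(df1, df2)
print(result)
```

On the sample data this gives the same counts as the answer above (3, 1, 1, 2). Because reindex fills unmatched rows with 0 directly, the count column stays an integer column and no separate fillna step is needed.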