Pandas Groupby nunique count based on grouping of 2 date lists
Similar to this question, but adding one more step: Rolling groupby nunique count based on start and end dates
I have a dataframe with a unique ID, a start date, an end date, a start year, and an end year. Over the course of this time, the ID can start, stop, and be restarted.
I would like to get a groupby nunique count of IDs over the course of all years. Currently, I can count unique values between the start and end date of an ID, but how exactly do I incorporate the years?
import pandas as pd

fun = pd.DataFrame({'ZIP_KEY': ['A', 'B', 'A'],
                    'start_month': [1, 2, 2],
                    'end_month': [4, 3, 7],
                    'start_year': [2016, 2016, 2017],
                    'end_year': [2016, 2017, 2018]})
fun["month_list"] = fun.apply(lambda x: list(range(x["start_month"], x["end_month"]+1)), axis=1)
fun["year_list"] = fun.apply(lambda x: list(range(x["start_year"], x["end_year"]+1)), axis=1)
fun = fun.explode("month_list")
fun = fun.explode("year_list")
fun.groupby(["year_list", "month_list"])["ZIP_KEY"].nunique()
year_list  month_list
2016       1             1
           2             2
           3             2
           4             1
2017       2             2
           3             2
           4             1
           5             1
           6             1
           7             1
2018       2             1
           3             1
           4             1
           5             1
           6             1
           7             1
If a Zip Key spans multiple years, my current method does not take the full years into account. For example, for a key that starts Jan 2018 and ends Feb 2020, we get [1,2] and [2018,2019,2020], not the full years for 2018 and 2019. I should get counts for [1,2,3,4,5,6,7,8,9,10,11,12] for [2018, 2019], and [1,2] for 2020.
Similar to my other answer, but this time we use pd.date_range with 'MS' (month start) frequency instead of range. It's helpful to first create datetime columns that are the first of the month for the provided year-month combinations.
import pandas as pd

# Create start and end datetime columns (first day of each month).
for per in ['start', 'end']:
    fun[per] = pd.to_datetime(fun[[f'{per}_year', f'{per}_month']]
                              .rename(columns={f'{per}_year': 'year', f'{per}_month': 'month'})
                              .assign(day=1))

# Expand each row into one row per month between its start and end.
df = pd.concat([pd.DataFrame({'date': pd.date_range(st, en, freq='MS'), 'key': k})
                for k, st, en in zip(fun['ZIP_KEY'], fun['start'], fun['end'])])
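To see why 'MS' covers the multi-year case from the question, here is a minimal sketch for a key that starts Jan 2018 and ends Feb 2020. The variable names months and months_per_year are only illustrative and are not part of the code above.

```python
import pandas as pd

# Span from the question's example: starts Jan 2018, ends Feb 2020.
# 'MS' yields the first day of every month, both endpoints included.
months = pd.date_range("2018-01-01", "2020-02-01", freq="MS")

# Count how many month starts fall in each calendar year.
months_per_year = pd.Series(months.year).value_counts().sort_index()
print(months_per_year)
```

The full years 2018 and 2019 each contribute twelve months and 2020 contributes two, which is the expansion the question asked for.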
Now group for the output. If you want separate columns:
df.groupby([df.date.dt.year.rename('year'), df.date.dt.month.rename('month')]).key.nunique()
year  month
2016  1        1   # <━┓
      2        2   # <━╋━━┓
      3        2   #   A  ┃
      4        2   # <━┛  ┃
      5        1   #      ┃
      6        1   #      ┃
      7        1   #      ┃
      8        1   #      B
      9        1   #      ┃
      10       1   #      ┃
      11       1   #      ┃
      12       1   #      ┃
2017  1        1   #      ┃
      2        2   # <━━━━╋━┓
      3        2   # <━━━━┛ ┃
      4        1   #        ┃
      5        1   #        ┃
      6        1   #        ┃
      7        1   #        ┃
      8        1   #        ┃
      9        1   #        ┃
      10       1   #        A
      11       1   #        ┃
      12       1   #        ┃
2018  1        1   #        ┃
      2        1   #        ┃
      3        1   #        ┃
      4        1   #        ┃
      5        1   #        ┃
      6        1   #        ┃
      7        1   # <━━━━━━┛
I sometimes prefer grouping by the period:
df.groupby(df.date.dt.to_period('M')).key.nunique()
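For anyone who wants to run the whole thing in one go, the following is a self-contained sketch of the same steps applied to the question's sample data, ending with the period grouping. The helpers month_start and monthly_nunique are hypothetical names introduced here for illustration; they are not pandas functions.

```python
import pandas as pd

# Sample data copied from the question.
fun = pd.DataFrame({'ZIP_KEY': ['A', 'B', 'A'],
                    'start_month': [1, 2, 2],
                    'end_month': [4, 3, 7],
                    'start_year': [2016, 2016, 2017],
                    'end_year': [2016, 2017, 2018]})


def month_start(frame, prefix):
    """Hypothetical helper: first-of-month timestamps from <prefix>_year/<prefix>_month."""
    parts = pd.DataFrame({'year': frame[f'{prefix}_year'],
                          'month': frame[f'{prefix}_month'],
                          'day': 1})
    return pd.to_datetime(parts)


def monthly_nunique(frame):
    """Hypothetical helper: unique ZIP_KEY count per monthly period."""
    starts = month_start(frame, 'start')
    ends = month_start(frame, 'end')
    # One row per (key, month) between each start and end, inclusive.
    long = pd.concat(
        [pd.DataFrame({'date': pd.date_range(st, en, freq='MS'), 'key': k})
         for k, st, en in zip(frame['ZIP_KEY'], starts, ends)],
        ignore_index=True)
    return long.groupby(long['date'].dt.to_period('M'))['key'].nunique()


counts = monthly_nunique(fun)
print(counts.head())
```

Grouping by period gives a single PeriodIndex level instead of a (year, month) MultiIndex, which can be easier to slice or plot.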