Pandas Groupby nunique count based on grouping of 2 date lists

Similar to this question, but adding one more step: Rolling groupby nunique count based on start and end dates

I have a dataframe with a unique ID, a start month, an end month, a start year, and an end year. Over the course of this time, the ID can start, stop, and be restarted.

I would like to get a groupby nunique count of IDs for every month across all years. Currently, I can count unique values between the start and end month of an ID, but how do I incorporate the years correctly?

import pandas as pd

fun = pd.DataFrame({'ZIP_KEY': ['A', 'B', 'A'],
                   'start_month': [1, 2, 2],
                   'end_month': [4, 3, 7],
                   'start_year': [2016, 2016, 2017],
                   'end_year': [2016, 2017, 2018]})

fun["month_list"] = fun.apply(lambda x: list(range(x["start_month"], x["end_month"]+1)), axis=1)

fun["year_list"] = fun.apply(lambda x: list(range(x["start_year"], x["end_year"]+1)), axis=1)

fun = fun.explode("month_list")

fun = fun.explode("year_list")

fun.groupby(["year_list", "month_list"])["ZIP_KEY"].nunique()


year_list  month_list
2016       1             1
           2             2
           3             2
           4             1
2017       2             2
           3             2
           4             1
           5             1
           6             1
           7             1
2018       2             1
           3             1
           4             1
           5             1
           6             1
           7             1

If a Zip Key spans multiple years, my current method does not account for the full years in between. For example, if a key starts Jan 2018 and ends Feb 2020, I get months [1,2] and years [2018,2019,2020], so only months 1 and 2 are counted in each year. Instead, I should get counts for months [1,2,3,4,5,6,7,8,9,10,11,12] in [2018, 2019], and for months [1,2] in 2020.

Similar to my other answer, but this time we use pd.date_range with 'MS' (month start) frequency instead of range. It's helpful to first create datetime columns that hold the first day of the month for the provided year-month combinations.

import pandas as pd

# Create start and end datetime column.
for per in ['start', 'end']:
    fun[per] = pd.to_datetime(fun[[f'{per}_year', f'{per}_month']]
                                  .rename(columns={f'{per}_year': 'year', f'{per}_month': 'month'})
                                  .assign(day=1))

df = pd.concat([pd.DataFrame({'date': pd.date_range(st, en, freq='MS'), 'key': k}) 
                for k, st, en in zip(fun['ZIP_KEY'], fun['start'], fun['end'])])
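As a quick sanity check on why month-start dates handle multi-year spans, here is a minimal sketch using the span described in the question (Jan 2018 to Feb 2020). The variable names months and per_year are only illustrative and do not appear in the original code.

```python
import pandas as pd

# Illustrative span from the question: starts Jan 2018, ends Feb 2020.
# Both endpoints are month starts, so both are included in the range.
months = pd.date_range("2018-01-01", "2020-02-01", freq="MS")

# Count how many generated month-start dates fall in each calendar year.
per_year = pd.Series(months).dt.year.value_counts().sort_index()

print(len(months))  # total number of months covered by the span
print(per_year)     # months per year: full years for 2018 and 2019, two months for 2020
```

Because every intermediate month is generated explicitly, the full years 2018 and 2019 are covered, which the separate month and year lists could not express.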

Now group for the output. If you want year and month as separate index levels:

df.groupby([df.date.dt.year.rename('year'), df.date.dt.month.rename('month')]).key.nunique()

year  month
2016  1        1 # <━┓
      2        2 # <━╋━━┓ 
      3        2 #   A  ┃
      4        2 # <━┛  ┃
      5        1 #      ┃
      6        1 #      ┃
      7        1 #      ┃
      8        1 #      B
      9        1 #      ┃
      10       1 #      ┃
      11       1 #      ┃
      12       1 #      ┃
2017  1        1 #      ┃
      2        2 # <━━━━╋━┓    
      3        2 # <━━━━┛ ┃
      4        1 #        ┃
      5        1 #        ┃
      6        1 #        ┃
      7        1 #        ┃
      8        1 #        ┃
      9        1 #        ┃
      10       1 #        A
      11       1 #        ┃
      12       1 #        ┃
2018  1        1 #        ┃
      2        1 #        ┃
      3        1 #        ┃
      4        1 #        ┃
      5        1 #        ┃
      6        1 #        ┃
      7        1 # <━━━━━━┛

I sometimes prefer grouping by the period:

df.groupby(df.date.dt.to_period('M')).key.nunique()
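For reference, the period-based grouping can be run end to end on the sample data from the question. This is a self-contained sketch under the same assumptions as above; the names parts and by_period are introduced here for illustration, and the datetime columns are built with an explicit year/month/day frame rather than the rename/assign chain.

```python
import pandas as pd

# Sample data from the question.
fun = pd.DataFrame({'ZIP_KEY': ['A', 'B', 'A'],
                    'start_month': [1, 2, 2],
                    'end_month': [4, 3, 7],
                    'start_year': [2016, 2016, 2017],
                    'end_year': [2016, 2017, 2018]})

# Build first-of-month datetimes for the start and end of each row.
for per in ['start', 'end']:
    parts = pd.DataFrame({'year': fun[f'{per}_year'],
                          'month': fun[f'{per}_month'],
                          'day': 1})
    fun[per] = pd.to_datetime(parts)

# One row per (key, month) covered by each start/end span.
df = pd.concat([pd.DataFrame({'date': pd.date_range(st, en, freq='MS'), 'key': k})
                for k, st, en in zip(fun['ZIP_KEY'], fun['start'], fun['end'])],
               ignore_index=True)

# Group by monthly period; the result is indexed by a PeriodIndex.
by_period = df.groupby(df['date'].dt.to_period('M'))['key'].nunique()
print(by_period)
```

The result carries a single monthly PeriodIndex (2016-01 through 2018-07 for this data) instead of a two-level year/month index, which can be more convenient for slicing or plotting.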
