
Aggregate by counting the distinct value with datetime constraint

I have two Python pandas DataFrames; in simplified form they look like this:

DF1

+---------+---------+------+-------+
| Date_in | Date_out| Group| Item  |
+---------+---------+------+-------+
| 1991-08 | 2000-08 |   A  |   A1  |
| 1991-08 | 2021-02 |   A  |   A2  |
| 1997-02 | 2021-02 |   B  |   B1  |
| 1998-03 | 2001-03 |   C  |   C1  |
| 1999-02 | 2002-02 |   D  |   D1  |
| 2000-09 | 2021-02 |   D  |   D2  |
| 2000-03 | 2001-04 |   D  |   D3  |
| 2001-08 | 2021-02 |   D  |   D4  |
+---------+---------+------+-------+

DF2

+---------+---------+-------+
|  Date   |  Group  |  Item |
+---------+---------+-------+
| 2000-06 |    A    |   A1  |
| 2000-06 |    A    |   A1  |
| 2000-07 |    A    |   A1  |
| 2000-07 |    A    |   A1  |
| 2000-07 |    A    |   A2  |
| 2000-07 |    B    |   B1  |
| 2000-08 |    D    |   D3  |
| 2000-08 |    D    |   D4  |
| 2001-05 |    D    |   D1  |
| 2001-05 |    D    |   D2  |
| 2001-05 |    D    |   D3  |
| 2002-04 |    D    |   D2  |
| 2002-04 |    D    |   D2  |
+---------+---------+-------+
  1. I want to merge DF2 by Date & Group and count how many distinct values of Item exist in DF1, where the dates in the new merged DF lie within the Date_in/Date_out constraints of DF1,

  2. And count how many distinct Items exist, under the datetime constraint, in the new merged DF (I think this part is solved by @Rick_M's answer)

Desired output

+---------+---------+------------------------+-----------------------+
|  Date   |  Group  |      Total_item_1      |       Total_item_2    |
+---------+---------+------------------------+-----------------------+
| 2000-06 |    A    |            2           |            1          |
| 2000-07 |    A    |            1           |            1          |
| 2000-07 |    B    |            1           |            1          |
| 2000-08 |    C    |            1           |            0          |
| 2000-08 |    D    |            3           |            2          |
| 2001-05 |    D    |            3           |            3          |
| 2002-04 |    D    |            2           |            1          |
+---------+---------+------------------------+-----------------------+

I appreciate any comments and feedback; I hope I've explained the idea clearly.

I'm still not quite sure I understand your question, because I don't reproduce the same "desired output" (is there possibly an error above?), but either way I hope this will still be helpful to you.

Your data:

import pandas as pd

df1 = pd.DataFrame.from_records(
    [('1991-08', '2000-08', 'A', 'A1'), ('1991-08', '2021-02', 'A', 'A2'),
     ('1997-02', '2021-02', 'B', 'B1'), ('1998-03', '2001-03', 'C', 'C1'),
     ('1999-02', '2002-02', 'D', 'D1'), ('2000-09', '2021-02', 'D', 'D2'),
     ('2000-03', '2001-04', 'D', 'D3'), ('2001-08', '2021-02', 'D', 'D4')],
    columns=['Date_in', 'Date_out', 'Group', 'Item'])

df2 = pd.DataFrame.from_records([('2000-06', 'A', 'A1'), ('2000-06', 'A', 'A1'),
                 ('2000-07', 'A', 'A1'), ('2000-07', 'A', 'A1'),
                 ('2000-07', 'A', 'A2'), ('2000-07', 'B', 'B1'),
                 ('2000-08', 'D', 'D3'), ('2000-08', 'D', 'D4'),
                 ('2001-05', 'D', 'D1'), ('2001-05', 'D', 'D2'),
                 ('2001-05', 'D', 'D3'), ('2002-04', 'D', 'D2'),
                 ('2002-04', 'D', 'D2')], columns=['Date','Group','Item'])

Converting the date fields to datetime type:

df1['Date_in'] = pd.to_datetime(df1['Date_in'], format="%Y-%m")
df1['Date_out'] = pd.to_datetime(df1['Date_out'], format="%Y-%m")
df2['Date'] = pd.to_datetime(df2['Date'], format="%Y-%m")
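
One detail worth knowing (a quick check, not part of the solution itself): parsing with format="%Y-%m" pins every value to the first day of its month, so all the comparisons below are effectively month-level comparisons made on the 1st:

print(pd.to_datetime('2000-08', format="%Y-%m"))
# 2000-08-01 00:00:00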

We can drop duplicates from df2 immediately:

df2 = df2.drop_duplicates().copy()
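
As a quick sanity check (using the example data above), this removes the three exact duplicate rows:

print(len(df2))   # 10 rows remain of the original 13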

...and then groupby Date and Group, to get what I think is your Total_item_2 column:

tmp1 = df2.groupby(['Date','Group']).nunique().rename(columns={'Item':'Total_item_2'}).reset_index()
print(tmp1)
        Date Group  Total_item_2
0 2000-06-01     A             1
1 2000-07-01     A             2
2 2000-07-01     B             1
3 2000-08-01     D             2
4 2001-05-01     D             3
5 2002-04-01     D             1

For the next part, I'll leave various intermediate steps so that you can inspect what is happening. You could combine some of these steps if you wanted to.

Merge df1 with this new result dataframe, and create a valid_date column that is True if the date satisfies the constraint:

tmp = pd.merge(df1, tmp1[['Date','Group']], on='Group', suffixes=['_1','_2'], how='left')
tmp['valid_date'] = (tmp['Date']>=tmp['Date_in']) & (tmp['Date']<=tmp['Date_out'])
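
If you want to inspect this intermediate result (purely optional), you can look at one group at a time; the check below, for example, prints the Group D rows. Note that the Group C row of df1 gets NaT in the Date column, because C never appears in df2, so its valid_date is False:

print(tmp[tmp['Group'] == 'D'][['Date_in', 'Date_out', 'Item', 'Date', 'valid_date']])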

Then only use the rows with a valid date, and do a similar groupby to what we did earlier:

tmp2 = tmp[tmp['valid_date']].groupby(['Date','Group'])['Item'].nunique().reset_index().rename(columns={'Item':'Total_item_1'})

print(tmp2)
        Date Group  Total_item_1
0 2000-06-01     A             2
1 2000-07-01     A             2
2 2000-07-01     B             1
3 2000-08-01     D             2
4 2001-05-01     D             2
5 2002-04-01     D             2

Finally, you can merge tmp1 and tmp2 together (and reorder the columns):

result = pd.merge(tmp1, tmp2, on=['Date', 'Group'])
result = result[['Date','Group','Total_item_1','Total_item_2']]

print(result)
        Date Group  Total_item_1  Total_item_2
0 2000-06-01     A             2             1
1 2000-07-01     A             2             2
2 2000-07-01     B             1             1
3 2000-08-01     D             2             2
4 2001-05-01     D             2             3
5 2002-04-01     D             2             1
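
For reference, the same logic can also be written a bit more compactly. This is just a sketch of the steps above chained together (same column names and assumptions), not a different method:

# distinct items per (Date, Group) observed in df2
total2 = df2.groupby(['Date', 'Group'])['Item'].nunique().rename('Total_item_2')

# distinct df1 items per (Date, Group) whose [Date_in, Date_out] window contains Date
valid = df1.merge(df2[['Date', 'Group']].drop_duplicates(), on='Group')
valid = valid[(valid['Date'] >= valid['Date_in']) & (valid['Date'] <= valid['Date_out'])]
total1 = valid.groupby(['Date', 'Group'])['Item'].nunique().rename('Total_item_1')

# align the two counts on (Date, Group); keys missing from either side would show up as NaN
result = pd.concat([total1, total2], axis=1).reset_index()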
