
groupby in pandas with where clause to get count

I have a DataFrame with a date field, a created field (the user account creation date), and a customer_id.

date        created     customer_id
2022-01-01  2021-05-07  user1
2022-01-02  2022-01-02  user2
2022-01-03  2021-02-02  user3
2022-01-04  2022-01-05  user4
2022-01-05  2022-01-05  user5
2022-01-06  2022-01-08  user6

I want to get a count of new users (based on the created field) grouped by the date field:

date        created     customer_id  new_users (based on the date column)
2022-01-01  2021-05-07  user1        0
2022-01-02  2022-01-02  user2        1
2022-01-03  2021-02-02  user3        0
2022-01-04  2022-01-05  user4        0
2022-01-05  2022-01-05  user5        2
2022-01-06  2022-01-08  user6        0

I tried using groupby, but I was not able to apply the condition date == created to get the count of new users for each value of the date field.

First of all, I think it is better to split your data into two different tables. The first table holds only the created dates and the customer_ids. This is your actual input; it looks like this:

import pandas as pd

created_table = pd.DataFrame(
    dict(
        created=pd.Series(
            [
                "2022-01-07",
                "2022-01-02",
                "2022-01-05",
                "2022-01-02",
                "2022-01-05",
                "2022-01-05",
            ],
            dtype="datetime64[ns]",
        ),
        customer_id=["user1", "user2", "user5", "user4", "user5", "user6"],
    )
)

    created     customer_id
0   2022-01-07  user1
1   2022-01-02  user2
2   2022-01-05  user5
3   2022-01-02  user4
4   2022-01-05  user5
5   2022-01-05  user6

I changed it a little bit to make it more illustrative.

Now, as far as I understood, you want to count how many unique customer_id values exist for each date. This can be done with groupby and nunique:

customers_created = created_table.groupby('created')['customer_id'].nunique()

created     customer_id
2022-01-02  2
2022-01-05  2
2022-01-07  1

Now you probably want to join this result with a series of consecutive dates. First, let's create an index with such dates:

dates = pd.date_range(start="2022-01-01", end="2022-01-10", name="date")

Now let's reindex our series customers_created with this new index:

(
    customers_created.reindex(dates, fill_value=0)
    .to_frame()
    .reset_index()
    .rename(columns={"customer_id": "new_users"})
)

    date        new_users
0   2022-01-01  0
1   2022-01-02  2
2   2022-01-03  0
3   2022-01-04  0
4   2022-01-05  2
5   2022-01-06  0
6   2022-01-07  1
7   2022-01-08  0
8   2022-01-09  0
9   2022-01-10  0


Depending on whether you need a Series or a DataFrame, you may drop the last part (i.e. .to_frame(), .reset_index(), and .rename(...)).
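As a sketch, here is the Series variant of the full pipeline, reconstructed from the snippets above (stopping before .to_frame(); the .rename("new_users") at the end is optional):

```python
import pandas as pd

# Sample data reconstructed from the answer above.
created_table = pd.DataFrame(
    dict(
        created=pd.Series(
            ["2022-01-07", "2022-01-02", "2022-01-05",
             "2022-01-02", "2022-01-05", "2022-01-05"],
            dtype="datetime64[ns]",
        ),
        customer_id=["user1", "user2", "user5", "user4", "user5", "user6"],
    )
)
dates = pd.date_range(start="2022-01-01", end="2022-01-10", name="date")

# Unique creations per date, reindexed onto the full date range;
# dates with no creations become 0.
new_users = (
    created_table.groupby("created")["customer_id"]
    .nunique()
    .reindex(dates, fill_value=0)
    .rename("new_users")
)
```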

This is probably what you were looking for. In your question, this table was merged with the original table, but I don't think that makes much sense, as there is no relation between the rows of the initial table and the new table.
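That said, if you do want the count attached to the original rows exactly as in the question's expected output, one sketch (using a reconstruction of the question's table, which is not part of the answer above) is to map the per-date counts onto the date column:

```python
import pandas as pd

# Hypothetical reconstruction of the question's table: a reporting `date`
# column alongside each user's account-creation date.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03",
                            "2022-01-04", "2022-01-05", "2022-01-06"]),
    "created": pd.to_datetime(["2021-05-07", "2022-01-02", "2021-02-02",
                               "2022-01-05", "2022-01-05", "2022-01-08"]),
    "customer_id": ["user1", "user2", "user3", "user4", "user5", "user6"],
})

# Count distinct new accounts per creation date, then map those counts
# onto the `date` column; dates with no creations get 0.
per_day = df.groupby("created")["customer_id"].nunique()
df["new_users"] = df["date"].map(per_day).fillna(0).astype(int)
# df["new_users"] is now [0, 1, 0, 0, 2, 0], matching the question.
```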

It is also possible that all values in customer_id are guaranteed to be unique. Then you can replace created_table.groupby('created')['customer_id'].nunique() with created_table['created'].value_counts() .
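A quick sketch of that equivalence (with made-up, guaranteed-unique ids; sorting the value_counts result by index so the two line up):

```python
import pandas as pd

# When every customer_id is unique, counting rows per created date
# (value_counts) equals counting unique ids per created date (nunique).
created_table = pd.DataFrame({
    "created": pd.to_datetime(["2022-01-07", "2022-01-02", "2022-01-02",
                               "2022-01-05", "2022-01-05", "2022-01-05"]),
    "customer_id": ["u1", "u2", "u3", "u4", "u5", "u6"],  # all unique
})

a = created_table.groupby("created")["customer_id"].nunique()
b = created_table["created"].value_counts().sort_index()

assert list(a) == list(b)
assert a.index.equals(b.index)
```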
