
groupby in pandas with where clause to get count

I have a DataFrame with a date field, a created field (the user account creation date), and a customer_id.

date        created     customer_id
2022-01-01  2021-05-07  user1
2022-01-02  2022-01-02  user2
2022-01-03  2021-02-02  user3
2022-01-04  2022-01-05  user4
2022-01-05  2022-01-05  user5
2022-01-06  2022-01-08  user6

I want to get a count of new users (based on the created field) grouped by the date field:

date        created     customer_id  new_users (based on the date column)
2022-01-01  2021-05-07  user1        0
2022-01-02  2022-01-02  user2        1
2022-01-03  2021-02-02  user3        0
2022-01-04  2022-01-05  user4        0
2022-01-05  2022-01-05  user5        2
2022-01-06  2022-01-08  user6        0

I tried using groupby, but I was not able to apply the condition date == created to get the count of new users for each value of the date field.

First of all, I think it is better to split your data into two different tables. The first table holds only the created dates and the customer_ids. This is your actual input; it looks like this:

import pandas as pd

created_table = pd.DataFrame(
    dict(
        created=pd.Series(
            [
                "2022-01-07",
                "2022-01-02",
                "2022-01-05",
                "2022-01-02",
                "2022-01-05",
                "2022-01-05",
            ],
            dtype="datetime64[ns]",
        ),
        customer_id=["user1", "user2", "user5", "user4", "user5", "user6"],
    )
)

    created     customer_id
0   2022-01-07  user1
1   2022-01-02  user2
2   2022-01-05  user5
3   2022-01-02  user4
4   2022-01-05  user5
5   2022-01-05  user6

I changed it a little bit to make it more illustrative.

Now, as far as I understood, you want to count how many unique customer_id values exist for each date. This can be done with groupby and nunique:

customers_created = created_table.groupby('created')['customer_id'].nunique()

created     customer_id
2022-01-02  2
2022-01-05  2
2022-01-07  1

Now you probably want to join this result with a series of consecutive dates. First, let's create an index with such dates:

dates = pd.date_range(start="2022-01-01", end="2022-01-10", name="date")

Now let's reindex our series customers_created with this new index:

(
    customers_created.reindex(dates, fill_value=0)
    .to_frame()
    .reset_index()
    .rename(columns={"customer_id": "new_users"})
)

    date        new_users
0   2022-01-01  0
1   2022-01-02  2
2   2022-01-03  0
3   2022-01-04  0
4   2022-01-05  2
5   2022-01-06  0
6   2022-01-07  1
7   2022-01-08  0
8   2022-01-09  0
9   2022-01-10  0


Depending on whether you need a Series or a DataFrame, you may drop the last part (i.e. .to_frame(), .reset_index(), and .rename(...)).
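As a sketch, here is the Series variant of the full pipeline, reconstructed from the snippets above (stopping before .to_frame(); the .rename("new_users") at the end is optional):

```python
import pandas as pd

# Sample data reconstructed from the answer above.
created_table = pd.DataFrame(
    dict(
        created=pd.Series(
            ["2022-01-07", "2022-01-02", "2022-01-05",
             "2022-01-02", "2022-01-05", "2022-01-05"],
            dtype="datetime64[ns]",
        ),
        customer_id=["user1", "user2", "user5", "user4", "user5", "user6"],
    )
)
dates = pd.date_range(start="2022-01-01", end="2022-01-10", name="date")

# Unique creations per date, reindexed onto the full date range;
# dates with no creations become 0.
new_users = (
    created_table.groupby("created")["customer_id"]
    .nunique()
    .reindex(dates, fill_value=0)
    .rename("new_users")
)
```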

This is probably what you were looking for. In your question, this table was merged with the original table, but I don't think that makes much sense, as there is no relation between the rows of the initial table and the new table.
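That said, if you do want the count attached to the original rows exactly as in the question's expected output, one sketch (using a reconstruction of the question's table, which is not part of the answer above) is to map the per-date counts onto the date column:

```python
import pandas as pd

# Hypothetical reconstruction of the question's table: a reporting `date`
# column alongside each user's account-creation date.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03",
                            "2022-01-04", "2022-01-05", "2022-01-06"]),
    "created": pd.to_datetime(["2021-05-07", "2022-01-02", "2021-02-02",
                               "2022-01-05", "2022-01-05", "2022-01-08"]),
    "customer_id": ["user1", "user2", "user3", "user4", "user5", "user6"],
})

# Count distinct new accounts per creation date, then map those counts
# onto the `date` column; dates with no creations get 0.
per_day = df.groupby("created")["customer_id"].nunique()
df["new_users"] = df["date"].map(per_day).fillna(0).astype(int)
# df["new_users"] is now [0, 1, 0, 0, 2, 0], matching the question.
```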

It is also possible that all values in customer_id are guaranteed to be unique. Then you can replace created_table.groupby('created')['customer_id'].nunique() with created_table['created'].value_counts() .
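A quick sketch of that equivalence (with made-up, guaranteed-unique ids; sorting the value_counts result by index so the two line up):

```python
import pandas as pd

# When every customer_id is unique, counting rows per created date
# (value_counts) equals counting unique ids per created date (nunique).
created_table = pd.DataFrame({
    "created": pd.to_datetime(["2022-01-07", "2022-01-02", "2022-01-02",
                               "2022-01-05", "2022-01-05", "2022-01-05"]),
    "customer_id": ["u1", "u2", "u3", "u4", "u5", "u6"],  # all unique
})

a = created_table.groupby("created")["customer_id"].nunique()
b = created_table["created"].value_counts().sort_index()

assert list(a) == list(b)
assert a.index.equals(b.index)
```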
