I have a data table with a `date` field, a `created` field (the user account creation date), and a `customer_id`:
date | created | customer_id |
---|---|---|
2022-01-01 | 2021-05-07 | user1 |
2022-01-02 | 2022-01-02 | user2 |
2022-01-03 | 2021-02-02 | user3 |
2022-01-04 | 2022-01-05 | user4 |
2022-01-05 | 2022-01-05 | user5 |
2022-01-06 | 2022-01-08 | user6 |
I want to get a count of new users (based on the `created` field) grouped by the `date` field:
date | created | customer_id | new_users (based on the `date` column) |
---|---|---|---|
2022-01-01 | 2021-05-07 | user1 | 0 |
2022-01-02 | 2022-01-02 | user2 | 1 |
2022-01-03 | 2021-02-02 | user3 | 0 |
2022-01-04 | 2022-01-05 | user4 | 0 |
2022-01-05 | 2022-01-05 | user5 | 2 |
2022-01-06 | 2022-01-08 | user6 | 0 |
I tried using `groupby`, but I was not able to apply the condition `date == created` to get the count of new users for a particular `date`.
First of all, I think it is better to split your data into two different tables. In the first table, you have only `created` dates and `customer_id`s. This is your actual input. It looks like this:
import pandas as pd

created_table = pd.DataFrame(
    dict(
        created=pd.Series(
            [
                "2022-01-07",
                "2022-01-02",
                "2022-01-05",
                "2022-01-02",
                "2022-01-05",
                "2022-01-05",
            ],
            dtype="datetime64[ns]",
        ),
        customer_id=["user1", "user2", "user5", "user4", "user5", "user6"],
    )
)
created customer_id
0 2022-01-07 user1
1 2022-01-02 user2
2 2022-01-05 user5
3 2022-01-02 user4
4 2022-01-05 user5
5 2022-01-05 user6
I changed it a little bit to make it more illustrative.
Now, as far as I understood, you want to count how many unique `customer_id`s exist for each date. This can be done with `groupby` and `nunique`:
customers_created = created_table.groupby('created')['customer_id'].nunique()
created
2022-01-02    2
2022-01-05    2
2022-01-07    1
Name: customer_id, dtype: int64
Now you probably want to join this result with a series of consecutive dates. First, let's create an index with such dates:
dates = pd.date_range(start="2022-01-01", end="2022-01-10", name="date")
Now let's reindex our series `customers_created` with this new index:
(
customers_created.reindex(dates, fill_value=0)
.to_frame()
.reset_index()
.rename(columns={"customer_id": "new_users"})
)
date new_users
0 2022-01-01 0
1 2022-01-02 2
2 2022-01-03 0
3 2022-01-04 0
4 2022-01-05 2
5 2022-01-06 0
6 2022-01-07 1
7 2022-01-08 0
8 2022-01-09 0
9 2022-01-10 0
Depending on what you need, a series or a dataframe, you may drop the last part (i.e. `.to_frame()`, etc.).
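For instance, keeping the result as a plain series would look like this (a sketch that hard-codes the same per-date counts as the `customers_created` series above):

```python
import pandas as pd

# Per-date unique-customer counts, matching the customers_created series above
customers_created = pd.Series(
    [2, 2, 1],
    index=pd.to_datetime(["2022-01-02", "2022-01-05", "2022-01-07"]),
    name="customer_id",
)

dates = pd.date_range(start="2022-01-01", end="2022-01-10", name="date")

# Without .to_frame()/.reset_index(), the result stays a Series indexed by date
new_users = customers_created.reindex(dates, fill_value=0).rename("new_users")
```

Dates absent from the original index get the `fill_value` of 0, so the series covers all ten days.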
This is probably what you were looking for. In your question, this table was merged with the original table, but I don't think it makes much sense, as there is no relation between the rows of the initial table and the new table.
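That said, if you do want the per-row `new_users` column exactly as in your question, a left merge on `date` would attach it (a sketch with small hypothetical frames standing in for your table and the counts computed above):

```python
import pandas as pd

# Hypothetical per-row table, like the one in the question
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
    "customer_id": ["user1", "user2", "user3"],
})

# Daily new-user counts, e.g. the reindexed result from above
counts = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
    "new_users": [0, 1, 0],
})

# Left merge keeps every original row and adds its date's count
merged = df.merge(counts, on="date", how="left")
```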
It is also possible that all values in `customer_id` are guaranteed to be unique. Then you can replace `created_table.groupby('created')['customer_id'].nunique()` with `created_table['created'].value_counts()`.
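Note that on the sample data above the two are not equivalent, because `user5` appears twice:

```python
import pandas as pd

created_table = pd.DataFrame({
    "created": pd.to_datetime(
        ["2022-01-07", "2022-01-02", "2022-01-05",
         "2022-01-02", "2022-01-05", "2022-01-05"]
    ),
    "customer_id": ["user1", "user2", "user5", "user4", "user5", "user6"],
})

# nunique counts distinct customers per date; value_counts counts rows per date
by_nunique = created_table.groupby("created")["customer_id"].nunique()
by_rows = created_table["created"].value_counts().sort_index()

print(by_nunique.loc["2022-01-05"])  # 2 (user5 counted once)
print(by_rows.loc["2022-01-05"])     # 3 (three rows on that date)
```

So `value_counts` is the cheaper option only when each `customer_id` occurs in exactly one row.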