Get unique values in column B for each unique record in column A using python/pandas
I'm looking for a quick and efficient way to do the following task.
I need to create a column that holds, for each DeviceID, an array of the unique SessionStartDate values recorded for that DeviceID.
For example:
Although user 8846620190473426378 may have had 30 sessions on 2018-08-01 and 25 sessions on 2018-08-02, I'm only interested in the unique dates on which these sessions occurred.
Currently, I'm using this approach:
df_main['active_days'] = [
    sorted(
        list(
            set(
                sessions['SessionStartDate'].loc[sessions['DeviceID'] == x['DeviceID']]
            )
        )
    )
    for _, x in df_main.iterrows()
]
Here df_main is another DataFrame, containing aggregated data grouped by DeviceID.
This approach is very slow (Wall time: 1h 45min 58s), and I believe there's a better solution for the task.
Thanks in advance!
I believe you need sort_values with SeriesGroupBy.unique:
import pandas as pd

# sample data: four sessions spread over two devices
rng = pd.date_range('2017-04-03', periods=4)
sessions = pd.DataFrame({'SessionStartDate': rng, 'DeviceID': [1, 2, 1, 2]})
print(sessions)
  SessionStartDate  DeviceID
0       2017-04-03         1
1       2017-04-04         2
2       2017-04-05         1
3       2017-04-06         2
# if necessary, convert datetimes to dates
sessions['SessionStartDate'] = sessions['SessionStartDate'].dt.date
out = (sessions.sort_values('SessionStartDate')
               .groupby('DeviceID')['SessionStartDate']
               .unique())
print(out)
DeviceID
1    [2017-04-03, 2017-04-05]
2    [2017-04-04, 2017-04-06]
Name: SessionStartDate, dtype: object
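To get these arrays into the active_days column of df_main from the question, one option is to map the grouped result onto df_main['DeviceID']. The following is only a minimal sketch: the sample frames are made up for illustration, and it assumes df_main has one row per DeviceID.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's frames.
sessions = pd.DataFrame({
    'SessionStartDate': pd.to_datetime(
        ['2018-08-02', '2018-08-01', '2018-08-01', '2018-08-03']),
    'DeviceID': [1, 1, 1, 2],
})
df_main = pd.DataFrame({'DeviceID': [1, 2, 3]})

# Convert datetimes to plain dates, as in the answer above.
sessions['SessionStartDate'] = sessions['SessionStartDate'].dt.date

# One array of unique dates per DeviceID, in ascending order
# because the rows are sorted before grouping.
active = (sessions.sort_values('SessionStartDate')
                  .groupby('DeviceID')['SessionStartDate']
                  .unique())

# Align by DeviceID; devices with no sessions end up with NaN.
df_main['active_days'] = df_main['DeviceID'].map(active)
print(df_main)
```

Because the grouping runs once over sessions instead of filtering sessions once per row of df_main, this should avoid the repeated full scans that make the iterrows version slow.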
Another solution is to remove duplicates with drop_duplicates and then use groupby, converting each group to a list:
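# convert datetimes to dates only if the column is still datetime64
# (skip this line if it was already converted in the first solution)
sessions['SessionStartDate'] = sessions['SessionStartDate'].dt.date
out = (sessions.sort_values('SessionStartDate')
               .drop_duplicates(['DeviceID', 'SessionStartDate'])
               .groupby('DeviceID')['SessionStartDate']
               .apply(list))
print(out)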
DeviceID
1    [2017-04-03, 2017-04-05]
2    [2017-04-04, 2017-04-06]
Name: SessionStartDate, dtype: object
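As a sanity check, the two grouped approaches can be compared against the per-device computation from the question on a small frame. This is a rough sketch with made-up sample data containing repeated dates; the names by_unique, by_lists and by_loop are illustrative only.

```python
import pandas as pd

# Hypothetical sample data with repeated dates per device.
sessions = pd.DataFrame({
    'SessionStartDate': pd.to_datetime(
        ['2018-08-02', '2018-08-01', '2018-08-01',
         '2018-08-03', '2018-08-03']),
    'DeviceID': [1, 1, 1, 2, 2],
})
sessions['SessionStartDate'] = sessions['SessionStartDate'].dt.date

# First approach: sort, group, take unique values (arrays turned into lists).
by_unique = (sessions.sort_values('SessionStartDate')
                     .groupby('DeviceID')['SessionStartDate']
                     .unique()
                     .apply(list))

# Second approach: sort, drop duplicate pairs, group, collect lists.
by_lists = (sessions.sort_values('SessionStartDate')
                    .drop_duplicates(['DeviceID', 'SessionStartDate'])
                    .groupby('DeviceID')['SessionStartDate']
                    .apply(list))

# The question's logic, computed once per device for reference.
by_loop = {
    d: sorted(set(sessions.loc[sessions['DeviceID'] == d, 'SessionStartDate']))
    for d in sessions['DeviceID'].unique()
}

# True when all three agree for every device.
same = all(by_unique.loc[d] == by_lists.loc[d] == by_loop[d] for d in by_loop)
print(same)
```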