Get unique values in column B for each unique record in column A using python/pandas

Question

I'm looking for a quick and efficient way to do the following task.

I need to create a separate column holding, for each DeviceID, an array of that device's unique SessionStartDate values.

For example:

8846620190473426378 | [2018-08-01, 2018-08-02]
381156181455864495 | [2018-08-01]

Though user 8846620190473426378 may have had 30 sessions on 2018-08-01 and 25 sessions on 2018-08-02, I'm only interested in the unique dates on which these sessions occurred.

Currently, I'm using this approach:

df_main['active_days'] = [
sorted(
    list(
        set(
            sessions['SessionStartDate'].loc[sessions['DeviceID'] == x['DeviceID']]
            )
        )
    )  
for _, x in df_main.iterrows()
]

Here df_main is another DataFrame, containing aggregated data grouped by DeviceID.

This approach is very slow (Wall time: 1h 45min 58s), since it filters the whole sessions frame once per row of df_main, and I believe there's a better solution for the task.

Thanks in advance!

Answer 1

I believe you need sort_values with SeriesGroupBy.unique:

import pandas as pd

rng = pd.date_range('2017-04-03', periods=4)
sessions = pd.DataFrame({'SessionStartDate': rng, 'DeviceID': [1, 2, 1, 2]})
print (sessions)
  SessionStartDate  DeviceID
0       2017-04-03         1
1       2017-04-04         2
2       2017-04-05         1
3       2017-04-06         2

#if necessary convert datetimes to dates
sessions['SessionStartDate'] = sessions['SessionStartDate'].dt.date
out = (sessions.sort_values('SessionStartDate')
               .groupby('DeviceID')['SessionStartDate']
               .unique())
print (out)
DeviceID
1    [2017-04-03, 2017-04-05]
2    [2017-04-04, 2017-04-06]
Name: SessionStartDate, dtype: object

Another solution is to remove duplicates with drop_duplicates, then use groupby and convert each group to a list:

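#if necessary convert datetimes to dates
#(skip this line if the column was already converted; .dt only works on datetime columns)
sessions['SessionStartDate'] = sessions['SessionStartDate'].dt.date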
out = (sessions.sort_values('SessionStartDate')
               .drop_duplicates(['DeviceID', 'SessionStartDate'])
               .groupby('DeviceID')['SessionStartDate']
               .apply(list))
print (out)
DeviceID
1    [2017-04-03, 2017-04-05]
2    [2017-04-04, 2017-04-06]
Name: SessionStartDate, dtype: object
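
To get the result into df_main as in the question, one option is to look up each DeviceID in the grouped Series with Series.map. The following is only a sketch: the sample data and the contents of df_main are made up for illustration, and it assumes df_main has a DeviceID column rather than using DeviceID as its index.

```python
import pandas as pd

# Hypothetical sample data standing in for the real frames from the question.
sessions = pd.DataFrame({
    'DeviceID': [1, 2, 1, 2, 1],
    'SessionStartDate': pd.to_datetime(
        ['2017-04-05', '2017-04-04', '2017-04-03', '2017-04-06', '2017-04-03']),
})
df_main = pd.DataFrame({'DeviceID': [2, 1, 3]})  # device 3 has no sessions

# Reduce timestamps to dates, then build one sorted, de-duplicated list per device.
sessions['SessionStartDate'] = sessions['SessionStartDate'].dt.date
out = (sessions.sort_values('SessionStartDate')
               .drop_duplicates(['DeviceID', 'SessionStartDate'])
               .groupby('DeviceID')['SessionStartDate']
               .apply(list))

# Look up each DeviceID in out; devices without any session get NaN.
df_main['active_days'] = df_main['DeviceID'].map(out)
print(df_main)
```

If DeviceID is the index of df_main instead, assigning df_main['active_days'] = out should align on the index in the same way.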


Answer 1: accepted, 2018-12-06 10:01:03
