Combining multiple time-series rows into one row with Pandas

Question

I am using a recurrent neural network to consume time-series events (a click stream). My data needs to be formatted so that each row contains all the events for an id. My data is one-hot encoded, and I have already grouped it by id. I also limit the total number of events per id (e.g. 2), so the final width will always be known (number of one-hot columns x number of events). I need to maintain the order of the events, because they are ordered by time.

Current data state:

     id   page.A   page.B   page.C      
0   001        0        1        0
1   001        1        0        0
2   002        0        0        1
3   002        1        0        0

Required data state:

     id   page.A1   page.B1   page.C1   page.A2   page.B2   page.C2      
0   001        0         1         0         1         0         0
1   002        0         0         1         1         0         0

This looks like a pivot problem to me, but my resulting dataframes are not in the format I need. Any suggestions on how I should approach this?

Answer 1

The idea here is to call reset_index within each 'id' group to get a counter of which row of that particular 'id' we are at. Then follow that up with unstack and sort_index to put the columns where they are supposed to be.

Finally, flatten the MultiIndex.

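# Number the rows within each id (0, 1, ...), then move that counter into the columns.
df1 = df.set_index('id').groupby(level=0) \
    .apply(lambda g: g.reset_index(drop=True)) \
    .unstack().sort_index(axis=1, level=1)  # Thx @jezrael for sort reminder

# Flatten the column MultiIndex: ('page.A', 0) -> 'page.A1'
df1.columns = ['{}{}'.format(x[0], int(x[1]) + 1) for x in df1.columns]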

df1
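
For anyone who wants to run this end to end, here is a minimal sketch of the same steps on the sample data from the question. The frame is rebuilt by hand, and keeping the ids as strings is my assumption (so that "001" keeps its leading zeros); the names per_id and wide are only illustrative.

```python
import pandas as pd

# Sample data from the question; ids kept as strings (an assumption).
df = pd.DataFrame({
    "id": ["001", "001", "002", "002"],
    "page.A": [0, 1, 0, 1],
    "page.B": [1, 0, 0, 0],
    "page.C": [0, 0, 1, 0],
})

# Step 1: inside each id, replace the index with a 0-based row counter.
per_id = df.set_index("id").groupby(level=0).apply(lambda g: g.reset_index(drop=True))

# Step 2: move the counter into the columns and order them event by event.
wide = per_id.unstack().sort_index(axis=1, level=1)

# Step 3: flatten ('page.A', 0) into 'page.A1', and so on.
wide.columns = ["{}{}".format(name, int(pos) + 1) for name, pos in wide.columns]

print(wide)
```

Here id stays in the index; call wide.reset_index() if you want it back as a regular column.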

Answer 2

You can first create a new column with cumcount to build the new column names, then use set_index and unstack. Then sort the columns on level 1 with sort_index, remove the MultiIndex from the columns with a list comprehension, and finally call reset_index:

df['g'] = (df.groupby('id').cumcount() + 1).astype(str)

df1 = df.set_index(['id','g']).unstack()
df1.sort_index(axis=1,level=1, inplace=True)
df1.columns = [''.join(col) for col in df1.columns]
df1.reset_index(inplace=True)
print (df1)
   id  page.A1  page.B1  page.C1  page.A2  page.B2  page.C2
0   1        0        1        0        1        0        0
1   2        0        0        1        1        0        0
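
The question also mentions capping the number of events per id so that the width is always known. A possible sketch of that, building on the cumcount idea: keep the first MAX_EVENTS rows per id, then reindex the columns to the full layout so ids with fewer events are padded. The sample data, the MAX_EVENTS name, and padding missing events with 0 are all assumptions of mine, not something stated in the question.

```python
import pandas as pd

# Hypothetical data: "001" has three events, "003" has only one.
df = pd.DataFrame({
    "id": ["001", "001", "001", "002", "002", "003"],
    "page.A": [0, 1, 0, 0, 1, 0],
    "page.B": [1, 0, 0, 0, 0, 1],
    "page.C": [0, 0, 1, 1, 0, 0],
})

MAX_EVENTS = 2  # assumed cap on events per id
feature_cols = ["page.A", "page.B", "page.C"]

# Keep only the first MAX_EVENTS rows of each id, preserving row (time) order.
capped = df.groupby("id", sort=False).head(MAX_EVENTS).copy()

# Number each event within its id: 1, 2, ...
capped["g"] = capped.groupby("id").cumcount() + 1

# Move the event number into the columns; events an id lacks become 0 (assumed padding).
wide = capped.set_index(["id", "g"])[feature_cols].unstack(fill_value=0)

# Force the full event-by-event column layout, even for event numbers that never occur.
full_cols = pd.MultiIndex.from_tuples(
    [(col, k) for k in range(1, MAX_EVENTS + 1) for col in feature_cols]
)
wide = wide.reindex(columns=full_cols, fill_value=0)

# Flatten ('page.A', 1) into 'page.A1' and bring id back as a column.
wide.columns = ["{}{}".format(col, k) for col, k in wide.columns]
wide = wide.reset_index()

print(wide)
```

Building the column list explicitly instead of sorting keeps the one-hot columns in their original order within each event, which matters if the column names are not already alphabetical.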

Answer 1: score 5, accepted, 2016-09-19 18:45:25

Answer 2: score 3, 2016-09-19 18:45:33
