Reshaping pandas DataFrame from long to wide while adding many columns

I have a long DataFrame df in the following format:

user_id day action1 action2 action3 action4 action5
      1   0       4       2       0       1       0
      1   1       4       2       0       1       0
      2   1       4       2       0       1       0

The values in the action columns represent the number of times the user took that action on that day. I would like to convert this into a wide DataFrame while being able to extend the time frame arbitrarily (say, to 365 days).

I can reshape to wide fairly easily with:

df_indexed = df.set_index(['user_id', 'day'])
df_wide = df_indexed.unstack().fillna(0)

How would I go about adding the remaining 358 days filled with 0 for each of the five actions?

Here's something similar to what @ViktorKerkez suggested, but using pandas.merge:

In [83]: df
Out[83]:
   user_id  day  action1  action2  action3  action4  action5
0        1    0        4        2        0        1        0
1        1    1        4        2        0        1        0
2        2    1        4        2        0        1        0

In [84]: days_joiner = pd.DataFrame(list(itertools.product(df.user_id.unique(), range(365))), columns=['user_id', 'day'])

In [85]: result = pd.merge(df, days_joiner, how='outer')

In [86]: result.head(10)
Out[86]:
   user_id  day  action1  action2  action3  action4  action5
0        1    0        4        2        0        1        0
1        1    1        4        2        0        1        0
2        2    1        4        2        0        1        0
3        1    2      NaN      NaN      NaN      NaN      NaN
4        1    3      NaN      NaN      NaN      NaN      NaN
5        1    4      NaN      NaN      NaN      NaN      NaN
6        1    5      NaN      NaN      NaN      NaN      NaN
7        1    6      NaN      NaN      NaN      NaN      NaN
8        1    7      NaN      NaN      NaN      NaN      NaN
9        1    8      NaN      NaN      NaN      NaN      NaN

In [87]: result.fillna(0).head(10)
Out[87]:
   user_id  day  action1  action2  action3  action4  action5
0        1    0        4        2        0        1        0
1        1    1        4        2        0        1        0
2        2    1        4        2        0        1        0
3        1    2        0        0        0        0        0
4        1    3        0        0        0        0        0
5        1    4        0        0        0        0        0
6        1    5        0        0        0        0        0
7        1    6        0        0        0        0        0
8        1    7        0        0        0        0        0
9        1    8        0        0        0        0        0
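One detail the session above glosses over: after an outer merge the missing rows hold NaN, which turns the action columns into floats, and the row order is not guaranteed to be grouped by user. The following is a minimal sketch of finishing the job, assuming the same sample data as the question and a 365-day window; the names `action_cols` and `wide` are just illustrative.

```python
import itertools

import pandas as pd

# Sample data in the same shape as the question's df.
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "day": [0, 1, 1],
    "action1": [4, 4, 4],
    "action2": [2, 2, 2],
    "action3": [0, 0, 0],
    "action4": [1, 1, 1],
    "action5": [0, 0, 0],
})

# All (user_id, day) combinations for an assumed 365-day window.
days_joiner = pd.DataFrame(
    list(itertools.product(df["user_id"].unique(), range(365))),
    columns=["user_id", "day"],
)

# Outer merge on the shared columns (user_id, day); absent actions become NaN.
result = pd.merge(df, days_joiner, how="outer")

# NaN forces the action columns to float, so fill and cast back to integers.
action_cols = [c for c in result.columns if c.startswith("action")]
result[action_cols] = result[action_cols].fillna(0).astype(int)

# Sort for readability, then pivot the days into columns.
result = result.sort_values(["user_id", "day"]).reset_index(drop=True)
wide = result.set_index(["user_id", "day"]).unstack("day")

print(wide.shape)
```

The wide frame has one row per user and one column per (action, day) pair.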

To be fair, here's a %timeit comparison of the two methods:

In [90]: timeit pd.merge(df, days_joiner, how='outer')
1000 loops, best of 3: 1.33 ms per loop

In [96]: timeit df_indexed.reindex(index, fill_value=0)
10000 loops, best of 3: 146 µs per loop

My answer is slower by about 9x!
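The exact numbers will vary with the machine and the pandas version, so it may be worth re-running the comparison yourself. Below is a rough sketch using the standard-library `timeit` module outside of IPython; the helper names `via_merge` and `via_reindex` are made up for this example, and it first checks that both approaches produce the same values once aligned.

```python
import itertools
import timeit

import pandas as pd

# Sample data in the same shape as the question's df.
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "day": [0, 1, 1],
    "action1": [4, 4, 4],
    "action2": [2, 2, 2],
    "action3": [0, 0, 0],
    "action4": [1, 1, 1],
    "action5": [0, 0, 0],
})

# Shared inputs: every (user_id, day) pair for an assumed 365-day window.
pairs = list(itertools.product(df["user_id"].unique(), range(365)))
days_joiner = pd.DataFrame(pairs, columns=["user_id", "day"])
index = pd.MultiIndex.from_tuples(pairs, names=["user_id", "day"])
df_indexed = df.set_index(["user_id", "day"])


def via_merge():
    return pd.merge(df, days_joiner, how="outer")


def via_reindex():
    return df_indexed.reindex(index, fill_value=0)


# Align the merge result to the reindex result and confirm they match.
merged = (
    via_merge()
    .fillna(0)
    .set_index(["user_id", "day"])
    .sort_index()
    .astype("int64")
)
reindexed = via_reindex().sort_index()
pd.testing.assert_frame_equal(merged, reindexed)

# Time each approach; only the relative difference is meaningful.
for fn in (via_merge, via_reindex):
    seconds = timeit.timeit(fn, number=200) / 200
    print(f"{fn.__name__}: {seconds * 1e6:.0f} us per call")
```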

Starting from your MultiIndexed DataFrame, build a new index with itertools.product that combines every user in the DataFrame with every day you want, then reindex against it, filling the missing rows with 0.

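import itertools

import pandas as pd

users = df.user_id.unique()
df_indexed = df.set_index(['user_id', 'day'])
# Every (user_id, day) pair that should appear in the result
index = pd.MultiIndex.from_tuples(list(itertools.product(users, range(365))),
                                  names=['user_id', 'day'])
# Pairs missing from df_indexed are added with 0 in every action column
reindexed = df_indexed.reindex(index, fill_value=0)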
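To get all the way to the wide format the question asks for, the reindexed frame still needs to be unstacked. Here is a self-contained sketch of the whole pipeline, assuming the question's sample data and a 365-day window (`n_days`, `full_index`, `df_long` and `df_wide` are names chosen for this example):

```python
import itertools

import pandas as pd

# Sample data in the same shape as the question's df.
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "day": [0, 1, 1],
    "action1": [4, 4, 4],
    "action2": [2, 2, 2],
    "action3": [0, 0, 0],
    "action4": [1, 1, 1],
    "action5": [0, 0, 0],
})

n_days = 365  # assumed length of the time frame
users = df["user_id"].unique()

# Every (user_id, day) pair that should exist in the result.
full_index = pd.MultiIndex.from_tuples(
    list(itertools.product(users, range(n_days))),
    names=["user_id", "day"],
)

# Reindex the long frame, filling absent pairs with 0,
# then pivot the days into columns.
df_long = df.set_index(["user_id", "day"]).reindex(full_index, fill_value=0)
df_wide = df_long.unstack("day")

print(df_wide.shape)
```

Because every (user, day) pair already exists after the reindex, the unstack introduces no NaN values and no further fillna is needed. `pd.MultiIndex.from_product([users, range(n_days)], names=[...])` would build the same index without going through itertools.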
