Pandas: Enumerate duplicates in index

Let's say I have a list of events that happen on different keys.

import pandas

data = [
    {"key": "A", "event": "created"},
    {"key": "A", "event": "updated"},
    {"key": "A", "event": "updated"},
    {"key": "A", "event": "updated"},
    {"key": "B", "event": "created"},
    {"key": "B", "event": "updated"},
    {"key": "B", "event": "updated"},
    {"key": "C", "event": "created"},
    {"key": "C", "event": "updated"},
    {"key": "C", "event": "updated"},
    {"key": "C", "event": "updated"},
    {"key": "C", "event": "updated"},
    {"key": "C", "event": "updated"},
]

df = pandas.DataFrame(data)

I would like to index my DataFrame by key first and then by an enumeration within each key. It looks like a simple unstack operation, but I can't figure out how to do it properly.

The best I could do was

df.set_index("key", append=True).swaplevel(0, 1)

          event
key            
A   0   created
    1   updated
    2   updated
    3   updated
B   4   created
    5   updated
    6   updated
C   7   created
    8   updated
    9   updated
    10  updated
    11  updated
    12  updated

but what I'm expecting is

          event
key            
A   0   created
    1   updated
    2   updated
    3   updated
B   0   created
    1   updated
    2   updated
C   0   created
    1   updated
    2   updated
    3   updated
    4   updated
    5   updated

I also tried something like

df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()

but the order is not preserved, so I can't apply the result as an index. Besides, it feels like overkill for an operation that looks quite standard...
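
(For reference, a minimal order-safe sketch of this count-and-range idea, assuming each key's rows are contiguous as in the sample data; sizes and enum are names introduced only for illustration:)

import numpy as np

# group sizes in order of first appearance (sort=False keeps the data's key order)
sizes = df.groupby("key", sort=False)["key"].count()
# one 0..n-1 range per key, concatenated in the same order as the rows
enum = np.concatenate([np.arange(n) for n in sizes])
df.set_index(["key", pandas.Index(enum, name="count")])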

Any idea?

groupby + cumcount

Here are a couple of ways:

# new version thanks @ScottBoston
df = df.set_index(['key', df.groupby('key').cumcount()])\
       .rename_axis(['key','count'])

# original version (an alternative to the above, not meant to be run after it)
df = df.assign(count=df.groupby('key').cumcount())\
       .set_index(['key', 'count'])

print(df)

             event
key count         
A   0      created
    1      updated
    2      updated
    3      updated
B   0      created
    1      updated
    2      updated
C   0      created
    1      updated
    2      updated
    3      updated
    4      updated
    5      updated
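
For clarity, `cumcount` on its own simply numbers the rows within each group in order of appearance; that per-group counter is what becomes the second index level above. A minimal sketch on the original frame (before any set_index), shown purely for illustration:

# per-group row counter, aligned with the original row order
counter = pandas.DataFrame(data).groupby('key').cumcount()
print(counter.tolist())
# [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 5]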

You can do this in numpy like this:

import numpy as np

# df like in OP
keys = df['key'].values
# detect indices where key changes value
change = np.zeros(keys.size, dtype=int)
change[1:] = keys[1:] != keys[:-1]
# naive sequential number
seq = np.arange(keys.size)
# offset by seq at most recent change
offset = np.maximum.accumulate(change * seq)
df['seq'] = seq - offset
print(df.set_index(['key', 'seq']))

           event
key seq         
A   0    created
    1    updated
    2    updated
    3    updated
B   0    created
    1    updated
    2    updated
C   0    created
    1    updated
    2    updated
    3    updated
    4    updated
    5    updated
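
As a quick sanity check, this numpy enumeration agrees with `groupby().cumcount()` whenever equal keys form contiguous blocks, as they do here; the comparison below is only a sketch to verify that assumption:

# compare the numpy-based counter with pandas' cumcount on the original data
cum = pandas.DataFrame(data).groupby('key').cumcount()
assert (cum.values == df['seq'].values).all()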
