How to create a new pandas column with an incrementing sequence ID that keeps the same value within each group

Question

I have a pandas DataFrame that looks like this:

import pandas as pd

df = pd.DataFrame({'hourOfDay': [5, 5, 8, 8, 13, 13],
                   'category': ['pageA', 'pageB', 'pageA', 'pageB', 'pageA', 'pageB']})

    hourOfDay   category
0   5           pageA
1   5           pageB
2   8           pageA
3   8           pageB
4   13          pageA
5   13          pageB

Now I want to create a new column with a monotonically increasing ID. The ID should have the same value within each group (hourOfDay). The expected DataFrame is shown below.

    hourOfDay   category    index
0           5   pageA       1
1           5   pageB       1
2           8   pageA       2
3           8   pageB       2
4          13   pageA       3
5          13   pageB       3

For simplicity, assume for now that the category column can hold only two values, although this may be extended later. If I group by hourOfDay, every page category within a group should be assigned the same value. I can do this by splitting the main DataFrame into two DataFrames (filtered by category), sorting each one, creating a new column with the df.groupby("hourOfDay").cumcount() method, and finally merging the two DataFrames. That approach seems far too convoluted, so I was wondering whether there is a simpler way to achieve the same thing.

Answer 1

Try:

>>> df['index'] = df['hourOfDay'].ne(df['hourOfDay'].shift()).cumsum()
>>> df
  hourOfDay category  index
0         5    pageA      1
1         5    pageB      1
2         8    pageA      2
3         8    pageB      2
4        13    pageA      3
5        13    pageB      3

Use ne and shift to flag each row whose value differs from the previous row's value, then use cumsum to add up the resulting True and False values (counted as 1 and 0), so the ID goes up by one at the start of each group. This assumes that rows with the same hourOfDay are adjacent.
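To illustrate the shift-and-compare idea on groups of different sizes, here is a minimal sketch. The data is hypothetical (it is not from the question), and it assumes that rows with the same hourOfDay sit next to each other; each row is compared with the row above it.

```python
import pandas as pd

# Hypothetical data: group sizes differ (3 rows, 1 row, 2 rows).
df = pd.DataFrame({
    'hourOfDay': [5, 5, 5, 8, 13, 13],
    'category': ['pageA', 'pageB', 'pageC', 'pageA', 'pageA', 'pageB'],
})

# True wherever the value differs from the row above. The first row is
# compared against NaN, so it is always flagged as the start of a group.
starts_new_group = df['hourOfDay'].ne(df['hourOfDay'].shift())

# Summing the booleans (True counts as 1) numbers the runs 1, 2, 3, ...
df['index'] = starts_new_group.cumsum()

print(df)
```

Comparing with the previous row, rather than the next one, is what keeps the numbering correct when a group has more or fewer than two rows.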

Answer 2

If you need the same index for every row with the same hourOfDay, use GroupBy.ngroup:

df['index'] = df.groupby('hourOfDay', sort=True).ngroup() + 1

Or factorize:

df = df.sort_values('hourOfDay')
df['index'] = pd.factorize(df['hourOfDay'])[0] + 1
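The two calls can number the groups differently when the rows are not already sorted, which is presumably why the factorize variant sorts first. The sketch below uses hypothetical, deliberately shuffled data, and the column names by_ngroup, by_first_seen and by_factorize_sorted are made up for the illustration.

```python
import pandas as pd

# Hypothetical data: hours are unsorted and not adjacent.
df = pd.DataFrame({
    'hourOfDay': [13, 5, 8, 5, 13, 8],
    'category': ['pageA', 'pageA', 'pageA', 'pageB', 'pageB', 'pageB'],
})

# ngroup numbers the groups in sorted key order (sort=True is the default),
# regardless of where the rows appear in the frame.
df['by_ngroup'] = df.groupby('hourOfDay', sort=True).ngroup() + 1

# factorize numbers the values in order of first appearance ...
df['by_first_seen'] = pd.factorize(df['hourOfDay'])[0] + 1

# ... unless sort=True is passed, which matches the ngroup numbering
# without having to reorder the DataFrame.
df['by_factorize_sorted'] = pd.factorize(df['hourOfDay'], sort=True)[0] + 1

print(df)
```

Either way, rows with the same hourOfDay receive the same ID even when they are scattered through the frame; only the order in which IDs are handed out differs.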

Answer 3

Use diff and cumsum:

df['index'] = df['hourOfDay'].diff().ne(0).cumsum()
print(df)

# Output:
  hourOfDay category  index
0         5    pageA      1
1         5    pageB      1
2         8    pageA      2
3         8    pageB      2
4        13    pageA      3
5        13    pageB      3
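One thing worth checking before choosing the diff/cumsum approach: it labels consecutive runs, not groups, and diff needs a numeric column. The sketch below uses hypothetical data in which hour 5 appears in two separate runs; the column names run_id and group_id are made up for the comparison.

```python
import pandas as pd

# Hypothetical data: hour 5 shows up again after hour 8.
df = pd.DataFrame({
    'hourOfDay': [5, 5, 8, 8, 5, 5],
    'category': ['pageA', 'pageB', 'pageA', 'pageB', 'pageA', 'pageB'],
})

# diff() is NaN on the first row and non-zero wherever the value changes,
# so every change starts a new ID, even if the value was seen before.
df['run_id'] = df['hourOfDay'].diff().ne(0).cumsum()

# ngroup() ties the ID to the value itself, so both runs of hour 5 share it.
df['group_id'] = df.groupby('hourOfDay').ngroup() + 1

print(df)
```

If the data is sorted by hourOfDay, as in the question, both columns come out the same; they diverge only when a value recurs after a gap.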

3 answers

Answer 1
1 2021-10-11 07:42:24

Answer 2
1 accepted 2021-10-11 07:46:10

Answer 3
1 2021-10-11 08:35:06
