How to create a new pandas column with an increasing sequence id that keeps the same value within each group
I have a pandas dataframe that looks like the one below:
import pandas as pd

df = pd.DataFrame({'hourOfDay': [5, 5, 8, 8, 13, 13],
                   'category': ['pageA', 'pageB', 'pageA', 'pageB', 'pageA', 'pageB']})
   hourOfDay category
0          5    pageA
1          5    pageB
2          8    pageA
3          8    pageB
4         13    pageA
5         13    pageB
Now I want to create a new column containing a monotonically increasing id. The id should have the same value for every row within a group (hourOfDay). The expected dataframe is shown below.
   hourOfDay category  index
0          5    pageA      1
1          5    pageB      1
2          8    pageA      2
3          8    pageB      2
4         13    pageA      3
5         13    pageB      3
For simplicity, we can assume for now that the category column has only two values, though this may be extended later. If I group by hourOfDay, every page category within that group should be assigned the same value. I can do this by splitting the main dataframe into two dataframes (filtered by category), sorting each one, creating a new column using the
df.groupby("hourOfDay").cumcount()
method, and finally merging the two dataframes. But this approach seems far too convoluted, so I was wondering if there is a simpler way of achieving the same thing.
Try:
>>> df['index'] = df['hourOfDay'].ne(df['hourOfDay'].shift()).cumsum()
>>> df
   hourOfDay category  index
0          5    pageA      1
1          5    pageB      1
2          8    pageA      2
3          8    pageB      2
4         13    pageA      3
5         13    pageB      3
Use ne and shift to flag each row whose value differs from the previous row's value, then use cumsum to add up the resulting True and False values (counted as 1 and 0). The id therefore increases by one at the start of each new group, whatever the group size, as long as the rows of each group are adjacent.
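As a rough sketch of how the shift-and-cumsum idea behaves when groups are not all the same size, the snippet below compares each row with the row above using ne and shift(). The data are made up for illustration (including the extra pageC category) and are assumed to be already ordered so that rows of the same hourOfDay are adjacent.

```python
import pandas as pd

# Hypothetical data: groups of uneven size, rows of each hour adjacent.
df = pd.DataFrame({
    "hourOfDay": [5, 5, 5, 8, 13, 13],
    "category": ["pageA", "pageB", "pageC", "pageA", "pageA", "pageB"],
})

# True wherever the value differs from the row above. The first row is
# compared with NaN, so it is always True and the ids start at 1.
starts_new_run = df["hourOfDay"].ne(df["hourOfDay"].shift())

# Each True bumps the running total, so every run of equal values shares one id.
df["index"] = starts_new_run.cumsum()

print(df["index"].tolist())  # [1, 1, 1, 2, 3, 3]
```

Because the comparison only looks at neighbouring rows, this relies on the ordering assumption stated above; it does not group rows that are far apart.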
If you need the same index for every row with the same hourOfDay, use GroupBy.ngroup:
df['index'] = df.groupby('hourOfDay', sort=True).ngroup() + 1
Alternatively, sort by hourOfDay and use pd.factorize:
df = df.sort_values('hourOfDay')
df['index'] = pd.factorize(df['hourOfDay'])[0] + 1
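To illustrate why the factorize variant sorts first, here is a small sketch on deliberately unsorted, made-up data. It suggests that ngroup with sort=True numbers the groups by sorted key, whereas pd.factorize numbers values in order of first appearance; the variable names by_key and by_first_seen are only for this example.

```python
import pandas as pd

# Hypothetical, deliberately unsorted hours.
df = pd.DataFrame({"hourOfDay": [13, 5, 8, 5, 13, 8]})

# ngroup with sort=True: ids follow the sorted group keys (5 -> 1, 8 -> 2, 13 -> 3).
by_key = df.groupby("hourOfDay", sort=True).ngroup() + 1

# factorize: ids follow the order of first appearance (13 -> 1, 5 -> 2, 8 -> 3).
by_first_seen = pd.factorize(df["hourOfDay"])[0] + 1

print(by_key.tolist())         # [3, 1, 2, 1, 3, 2]
print(by_first_seen.tolist())  # [1, 2, 3, 2, 1, 3]
```

On data that is already sorted by hourOfDay, the two give the same ids.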
Use diff and cumsum:
df['index'] = df['hourOfDay'].diff().ne(0).cumsum()
print(df)
# Output:
   hourOfDay category  index
0          5    pageA      1
1          5    pageB      1
2          8    pageA      2
3          8    pageB      2
4         13    pageA      3
5         13    pageB      3
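One caveat worth sketching: diff and cumsum number consecutive runs, not groups. In the made-up data below, hour 5 reappears after hour 8, so the run-based id starts a new number for it, while a groupby-based id reuses the earlier one. The column names run_id and group_id are chosen only for this illustration.

```python
import pandas as pd

# Hypothetical data where hour 5 appears again after hour 8.
df = pd.DataFrame({"hourOfDay": [5, 5, 8, 5]})

# Run-based id: increases whenever the value changes from the row above.
df["run_id"] = df["hourOfDay"].diff().ne(0).cumsum()

# Group-based id: same number for the same hour, wherever it appears.
df["group_id"] = df.groupby("hourOfDay", sort=True).ngroup() + 1

print(df["run_id"].tolist())    # [1, 1, 2, 3]
print(df["group_id"].tolist())  # [1, 1, 2, 1]
```

If the dataframe is sorted by hourOfDay first, the two columns agree. Note also that diff needs a numeric (or otherwise subtractable) column.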