How to create a rank from a df with Pandas

I have a table that is chronologically sorted, with a state and an amount for each date. The table looks as follows:

Date        State  Amount
01/01/2022  1      1233.11
02/01/2022  1      16.11
03/01/2022  2      144.58
04/01/2022  1      298.22
05/01/2022  2      152.34
06/01/2022  2      552.01
07/01/2022  3      897.25

To generate the dataset:

import pandas as pd

df = pd.DataFrame({
    'date': ["01/08/2022", "02/08/2022", "03/08/2022", "04/08/2022", "05/08/2022", "06/08/2022",
             "07/08/2022", "08/08/2022", "09/08/2022", "10/08/2022", "11/08/2022"],
    'state': [1, 1, 2, 2, 3, 1, 1, 2, 2, 2, 1],
    'amount': [144, 142, 166, 144, 142, 166, 144, 142, 166, 142, 166],
})

I want to add a column called rank that increases when the state changes. So if state 1 appears twenty times in a row, it is just rank 1. If state 2 then appears and state 1 appears again afterwards, the rank is increased. That is, if State is 1 for two days in a row, Rank is 1. Then another state appears. When State 1 appears again, Rank increments to 2.

In other words, "Rank" is a per-state counter: its value increments each time a given state appears again after an interruption, so it counts how many separate consecutive runs of that state have occurred so far. An example would be as follows:

Date        State  Amount   Rank
01/01/2022  1      1233.11  1
02/01/2022  1      16.11    1
03/01/2022  2      144.58   1
04/01/2022  1      298.22   2
05/01/2022  2      152.34   2
06/01/2022  2      552.01   2
07/01/2022  3      897.25   1

This could also be understood as follows:

Date        State  Amount   Rank_State1  Rank_State2  Rank_State3
01/01/2022  1      1233.11  1
02/01/2022  1      16.11    1
03/01/2022  2      144.58                1
04/01/2022  1      298.22   2
05/01/2022  2      152.34                2
06/01/2022  2      552.01                2
07/01/2022  3      897.25                             1

Does anyone know how to build that Rank column starting from the previous table?

Your problem is in the general category of state change accumulation, which suggests an approach using cumulative sums and booleans.
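To make the "cumulative sums and booleans" idea concrete, here is a minimal sketch of the basic building block, assuming the states are held in a pandas Series (the names `states`, `is_change` and `run_id` are mine, chosen for illustration). It only labels each run of identical consecutive states with a running number; it is not yet the per-state Rank asked for.

```python
import pandas as pd

# The 'state' column from the question's dataset.
states = pd.Series([1, 1, 2, 2, 3, 1, 1, 2, 2, 2, 1])

# True wherever a row's state differs from the previous row's state.
# shift() leaves NaN in the first position, so the first row counts as a change.
is_change = states.ne(states.shift())

# Summing the booleans cumulatively numbers the consecutive runs 1, 2, 3, ...
run_id = is_change.cumsum()

print(run_id.tolist())
```

Every row inside the same uninterrupted run gets the same `run_id`, which is what makes a cumulative sum over a change flag a convenient starting point for this kind of problem.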

Here's one way you can do it - maybe not the most elegant, but I think it does what you need:

import pandas as pd
someDF = pd.DataFrame({'date': ["01/08/2022","02/08/2022","03/08/2022","04/08/2022","05/08/2022","06/08/2022","07/08/2022","08/08/2022","09/08/2022","10/08/2022","11/08/2022"], 'state' : [1,1,2,2,3,1,1,2,2,2,1],'amount': [144,142,166,144,142,166,144,142,166,142,166]})

someDF["StateAccumulator"] = someDF["state"].apply(str).cumsum()

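def groupOccurrence(someRow):
    # sa is the concatenation of every state seen up to and including this row.
    # Note: this string trick assumes each state is a single character.
    sa = someRow["StateAccumulator"]
    s = str(someRow["state"])
    # Splitting on the current state leaves the runs of *other* states as tokens;
    # counting them, plus one if the history starts with this state, gives the rank.
    stateRank = len("".join([i if i != '' else " " for i in sa.split(s)]).split())\
                    + int((sa.split(s)[0] == '') or (int(sa.split(s)[-1] == '')) and sa[-1] != s)
    return stateRank


someDF["Rank"] = someDF.apply(groupOccurrence, axis=1)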

If I understand correctly, this is the result you want - "Rank" is intended to represent the number of times a given set of contiguous states has appeared:

          date  state  amount StateAccumulator  Rank
0   01/08/2022      1     144                1     1
1   02/08/2022      1     142               11     1
2   03/08/2022      2     166              112     1
3   04/08/2022      2     144             1122     1
4   05/08/2022      3     142            11223     1
5   06/08/2022      1     166           112231     2
6   07/08/2022      1     144          1122311     2
7   08/08/2022      2     142         11223112     2
8   09/08/2022      2     166        112231122     2
9   10/08/2022      2     142       1122311222     2
10  11/08/2022      1     166      11223112221     3

Notes:

  • instead of the somewhat hacky string cumsum method I'm using here, you could probably use a list accumulation function and then use a pandas split-apply-combine method to do the counting in the lambda function
  • you would then apply a state change boolean, and do a cumsum on the state change boolean, filtered/grouped on the state value (so, how many state changes do we have for any given state)
  • the state change boolean is computed like this: someDF["StateChange"] = someDF["state"] != someDF["state"].shift()
  • so for a given state at a given row, you'd count how many changes into that state had occurred up to and including that row.
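The approach outlined in these notes can be sketched roughly as follows. This is one possible reading of the notes rather than the answer's own code: it uses `groupby(...).cumsum()` as the split-apply-combine step, and the column names `StateChange` and `Rank` follow the answer above.

```python
import pandas as pd

someDF = pd.DataFrame({
    "date": ["01/08/2022", "02/08/2022", "03/08/2022", "04/08/2022",
             "05/08/2022", "06/08/2022", "07/08/2022", "08/08/2022",
             "09/08/2022", "10/08/2022", "11/08/2022"],
    "state": [1, 1, 2, 2, 3, 1, 1, 2, 2, 2, 1],
    "amount": [144, 142, 166, 144, 142, 166, 144, 142, 166, 142, 166],
})

# True on the first row of every run of identical consecutive states.
# shift() yields NaN for row 0, so the first row always counts as a change.
someDF["StateChange"] = someDF["state"] != someDF["state"].shift()

# Group the change flags by state and accumulate them: for each row this
# counts how many runs of that row's state have started so far.
someDF["Rank"] = (
    someDF["StateChange"].astype(int).groupby(someDF["state"]).cumsum()
)

print(someDF[["date", "state", "Rank"]])
```

On the sample data this gives the same Rank values as the string-based version (1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3), and it does not depend on states being single characters.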
