Calculate min and max value of a transition with index of first occurrence in pandas

I have a DataFrame:

import pandas as pd

df = pd.DataFrame({'ID':['a','b','d','d','a','b','c','b','d','a','b','a'], 
                   'sec':[3,6,2,0,4,7,10,19,40,3,1,2]})
print(df)
   ID  sec
0   a    3
1   b    6
2   d    2
3   d    0
4   a    4
5   b    7
6   c   10
7   b   19
8   d   40
9   a    3
10  b    1
11  a    2

I want to count how many times each transition occurs. In the ID column, a->b counts as a transition, and likewise b->d, d->d, d->a, b->c, c->b and b->a. I can do this with Counter:

from collections import Counter

Counter(zip(df['ID'].to_list(),df['ID'].to_list()[1:]))
Counter({('a', 'b'): 3,
         ('b', 'd'): 2,
         ('d', 'd'): 1,
         ('d', 'a'): 2,
         ('b', 'c'): 1,
         ('c', 'b'): 1,
         ('b', 'a'): 1})

I also need the min and max of the sec column for each transition. For example, a->b occurs 3 times; its min sec value is 1 and its max sec value is 7. I also want to know where each transition first occurred; for a->b that is index 0. For the transition_index column I use the first element of the transition (the index of a), and for min and max I use the second element (the value at b).

Here is the final output I want to get:

df = pd.DataFrame({'ID_1':['a','b','d','d','b','c','b'], 
                   'ID_2':['b','d','d','a','c','b','a'],
                   'sec_min':[1,2,0,3,10,19,2],
                   'sec_max':[7,40,0,4,10,19,2],
                   'transition_index':[0,1,2,3,5,6,10],
                   'count':[3,2,1,2,1,1,1]})
print(df)
  ID_1 ID_2  sec_min  sec_max  transition_index  count
0    a    b        1        7                 0      3
1    b    d        2       40                 1      2
2    d    d        0        0                 2      1
3    d    a        3        4                 3      2
4    b    c       10       10                 5      1
5    c    b       19       19                 6      1
6    b    a        2        2                10      1

How can I achieve this in Python?

I also have a huge amount of data, so I'm looking for the fastest way possible.

You have transitions of the form from -> to. 'transition_index' is based on the index of the "from" row, while the 'sec' aggregations are based on the value associated with the "to" row.
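To make the from/to alignment concrete, here is a minimal sketch for the single transition a->b. The names `prev`, `mask` and `first_index` are introduced here for illustration only, and the "minus one" step assumes the default 0..n-1 index from the question.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['a','b','d','d','a','b','c','b','d','a','b','a'],
                   'sec': [3,6,2,0,4,7,10,19,40,3,1,2]})

# ID of the previous row; the first row has no predecessor (NaN).
prev = df['ID'].shift()

# True on the "to" rows of every a->b transition.
mask = (prev == 'a') & (df['ID'] == 'b')

count = int(mask.sum())
# min/max are taken from the "to" rows.
sec_min = int(df.loc[mask, 'sec'].min())
sec_max = int(df.loc[mask, 'sec'].max())
# The "from" row sits one position before the first "to" row
# (assumes the default RangeIndex).
first_index = int(df.index[mask][0]) - 1

print(count, sec_min, sec_max, first_index)
```

The same idea, applied to all transitions at once, is what a groupby on the shifted and unshifted ID achieves.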

We can shift the index, then group on the shifted ID and the ID, which lets a single groupby with named aggregations produce the desired output.


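# Move the original index into a column, then shift it down one row so each
# row carries the index of its "from" row (Int64 keeps the missing first value).
df = df.reset_index()
df['index'] = df['index'].shift().astype('Int64')

# Group on (previous ID, current ID); the first row has no previous ID and is dropped.
(df.groupby([df['ID'].shift(1).rename('ID_1'), df['ID'].rename('ID_2')], sort=False)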
   .agg(sec_min=('sec', 'min'),
        sec_max=('sec', 'max'),
        transition_index=('index', 'first'),
        count=('sec', 'size'))
   .reset_index()
)

  ID_1 ID_2  sec_min  sec_max  transition_index  count
0    a    b        1        7                 0      3
1    b    d        2       40                 1      2
2    d    d        0        0                 2      1
3    d    a        3        4                 3      2
4    b    c       10       10                 5      1
5    c    b       19       19                 6      1
6    b    a        2        2                10      1
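As a sanity check, the counts from the groupby can be compared with the Counter approach from the question. The sketch below is a variant that builds temporary Series instead of modifying df; the names `from_id`, `to_id`, `from_index`, `result`, `expected` and `got` are assumptions made for this example.

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({'ID': ['a','b','d','d','a','b','c','b','d','a','b','a'],
                   'sec': [3,6,2,0,4,7,10,19,40,3,1,2]})

# "from" ID, "to" ID and the index of the "from" row, aligned on the "to" row.
from_id = df['ID'].shift().rename('ID_1')
to_id = df['ID'].rename('ID_2')
from_index = (df.index.to_series().shift()
                .astype('Int64').rename('transition_index'))

result = (pd.concat([from_id, to_id, from_index, df['sec']], axis=1)
            .dropna(subset=['ID_1'])          # first row has no "from"
            .groupby(['ID_1', 'ID_2'], sort=False)
            .agg(sec_min=('sec', 'min'),
                 sec_max=('sec', 'max'),
                 transition_index=('transition_index', 'first'),
                 count=('sec', 'size'))
            .reset_index())

# Cross-check the counts against the Counter from the question.
ids = df['ID'].to_list()
expected = Counter(zip(ids, ids[1:]))
got = {(a, b): int(n)
       for a, b, n in zip(result['ID_1'], result['ID_2'], result['count'])}
assert got == dict(expected)
```

If the assertion passes, both approaches agree on which transitions exist and how often each occurs.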

Start by adding columns with the previous values of ID and sec:

df['prevID']  = df.ID.shift(fill_value='')
df['prevSec'] = df.sec.shift(fill_value=0)

Then define the following function:

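def find(df, IDfrom, IDto):
    # Rows on which a transition IDfrom -> IDto ends
    rows = df.query('prevID == @IDfrom and ID == @IDto')
    # Previous and current sec values of those rows
    tbl = rows.loc[:, ['prevSec', 'sec']].values
    n = rows.index.size
    # Fall back to (n, 0, 0) when the transition never occurs
    return (n, tbl.min(), tbl.max()) if n > 0 else (n, 0, 0)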

Now if you run this function, e.g. to find transitions from a to b:

find(df, 'a', 'b')

you will get:

(3, 1, 7)

Then call this function for all other from and to values.
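One possible way to do that, sketched here under the assumption that only transitions actually present in the data are of interest, is to loop over the distinct (prevID, ID) pairs. The names `pairs`, `rows` and `summary` are hypothetical. Note that `find` takes the min and max over both `prevSec` and `sec`, so some values differ from the output requested in the question (d->a gives 0 and 40 rather than 3 and 4).

```python
import pandas as pd

df = pd.DataFrame({'ID': ['a','b','d','d','a','b','c','b','d','a','b','a'],
                   'sec': [3,6,2,0,4,7,10,19,40,3,1,2]})
df['prevID'] = df.ID.shift(fill_value='')
df['prevSec'] = df.sec.shift(fill_value=0)

def find(df, IDfrom, IDto):
    rows = df.query('prevID == @IDfrom and ID == @IDto')
    tbl = rows.loc[:, ['prevSec', 'sec']].values
    n = rows.index.size
    return (n, tbl.min(), tbl.max()) if n > 0 else (n, 0, 0)

# Distinct transitions in order of first appearance; the first row is
# skipped because it has no real predecessor.
pairs = df[['prevID', 'ID']].iloc[1:].drop_duplicates()

# One find() call per distinct transition.
rows = [(p, c, *find(df, p, c)) for p, c in pairs.itertuples(index=False)]
summary = pd.DataFrame(rows, columns=['ID_1', 'ID_2', 'count', 'sec_min', 'sec_max'])
print(summary)
```

Each call to `find` scans the whole frame, so for data with many distinct transitions this loop does more work than a single groupby pass.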

Note that this function returns a proper result even if there is no transition between the given values. Of course, you may choose other "surrogate" values for min and max if no transition has been found.
