
Efficient way to find the max displacement between a repeating integer in a pandas DataFrame

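I want to find, efficiently, the maximum difference between two consecutive occurrences of the same integer. I could try a loop, but my dataset has over 100,000 rows, which makes that very cumbersome. Does anyone have any suggestions?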

import numpy as np
import pandas as pd
data = np.random.randint(5,30,size=100000)
df = pd.DataFrame(data, columns=['random_numbers'])

Example: in my sample, the maximum difference between consecutive occurrences of 5 is 29 - 5 = 24

df.loc[79:93].values
array([[ 5],
       [17],
       [ 7],
       [15],
       [25],
       [23],
       [24],
       [22],
       [21],
       [29],
       [25],
       [28],
       [13],
       [19],
       [ 5]])

You can try this:

g = df['random_numbers'].eq(5).cumsum()
df.groupby(g).max() - 5

Output with a smaller dataset:

data = np.random.randint(5,30,size=30)
# array([28, 19, 29, 22, 10, 18, 13, 14, 25, 24, 21, 24, 10, 20, 20,  5, 23,
#         8, 29, 22, 24, 24, 24, 19, 12,  5,  6, 14,  5, 15])

df = pd.DataFrame(data, columns=['rand_nums'])
g = df['rand_nums'].eq(5).cumsum()

# Look at both df and g
# print(pd.concat([df, g], axis=1))  # just for explanation

    rand_nums  rand_nums
0          28          0  ⟶ group 1 starts here
1          19          0
2          29          0
3          22          0
4          10          0
5          18          0
6          13          0
7          14          0  # we take max from here i.e. 29.
8          25          0
9          24          0
10         21          0
11         24          0
12         10          0
13         20          0
14         20          0 ⟶ group1 ends here
15          5          1 ⟶ group2 starts here
16         23          1
17          8          1
18         29          1
19         22          1
20         24          1 # take max from here i.e 29
21         24          1
22         24          1
23         19          1
24         12          1 ⟶ group2 ends here.
25          5          2 ⟶ grp 3 starts here.
26          6          2 # take max from here i.e. 14
27         14          2 ⟶ grp 3 ends here.
28          5          3 ⟶ grp4 starts here. # take max from here i.e. 15
29         15          3 ⟶ grp4 ends here.

This gives us:

df.groupby(g).max() - 5

           rand_nums
rand_nums           
0                 24
1                 24
2                  9
3                 10
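
Note that group 0 (the rows before the first 5) and the last group (the rows after the final 5) are not bounded by a 5 on both sides. The following is a minimal sketch of one way to keep only the stretches that lie between two occurrences, using the same small array as above. The helper name `max_between_occurrences` is hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

def max_between_occurrences(series, value):
    """Hypothetical helper: for each stretch that starts at one occurrence of
    `value` and is closed by the next one, return (max of the stretch) - value."""
    is_value = series.eq(value)
    g = is_value.cumsum()              # group label grows by 1 at every occurrence
    n_hits = int(is_value.sum())       # total number of occurrences
    # Label 0 precedes the first occurrence and label n_hits follows the last
    # one, so neither is enclosed by two occurrences; keep labels 1..n_hits-1.
    closed = (g >= 1) & (g < n_hits)
    return series[closed].groupby(g[closed]).max() - value

data = np.array([28, 19, 29, 22, 10, 18, 13, 14, 25, 24, 21, 24, 10, 20, 20, 5, 23,
                 8, 29, 22, 24, 24, 24, 19, 12, 5, 6, 14, 5, 15])
s = pd.Series(data, name='rand_nums')
result = max_between_occurrences(s, 5)
print(result)
```

For this array only groups 1 and 2 remain, with values 24 and 9, so `result.max()` gives the overall maximum displacement.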

df.loc[79:93].max() - df.loc[79:93].min()

Edit:

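index_integer = df.index[df['random_numbers'] == 5]  # change 5 for your number
max_displ = []
for i in range(len(index_integer) - 1):
    # rows from one occurrence of the number up to and including the next one
    window = df.loc[index_integer[i]:index_integer[i + 1], 'random_numbers']
    max_displ.append(window.max() - window.min())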

Using a list comprehension:

index_integer = df.index[df['random_numbers'] == 5]  # change 5 for your number
max_displ = [df.loc[a:b, 'random_numbers'].max() - df.loc[a:b, 'random_numbers'].min()
             for a, b in zip(index_integer[:-1], index_integer[1:])]
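
With more than 100,000 rows, the Python-level iteration above can be avoided altogether. The sketch below swaps in a different technique, NumPy's `ufunc.reduceat`, which reduces over the slices between consecutive start positions. It is not taken from the answers above, and the function name `max_displacements` is hypothetical.

```python
import numpy as np

def max_displacements(values, target):
    """Hypothetical helper: max - min over each stretch between two
    consecutive occurrences of `target`, computed without a Python loop."""
    values = np.asarray(values)
    starts = np.flatnonzero(values == target)   # positions of every occurrence
    if len(starts) < 2:
        # fewer than two occurrences means there is no enclosed stretch
        return np.array([], dtype=values.dtype)
    # reduceat reduces values[starts[i]:starts[i+1]] for each i. The final
    # entry covers values[starts[-1]:], which is not closed by another
    # occurrence, so it is dropped.
    seg_max = np.maximum.reduceat(values, starts)[:-1]
    seg_min = np.minimum.reduceat(values, starts)[:-1]
    return seg_max - seg_min

data = np.array([28, 19, 29, 22, 10, 18, 13, 14, 25, 24, 21, 24, 10, 20, 20, 5, 23,
                 8, 29, 22, 24, 24, 24, 19, 12, 5, 6, 14, 5, 15])
displacements = max_displacements(data, 5)
print(displacements)
```

Each slice contains its opening occurrence but not the closing one. Since the closing value equals the opening value, the maximum and minimum match those of the inclusive `df.loc` windows. On a DataFrame this could be called as `max_displacements(df['random_numbers'].to_numpy(), 5)`.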

