More efficient way to assign group for each row in pandas

I have a data frame with more than 1000 columns and a predefined list of groups. I would like to compare each cell value against each group's boundaries and create a new column holding the matching group name. I have written for loops, but they take more than 5 minutes to run. Is there a more efficient way to achieve this? Thanks.

Here is my data frame:

Frequency
21.0
18.0    
16.0    
10.0
10.0    
9.0    
10.0    
10.0      
5.0       
8.0 

And my predefined group list:

> groups    
[(3, 5), (6, 10), (11, 30)]

What I would like to get is:

Frequency   Group
21.0        11-30
18.0        11-30
16.0        11-30
10.0        6-10
10.0        6-10
9.0         6-10
10.0        6-10
10.0        6-10
5.0         3-5
8.0         6-10

Here is my code:

for i in range(0, len(fre_table["Frequency"])):
    for j in range(0, len(groups)):
        if fre_table["Frequency"][i] >= groups[j][0] and fre_table["Frequency"][i] <= groups[j][1]:
            break
    fre_table['Group'][i] = "{}-{}".format(groups[j][0], groups[j][1])

Establishing the efficiency of the solution suggested by @BallpointBen in the comment section.

Data:

import numpy as np
import pandas as pd

fre_table = pd.DataFrame({'Index':[0,1,2,3,4,5,6,7,8,9],
             'Frequency':[21.0, 18.0, 16.0, 10.0, 10.0, 9.0, 10.0, 10.0, 5.0, 8.0]})
groups = [(3, 5), (6, 10), (11, 30)]

Time taken by the initial solution: 0.5420 s

import timeit
start_time = timeit.default_timer()
fre_table['Group'] = 0
for i in range(0, len(fre_table["Frequency"])):
    for j in range(0, len(groups)):
        if fre_table["Frequency"][i] >= groups[j][0] and fre_table["Frequency"][i] <= groups[j][1]:
            break
    fre_table['Group'][i] = "{}-{}".format(groups[j][0], groups[j][1])
elapsed_time = timeit.default_timer() - start_time

Time taken by the final solution: 0.0043 s

import timeit
start_time = timeit.default_timer()
bins = pd.IntervalIndex.from_tuples(groups)
fre_table['Group'] = pd.cut(fre_table['Frequency'], bins)
elapsed_time = timeit.default_timer() - start_time

About 100 times faster!
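
One caveat about the pd.cut call: with an IntervalIndex built by from_tuples, the default intervals are right-closed, so the Group column holds Interval objects such as (11, 30] rather than the "11-30" strings asked for in the question, and a value equal to a lower bound (for example 3) would not match any bin. A minimal sketch of one way to get the string labels, assuming both bounds should be inclusive as in the original loop (the names labels and binned are illustrative, not from the original answer):

```python
import pandas as pd

fre_table = pd.DataFrame(
    {"Frequency": [21.0, 18.0, 16.0, 10.0, 10.0, 9.0, 10.0, 10.0, 5.0, 8.0]}
)
groups = [(3, 5), (6, 10), (11, 30)]

# closed="both" makes each interval include both of its bounds,
# matching the >= / <= comparisons in the original loop.
bins = pd.IntervalIndex.from_tuples(groups, closed="both")

# One "low-high" label per group, in the same order as the bins.
labels = [f"{low}-{high}" for low, high in groups]

# pd.cut yields a categorical of Interval objects; renaming the
# categories swaps each interval for its string label.
binned = pd.cut(fre_table["Frequency"], bins)
fre_table["Group"] = binned.cat.rename_categories(labels)

print(fre_table)
```

Values that fall outside every interval (5.5, say) come out as NaN here, whereas the loop version silently assigns them the last group, so the two approaches differ on such inputs.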
