Optimize Python code for faster processing

I need some help, suggestions, or guidance on how to optimize my code. The code works, but with my full data set it has been running for almost a day. My data has about 2 million rows; with sample data (a few thousand rows) it works. My sample data format is shown below:

index   A    B
0   0.163   0.181
1   0.895   0.093
2   0.947   0.545
3   0.435   0.307
4   0.021   0.152
5   0.486   0.977
6   0.291   0.244
7   0.128   0.946
8   0.366   0.521
9   0.385   0.137
10  0.950   0.164
11  0.073   0.541
12  0.917   0.711
13  0.504   0.754
14  0.623   0.235
15  0.845   0.150
16  0.847   0.336
17  0.009   0.940
18  0.328   0.302

What I want to do: given the above data set, I want to bucket/bin each row into different buckets/bins based on the values of A and B. Each index can lie in only one bin. To do this I have discretized A and B from 0 to 1 (step size of 0.1). My bins for A look like this:

listA = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

Similarly for B:

listB = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

So in total I have 10 * 10 = 100 bins. Labelling each bin by the lower edges of its A and B intervals: bin 1 = (A, B) = (0, 0), bin 2 = (0, 0.1), bin 3 = (0, 0.2), ..., bin 10 = (0, 0.9), bin 11 = (0.1, 0), ..., bin 20 = (0.1, 0.9), ..., bin 100 = (0.9, 0.9). Then, for each index, I check which bin it lies in by running the for loop shown below:

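import pandas as pd

df_output = pd.DataFrame()
for index in df.index:
  row = df.loc[[index]]  # the single row being classified
  sumlist = []
  # walk over consecutive bin edges: (0, 0.1), (0.1, 0.2), ...
  for A_lo, A_hi in zip(listA[:-1], listA[1:]):
    for B_lo, B_hi in zip(listB[:-1], listB[1:]):
      # 1 if the row falls in (A_lo, A_hi] x (B_lo, B_hi], else 0
      filt_data = row[(row['A'] > A_lo) & (row['A'] <= A_hi) & (row['B'] > B_lo) & (row['B'] <= B_hi)]
      data_len = len(filt_data)
      sumlist.append(data_len)  # list.append mutates in place and returns None
  df_sumlist = pd.DataFrame([sumlist], index=[index])
  df_output = pd.concat([df_output, df_sumlist], axis=0)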

I tried using the pandas cut function for binning, but it appears to work on only one column at a time.

Expected output:

index   A         B    bin1   bin2 bin3 bin4 bin5 ...bin 23.. bin100    
    0   0.163   0.181   0      0     0   0    0           1     0
    1   0.895   0.093
    2   0.947   0.545
    3   0.435   0.307
    4   0.021   0.152
    5   0.486   0.977
    6   0.291   0.244
    7   0.128   0.946
    8   0.366   0.521
    9   0.385   0.137
    10  0.950   0.164
    11  0.073   0.541
    12  0.917   0.711
    13  0.504   0.754
    14  0.623   0.235
    15  0.845   0.150
    16  0.847   0.336
    17  0.009   0.940
    18  0.328   0.302

I do care about the other bins even if they are zero. For example, index 0 might lie in bin 23, so for index 0 I will have 1 in bin 23 and 0 in the other 99 bins. Similarly, index 1 might lie in bin 91, so I expect 1 in bin 91 and 0 in all other bins for that index.

Thanks for taking the time to read and help me with this; I appreciate your help. Please let me know if I am missing anything or need to clarify things.

You were on the right track! pd.cut is the way to go. I'm using the category codes of the binned Series to create your final bins:

import pandas as pd
import numpy as np

# Generate sample df
df = pd.DataFrame({'A': np.random.uniform(size=20), 'B': np.random.uniform(size=20)})

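# Create bins for each column; include_lowest=True keeps a value of exactly 0
# in the first bin instead of turning it into NaN (category code -1)
df["bin_A"] = pd.cut(df["A"], bins=np.linspace(0, 1, 11), include_lowest=True)
df["bin_B"] = pd.cut(df["B"], bins=np.linspace(0, 1, 11), include_lowest=True)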

# Create a combined bin using category codes for each binned column
df["combined_bin"] = df["bin_A"].cat.codes * 10 + df["bin_B"].cat.codes
df["combined_bin"] = pd.Categorical(df["combined_bin"], categories=range(100))

# Loop over categories to create new columns
for i in df["combined_bin"].cat.categories:
    df[f"bin_{i}"] = (df["combined_bin"] == i).astype(int)
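
Adding 100 columns one at a time in a Python loop works, but on recent pandas versions it may emit a fragmentation PerformanceWarning. As a rough sketch of an alternative, the indicator columns can be built in a single NumPy comparison and concatenated once. The names `code_a`, `code_b`, `one_hot`, `bins_df` and `result` are illustrative, and the sketch assumes every value lies in [0, 1]:

```python
import numpy as np
import pandas as pd

# Sample data; the seed is only there to make the sketch reproducible
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.uniform(size=20), "B": rng.uniform(size=20)})

edges = np.linspace(0, 1, 11)
code_a = pd.cut(df["A"], bins=edges, include_lowest=True).cat.codes
code_b = pd.cut(df["B"], bins=edges, include_lowest=True).cat.codes

# Combined 0-based bin index in 0..99 (assumes no value fell outside [0, 1])
combined = code_a.to_numpy().astype(np.int64) * 10 + code_b.to_numpy()

# Broadcasting: compare an (n, 1) column against the 100 bin ids -> (n, 100)
one_hot = (combined[:, None] == np.arange(100)).astype(np.int8)

bins_df = pd.DataFrame(one_hot, index=df.index,
                       columns=[f"bin_{i}" for i in range(100)])
result = pd.concat([df, bins_df], axis=1)
```

With int8 indicators, 2 million rows need roughly 200 MB for the 100 bin columns, so keeping only the `combined` column may be preferable if the wide layout is not strictly required.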

You could probably use cut on each column and then combine the results to find the category of each row:

acat = pd.cut(df['A'], [.1*i for i in range(11)],
       labels = range(10), include_lowest=True)
bcat = pd.cut(df['B'], [.1*i for i in range(11)],
       labels = range(10), include_lowest=True)
cat = 1 + bcat.cat.codes + acat.cat.codes * 10

With your sample data, I get:

0     12
1     81
2     96
3     44
4      2
5     50
6     23
7     20
8     36
9     32
10    92
11     6
12    98
13    58
14    63
15    82
16    84
17    10
18    34
dtype: int8
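
As a cross-check, the same bin numbers can be reproduced with NumPy alone. This is a minimal sketch rather than part of the answer: `bin_number` is a hypothetical helper, and the clipping step assumes every value lies in [0, 1] (unlike pd.cut, which would return NaN for values outside the edges).

```python
import numpy as np

# Same edges as above: 0.0, 0.1, ..., 1.0
edges = np.array([0.1 * i for i in range(11)])

def bin_number(a, b, edges=edges):
    """Return 1-based combined bin numbers for arrays a and b (hypothetical helper)."""
    n = len(edges) - 1
    # side="left" places x in the right-closed interval (edges[i], edges[i+1]];
    # clipping sends x == edges[0] to the first interval, like include_lowest=True
    ia = np.clip(np.searchsorted(edges, a, side="left") - 1, 0, n - 1)
    ib = np.clip(np.searchsorted(edges, b, side="left") - 1, 0, n - 1)
    return 1 + ib + ia * n

# The sample data from the question
A = np.array([0.163, 0.895, 0.947, 0.435, 0.021, 0.486, 0.291, 0.128, 0.366, 0.385,
              0.950, 0.073, 0.917, 0.504, 0.623, 0.845, 0.847, 0.009, 0.328])
B = np.array([0.181, 0.093, 0.545, 0.307, 0.152, 0.977, 0.244, 0.946, 0.521, 0.137,
              0.164, 0.541, 0.711, 0.754, 0.235, 0.150, 0.336, 0.940, 0.302])

cat_np = bin_number(A, B)
```

For example, row 6 has A = 0.291 (third interval, code 2) and B = 0.244 (code 2), so its bin number is 1 + 2 + 2 * 10 = 23.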

get_dummies and reindex will give the wide columns:

w = pd.get_dummies(cat).reindex(columns=list(range(1,101))).fillna(0).astype('int8')

We only have to concat it to the original dataframe:

pd.concat([df, w], axis=1)

to get the expected result:

        index      A      B  1  2  3  4  5  6  ...  92  93  94  95  96  97  98  99  100
0       0  0.163  0.181  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
1       1  0.895  0.093  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
2       2  0.947  0.545  0  0  0  0  0  0  ...   0   0   0   0   1   0   0   0    0
3       3  0.435  0.307  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
4       4  0.021  0.152  0  1  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
5       5  0.486  0.977  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
6       6  0.291  0.244  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
7       7  0.128  0.946  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
8       8  0.366  0.521  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
9       9  0.385  0.137  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
10     10  0.950  0.164  0  0  0  0  0  0  ...   1   0   0   0   0   0   0   0    0
11     11  0.073  0.541  0  0  0  0  0  1  ...   0   0   0   0   0   0   0   0    0
12     12  0.917  0.711  0  0  0  0  0  0  ...   0   0   0   0   0   0   1   0    0
13     13  0.504  0.754  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
14     14  0.623  0.235  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
15     15  0.845  0.150  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
16     16  0.847  0.336  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
17     17  0.009  0.940  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
18     18  0.328  0.302  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
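
The steps above can be wrapped into one function. The following is only a sketch under stated assumptions: `add_bin_columns` is a hypothetical helper, the columns are assumed to be named A and B, and all values are assumed to lie in [0, 1]. Instead of reindex and fillna, it declares all bin numbers as categories so that get_dummies emits the empty bins as well.

```python
import numpy as np
import pandas as pd

def add_bin_columns(df, n=10):
    """Append n*n one-hot bin columns labelled 1..n*n (hypothetical helper)."""
    edges = np.linspace(0, 1, n + 1)
    acat = pd.cut(df["A"], edges, labels=range(n), include_lowest=True)
    bcat = pd.cut(df["B"], edges, labels=range(n), include_lowest=True)
    # Widen the int8 codes before combining them into a 1-based bin number
    cat = 1 + bcat.cat.codes.astype("int64") + acat.cat.codes.astype("int64") * n
    # Declaring every bin as a category makes get_dummies emit empty bins too
    cat = pd.Series(pd.Categorical(cat, categories=range(1, n * n + 1)),
                    index=df.index)
    w = pd.get_dummies(cat, dtype="int8")
    return pd.concat([df, w], axis=1)

# Three rows taken from the sample data (indices 6, 2 and 4)
sample = pd.DataFrame({"A": [0.291, 0.947, 0.021], "B": [0.244, 0.545, 0.152]})
out = add_bin_columns(sample)
```

Here the three rows land in bins 23, 96 and 2 respectively, matching the bin numbers listed earlier for those rows.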
