Optimizing Python code for faster processing

Question

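I need some help/advice/guidance on how to optimize my code. The code works, but because of the huge amount of data it has already been running for almost a day. My data has about 2 million rows; a sample of a few thousand rows runs fine. My sample data looks like this: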

index   A    B
0   0.163   0.181
1   0.895   0.093
2   0.947   0.545
3   0.435   0.307
4   0.021   0.152
5   0.486   0.977
6   0.291   0.244
7   0.128   0.946
8   0.366   0.521
9   0.385   0.137
10  0.950   0.164
11  0.073   0.541
12  0.917   0.711
13  0.504   0.754
14  0.623   0.235
15  0.845   0.150
16  0.847   0.336
17  0.009   0.940
18  0.328   0.302

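What I want to do: given the dataset above, I want to bucket/bin each row into a bin based on its values of A and B. Each index can be in only one bin. To do this, I discretize A and B from 0 to 1 in steps of 0.1. My bins for A look like this:

listA = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

B is similar:

listB = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

So I have 10 * 10 = 100 bins in total: bin 1 = (A, B) = (0, 0), bin 2 = (0, 0.1), bin 3 = (0, 0.2), ... bin 10 = (0, 1), bin 11 = (0.1, 0), ... bin 20 = (0.1, 1), ... bin 100 = (1, 1). Then, for each index, I check which bin it falls into by running a for loop like the one below: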

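df_output = pd.DataFrame()
for index in df.index:
  row = df.loc[[index]]  # the single row being checked (assumed intent)
  sumlist = []
  for A in listA[:-1]:  # lower bin edges 0, 0.1, ..., 0.9
    for B in listB[:-1]:
      # the row is in this bin if A < value < A + 0.1, and likewise for B
      filt_data = row[(row['A'] > A) & (row['A'] < A + 0.1) & (row['B'] > B) & (row['B'] < B + 0.1)]
      data_len = len(filt_data)
      sumlist.append(data_len)  # append works in place and returns None
  df_sumlist = pd.DataFrame([sumlist], index=[index])
  df_output = pd.concat([df_output, df_sumlist], axis=0)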
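
The bin numbering described above can also be written as plain arithmetic. The following is only a sketch: the helper name bin_number is hypothetical, and it assumes values in [0, 1], left-closed bins of width 0.1, and bins numbered 1 to 100 with B varying fastest.

```python
def bin_number(a: float, b: float, n: int = 10) -> int:
    """Return a 1-based bin number for (a, b), both assumed to lie in [0, 1]."""
    # Index of the 0.1-wide interval each value falls in;
    # min() keeps a value of exactly 1.0 inside the last interval.
    ia = min(int(a * n), n - 1)
    ib = min(int(b * n), n - 1)
    # B varies fastest: bins 1..10 share the first A interval,
    # bins 11..20 the second, and so on.
    return ia * n + ib + 1


print(bin_number(0.163, 0.181))  # 12
print(bin_number(0.947, 0.545))  # 96
```

Under these assumptions, index 0 of the sample data (0.163, 0.181) lands in bin 12 and index 2 (0.947, 0.545) in bin 96.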

I tried using the pandas cut function for binning, but it seems to work on only one column.

Expected output

index   A         B    bin1   bin2 bin3 bin4 bin5 ...bin 23.. bin100    
    0   0.163   0.181   0      0     0   0    0           1     0
    1   0.895   0.093
    2   0.947   0.545
    3   0.435   0.307
    4   0.021   0.152
    5   0.486   0.977
    6   0.291   0.244
    7   0.128   0.946
    8   0.366   0.521
    9   0.385   0.137
    10  0.950   0.164
    11  0.073   0.541
    12  0.917   0.711
    13  0.504   0.754
    14  0.623   0.235
    15  0.845   0.150
    16  0.847   0.336
    17  0.009   0.940
    18  0.328   0.302

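I do care about the other bins even when they are zero. For example, index 0 might fall in bin 23, so for index 0 I will have a 1 in bin 23 and 0 in all other 99 bins. Similarly, index 1 might fall in bin 91, so I expect a 1 in bin 91 and 0 in all other bins for that index.

Thank you for taking the time to read this and help me. If I have missed anything or need to clarify something, please let me know.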

Answer 1

You are on the right track! pd.cut is the way to go. I am using the category codes of each binned Series to create your final bins:

import pandas as pd
import numpy as np

# Generate sample df
df = pd.DataFrame({'A': np.random.uniform(size=20), 'B': np.random.uniform(size=20)})

# Create bins for each column
df["bin_A"] = pd.cut(df["A"], bins=np.linspace(0, 1, 11))
df["bin_B"] = pd.cut(df["B"], bins=np.linspace(0, 1, 11))

# Create a combined bin using category codes for each binned column
df["combined_bin"] = df["bin_A"].cat.codes * 10 + df["bin_B"].cat.codes
df["combined_bin"] = pd.Categorical(df["combined_bin"], categories=range(100))

# Loop over categories to create new columns
for i in df["combined_bin"].cat.categories:
    df[f"bin_{i}"] = (df["combined_bin"] == i).astype(int)
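
With roughly 2 million rows, adding 100 columns one at a time in a loop may be slower than necessary, and pandas may warn about a fragmented frame. A possible alternative, sketched here under the assumption that every value lies in [0, 1], is to fill a NumPy array by integer indexing and attach all the columns in a single concat. The include_lowest=True argument is an addition so that a value of exactly 0 is not left unbinned; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Small random frame standing in for the real data (assumed to lie in [0, 1])
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.uniform(size=20), "B": rng.uniform(size=20)})

edges = np.linspace(0, 1, 11)

# Integer interval index (0..9) of each value in each column
code_a = pd.cut(df["A"], bins=edges, include_lowest=True).cat.codes.to_numpy().astype(np.intp)
code_b = pd.cut(df["B"], bins=edges, include_lowest=True).cat.codes.to_numpy().astype(np.intp)

# Combined bin index 0..99, with B varying fastest
combined = code_a * 10 + code_b

# One-hot matrix: set one cell per row to 1 using integer array indexing
onehot = np.zeros((len(df), 100), dtype=np.int8)
onehot[np.arange(len(df)), combined] = 1

# Attach all 100 columns at once instead of one by one
bins = pd.DataFrame(onehot, index=df.index, columns=[f"bin_{i}" for i in range(100)])
result = pd.concat([df, bins], axis=1)
```

At int8, a 2,000,000 x 100 matrix takes about 200 MB, so this only makes sense if that fits in memory.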

Answer 2

You can probably use cut on each column and then combine the results to find the category of each row:

acat = pd.cut(df['A'], [.1*i for i in range(11)],
       labels = range(10), include_lowest=True)
bcat = pd.cut(df['B'], [.1*i for i in range(11)],
       labels = range(10), include_lowest=True)
cat = 1 + bcat.cat.codes + acat.cat.codes * 10

With your sample data, I get

0     12
1     81
2     96
3     44
4      2
5     50
6     23
7     20
8     36
9     32
10    92
11     6
12    98
13    58
14    63
15    82
16    84
17    10
18    34
dtype: int8

get_dummies and reindex will then give the wide columns:

w = pd.get_dummies(cat).reindex(columns=list(range(1,101))).fillna(0).astype('int8')

We just need to concatenate it to the original dataframe:

pd.concat([df, w], axis=1)

which gives the expected result:

        index      A      B  1  2  3  4  5  6  ...  92  93  94  95  96  97  98  99  100
0       0  0.163  0.181  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
1       1  0.895  0.093  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
2       2  0.947  0.545  0  0  0  0  0  0  ...   0   0   0   0   1   0   0   0    0
3       3  0.435  0.307  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
4       4  0.021  0.152  0  1  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
5       5  0.486  0.977  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
6       6  0.291  0.244  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
7       7  0.128  0.946  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
8       8  0.366  0.521  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
9       9  0.385  0.137  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
10     10  0.950  0.164  0  0  0  0  0  0  ...   1   0   0   0   0   0   0   0    0
11     11  0.073  0.541  0  0  0  0  0  1  ...   0   0   0   0   0   0   0   0    0
12     12  0.917  0.711  0  0  0  0  0  0  ...   0   0   0   0   0   0   1   0    0
13     13  0.504  0.754  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
14     14  0.623  0.235  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
15     15  0.845  0.150  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
16     16  0.847  0.336  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
17     17  0.009  0.940  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
18     18  0.328  0.302  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
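
As a quick sanity check on the approach above, the following sketch rebuilds the wide frame for the first five rows of the sample data and confirms that every row lands in exactly one bin. It passes fill_value=0 to reindex in place of the separate fillna call, which is a small variation on the code above and not part of it.

```python
import pandas as pd

# First five rows of the sample data from the question
df = pd.DataFrame({
    "A": [0.163, 0.895, 0.947, 0.435, 0.021],
    "B": [0.181, 0.093, 0.545, 0.307, 0.152],
})

edges = [0.1 * i for i in range(11)]
acat = pd.cut(df["A"], edges, labels=range(10), include_lowest=True)
bcat = pd.cut(df["B"], edges, labels=range(10), include_lowest=True)

# 1-based combined bin number, with B varying fastest
cat = 1 + bcat.cat.codes + acat.cat.codes * 10

# One column per bin; bins that never occur are filled with 0
w = (pd.get_dummies(cat)
       .reindex(columns=range(1, 101), fill_value=0)
       .astype("int8"))
wide = pd.concat([df, w], axis=1)

print(cat.tolist())                 # [12, 81, 96, 44, 2]
print((w.sum(axis=1) == 1).all())   # True
```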

Answer 1
0 2022-02-01 16:02:35

Answer 2
0 2022-02-01 16:24:28
