How to optimize the following algorithm that iterates over a DataFrame of a few million rows?

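I have the following algorithm that iterates over a DataFrame with several million rows. It takes a very long time to finish. Do you have any suggestions?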

def k_nn_averaging(df: pd.DataFrame, k: int = 15, use_abs_value: bool = False) -> pd.DataFrame:
    df_averaged = df.copy()
    df[helper.modifiable_columns] = df[helper.modifiable_columns].astype(float)
    df_averaged[helper.modifiable_columns] = df_averaged[helper.modifiable_columns].astype(float)
    for i in range(0, df.shape[0]):
        neighbours = list(range(i-k if i-k >= 0 else 0, i+k if i+k <= df_averaged.shape[0] else df_averaged.shape[0]))
        neighbours.remove(i)
        selectedNeighbourIndex = choice(neighbours)
        factor = uniform(0,1)
        currentSampleValues = df[helper.modifiable_columns].iloc[i]
        neighbourSampleValues = df[helper.modifiable_columns].iloc[selectedNeighbourIndex]
        average = 0
        if not use_abs_value: average = factor*(currentSampleValues - neighbourSampleValues)
        else: average = factor*(abs(currentSampleValues - neighbourSampleValues)) 
        df_averaged.loc[i,helper.modifiable_columns] = currentSampleValues + average
    return df_averaged

The first thing to try is always to vectorize the loop. Below is a modified version of the code that avoids the Python loop and uses NumPy operations instead:

import numpy as np
import pandas as pd

def k_nn_averaging(df: pd.DataFrame, k: int = 15, use_abs_value: bool = False) -> pd.DataFrame:
    # helper.modifiable_columns comes from the asker's own module, as in the question
    modifiable_columns = helper.modifiable_columns
    values = df[modifiable_columns].to_numpy(dtype=float)
    num_rows = values.shape[0]
    row_idx = np.arange(num_rows)
    rng = np.random.default_rng()

    # neighbour window for each row, same bounds as range(i-k, i+k) in the question
    lo = np.maximum(row_idx - k, 0)
    hi = np.minimum(row_idx + k, num_rows)

    # pick one random neighbour per row, skipping the row itself
    selected_neighbour_indices = lo + rng.integers(0, hi - lo - 1)
    selected_neighbour_indices[selected_neighbour_indices >= row_idx] += 1

    # one random factor per row, broadcast across the modifiable columns
    factors = rng.uniform(size=(num_rows, 1))

    # difference between each row and its selected neighbour
    differences = values - values[selected_neighbour_indices]
    if use_abs_value:
        differences = np.abs(differences)

    # write the result into a copy so the input DataFrame is left untouched
    df_averaged = df.copy()
    df_averaged[modifiable_columns] = values + factors * differences
    return df_averaged

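I think this will be much faster than the original script, because the per-row work is done by NumPy on whole arrays instead of one pandas lookup per iteration.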
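The part that is easiest to get wrong when vectorizing the question's loop is the random neighbour choice: one neighbour per row, drawn from the window around that row, never the row itself. Here is a minimal standalone sketch of that step so it can be checked on its own. The function name `sample_neighbours` and the seeded generator are only for illustration and are not part of the question's code. It keeps the question's window `range(max(i-k, 0), min(i+k, num_rows))`, which reaches k rows to the left but only k-1 to the right.

```python
import numpy as np

def sample_neighbours(num_rows: int, k: int, rng: np.random.Generator) -> np.ndarray:
    """For each row i, draw one index from range(max(i-k, 0), min(i+k, num_rows)), excluding i."""
    idx = np.arange(num_rows)
    lo = np.maximum(idx - k, 0)         # first candidate (inclusive)
    hi = np.minimum(idx + k, num_rows)  # one past the last candidate, as in the question's range()
    # Draw a position among the (hi - lo - 1) candidates left once i is removed.
    offset = rng.integers(0, hi - lo - 1)
    neighbour = lo + offset
    # Positions at or after i shift up by one so that i itself is skipped.
    neighbour[neighbour >= idx] += 1
    return neighbour

rng = np.random.default_rng(0)
idx = np.arange(1_000)
neighbours = sample_neighbours(num_rows=1_000, k=15, rng=rng)

# Sanity checks: no row picks itself, and every pick stays inside its window.
print((neighbours != idx).all())
print((neighbours >= np.maximum(idx - 15, 0)).all())
print((neighbours < np.minimum(idx + 15, 1_000)).all())
```

Like `choice` on an empty list in the original loop, this raises an error when a row has no candidates at all (for example k = 1 at the first row), so it assumes k >= 2 and at least two rows.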

