How to optimize the following algorithm that iterates over a DataFrame of a few million rows?

I have the following algorithm that iterates over a DataFrame with a few million rows. It takes a long time to finish. Do you have any suggestions?

import pandas as pd
from random import choice, uniform

def k_nn_averaging(df: pd.DataFrame, k: int = 15, use_abs_value: bool = False) -> pd.DataFrame:
    df_averaged = df.copy()
    df[helper.modifiable_columns] = df[helper.modifiable_columns].astype(float)
    df_averaged[helper.modifiable_columns] = df_averaged[helper.modifiable_columns].astype(float)
    for i in range(0, df.shape[0]):
        neighbours = list(range(i-k if i-k >= 0 else 0, i+k if i+k <= df_averaged.shape[0] else df_averaged.shape[0]))
        neighbours.remove(i)
        selectedNeighbourIndex = choice(neighbours)
        factor = uniform(0,1)
        currentSampleValues = df[helper.modifiable_columns].iloc[i]
        neighbourSampleValues = df[helper.modifiable_columns].iloc[selectedNeighbourIndex]
        average = 0
        if not use_abs_value: average = factor*(currentSampleValues - neighbourSampleValues)
        else: average = factor*(abs(currentSampleValues - neighbourSampleValues)) 
        df_averaged.loc[i,helper.modifiable_columns] = currentSampleValues + average
    return df_averaged

The first thing to try with a row-by-row loop like this is to vectorize it. Here is a modified version that avoids the Python loop and uses NumPy operations instead:

import numpy as np
import pandas as pd

def k_nn_averaging(df: pd.DataFrame, k: int = 15, use_abs_value: bool = False) -> pd.DataFrame:
    # helper.modifiable_columns comes from the question's own code
    modifiable_columns = helper.modifiable_columns
    df_averaged = df.copy()
    df_averaged[modifiable_columns] = df_averaged[modifiable_columns].astype(float)
    num_rows = df.shape[0]
    values = df[modifiable_columns].to_numpy(dtype=float)

    # candidate window [i-k, i+k) for every row, clipped to the DataFrame
    idx = np.arange(num_rows)
    lower = np.maximum(idx - k, 0)
    upper = np.minimum(idx + k, num_rows)

    # pick one random neighbour per row, skipping the row itself
    # (needs k >= 2 and at least two rows, like the original loop)
    offsets = np.random.randint(0, upper - lower - 1)
    selected_neighbour_indices = lower + offsets
    selected_neighbour_indices[selected_neighbour_indices >= idx] += 1

    # one random factor per row, broadcast across the columns
    factors = np.random.uniform(size=(num_rows, 1))

    # select the neighbour values for each row
    neighbour_values = values[selected_neighbour_indices]

    # calculate the scaled difference to add to each row
    differences = values - neighbour_values
    if use_abs_value:
        differences = np.abs(differences)
    averages = factors * differences

    # update the values in the output DataFrame
    df_averaged[modifiable_columns] = values + averages

    return df_averaged

I think this will be much faster than the original script, because the per-row work now happens inside NumPy rather than in the Python interpreter.
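
One step worth checking in isolation is how to draw a single random neighbour per row without a Python loop, which the question's loop does with `choice`. The following is a minimal sketch, not a definitive implementation. It assumes the same window as the question's loop, positions `i-k` up to but not including `i+k`, clipped to the frame and excluding row `i` itself. The function name `sample_neighbour_indices` is hypothetical and does not appear in the original post.

```python
import numpy as np

def sample_neighbour_indices(num_rows: int, k: int, rng: np.random.Generator) -> np.ndarray:
    """Draw one random neighbour position per row from [i-k, i+k), excluding i.

    Assumes k >= 2 and num_rows >= 2 so that every row has at least one
    candidate (the question's loop fails on the same inputs otherwise).
    """
    idx = np.arange(num_rows)
    lower = np.maximum(idx - k, 0)          # first candidate position (inclusive)
    upper = np.minimum(idx + k, num_rows)   # end of the window (exclusive)
    n_candidates = upper - lower - 1        # window size minus the row itself

    # Draw one offset into each row's candidate list.
    offsets = rng.integers(0, n_candidates)
    neighbours = lower + offsets

    # Candidates at or after the row's own position shift by one to skip it.
    neighbours[neighbours >= idx] += 1
    return neighbours

rng = np.random.default_rng(0)
neighbours = sample_neighbour_indices(num_rows=10, k=3, rng=rng)
print(neighbours)
```

Because each offset is drawn uniformly from the row's candidate count and the row's own position is skipped afterwards, every remaining position in the window is equally likely, which mirrors `neighbours.remove(i)` followed by `choice(neighbours)`.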
