Faster way of shifting data across columns in dataframe pandas

I have a dataframe organized by variables (a, b) and time values (1 to 5). Each column name combines a variable and a time value (e.g. "a_1"). However, I need to convert these absolute time values into relative ones. For that, I have another dataframe with a reference indicator stating how many time positions to move.

Therefore, I want to shift the values in each row according to a reference indicator, which is stored in another dataframe and varies by index.

Example: if the reference indicator for a specific index is 3, I want the data in that row to move left until position 3 lands in a_1 (so it moves 3 - 1 = 2 places), such as:

Original:

        a_1       a_2       a_3      a_4       a_5
0  0.854592  0.677819  0.071725  0.29312  0.948375

Shifted:

        a_1      a_2       a_3  a_4  a_5
0  0.071725  0.29312  0.948375  NaN  NaN
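
To make the single-row example concrete, here is a minimal sketch that applies Series.shift to the exact values shown above. The names row, indicator and shifted are only for illustration and are not part of the question's code.

```python
import pandas as pd

# The single row from the example above.
row = pd.Series(
    [0.854592, 0.677819, 0.071725, 0.29312, 0.948375],
    index=["a_1", "a_2", "a_3", "a_4", "a_5"],
)

indicator = 3  # reference indicator for this row

# Shift left by (indicator - 1) positions; the vacated slots become NaN.
shifted = row.shift(-(indicator - 1))
print(shifted)
```

A negative argument to shift moves values toward the start of the Series, which is why the question's code negates the indicator before adding 1.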

I wrote the code below, which achieves the desired outcome, but it takes a long time to compute (I'm actually testing with an index of 100k rows).

I would appreciate any help in optimizing the code.

Reproducible code:

import numpy as np
import pandas as pd

# main data to be shifted
var_names = ['a','b']
df_example = pd.DataFrame(np.random.rand(1000,10),index=range(0,1000))
df_example.columns = [var_name +"_"+str(j) for var_name in var_names for j in range(1, 6)]

# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(np.random.randint(1,5, size = (1000,1)),index=range(0,1000), columns = ['moving_indicator'])

list_vars_shifted = []
for var in var_names:
    df_vars_shifted = pd.concat([df_ref.loc[:,'moving_indicator'],
                                              df_example.filter(like=var)], axis = 1)
    
    # Shift according to the month indicator (hence +1) - SLOW
    df_vars_shifted = (df_vars_shifted.apply(lambda x : x.shift(-(int(x['moving_indicator']))+1) , axis=1)
                                .drop(columns=['moving_indicator']))
    
    list_vars_shifted.append(df_vars_shifted)

# Convert to dataframe
df_all_vars_shifted = pd.concat(list_vars_shifted, axis=1)

How about this? I didn't run timing tests because I ran out of time. I added some printouts of the looped dataframes to show what is happening. I changed the moving indicator so that 0 means no movement; that way periods= can be 0 and nothing shifts. The .replace could be dangerous depending on the data, so this is a little rough.

    import numpy as np
    import pandas as pd

    # main data to be shifted
    df_a = pd.DataFrame(np.random.rand(1000,5),index=range(0,1000))
    df_a.columns = [
        var_name +"_"+str(j) for var_name in ['a'] for j in range(1, 6)
    ]

    df_b = pd.DataFrame(np.random.rand(1000,5),index=range(0,1000))
    df_b.columns = [
        var_name +"_"+str(j) for var_name in ['b'] for j in range(1, 6)
    ]

    # reference index to determine how many places to be shifted
    df_ref = pd.DataFrame(
        np.random.randint(0, 4, size=(1000,1)),
        index=range(0,1000),
        columns=['moving_indicator']
    )

    df_a = df_a.merge(df_ref, how='inner', left_index=True, right_index=True)
    df_grp = df_a.groupby('moving_indicator')
    new_df_a = pd.DataFrame([])
    for indicator, gdf in df_grp:
        print(indicator)  # show which group is being processed
        indicator = indicator * -1  # negative periods shift to the left
        gdf = gdf.shift(periods=indicator, axis=1)
        print(gdf)  # show the shifted group
        new_df_a = pd.concat([new_df_a, gdf])
    
    new_df_a = new_df_a.sort_index()
    new_df_a = (
        new_df_a.replace({3: np.nan, 2: np.nan, 1: np.nan})
        .drop('moving_indicator', axis=1)
    )

    df_b = df_b.merge(df_ref, how='inner', left_index=True, right_index=True)
    df_grp = df_b.groupby('moving_indicator')
    new_df_b = pd.DataFrame([])
    for indicator, gdf in df_grp:
        print(indicator)  # show which group is being processed
        indicator = indicator * -1  # negative periods shift to the left
        gdf = gdf.shift(periods=indicator, axis=1)
        print(gdf)  # show the shifted group
        new_df_b = pd.concat([new_df_b, gdf])
    
    new_df_b = new_df_b.sort_index()
    new_df_b = (
        new_df_b.replace({3: np.nan, 2: np.nan, 1: np.nan})
        .drop('moving_indicator', axis=1)
    )

    final_df = new_df_a.merge(
        new_df_b, how='inner', left_index=True, right_index=True
    )
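
Since the .replace step can wipe out genuine data values that happen to equal 1, 2 or 3, one possible variation is to keep the indicator out of the shifted frame entirely. The sketch below is an assumption-laden rewrite of the same groupby idea, not the code that was timed: shift_by_group is a hypothetical helper, and the indicator is passed as a separate Series where 0 means no shift.

```python
import numpy as np
import pandas as pd


def shift_by_group(df_data, indicator):
    """Shift each row of df_data left by its indicator value (0 = no shift)."""
    pieces = []
    # Group the data rows by indicator value. The indicator is only used as
    # the grouping key, so it never leaks into the shifted columns.
    for k, gdf in df_data.groupby(indicator):
        pieces.append(gdf.shift(periods=-int(k), axis=1))
    # Restore the original row order.
    return pd.concat(pieces).sort_index()


rng = np.random.default_rng(0)
df_a = pd.DataFrame(rng.random((6, 5)), columns=[f"a_{j}" for j in range(1, 6)])
moving_indicator = pd.Series([0, 1, 2, 3, 1, 0], index=df_a.index)

new_df_a = shift_by_group(df_a, moving_indicator)
print(new_df_a)
```

Because only the data columns are shifted, no .replace or .drop is needed afterwards.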

Edit: Here are the timings.

Question version:

>>> print(timeit.repeat(dummy, repeat=5, number=1))
[0.1520585000034771, 0.1450397999942652, 0.1416596999988542,
0.14743759999691974, 0.14560850000270875]

My version:

>>> print(timeit.repeat(my_func, repeat=5, number=1))
[0.022981900001468603, 0.0159782000046107, 0.01633900000160793,
0.015842399996472523, 0.01663669999834383]
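
The timings above call timeit.repeat on zero-argument functions named dummy and my_func, whose bodies are not shown. A minimal sketch of how such a harness might look follows. The function bodies here are reconstructions for illustration (row-wise apply versus one shift per group, with 0 meaning no shift), not the exact functions that produced the numbers above.

```python
import timeit

import numpy as np
import pandas as pd

n_rows = 1000
df_a = pd.DataFrame(
    np.random.rand(n_rows, 5), columns=[f"a_{j}" for j in range(1, 6)]
)
indicator = pd.Series(np.random.randint(0, 4, size=n_rows), index=df_a.index)


def dummy():
    # Row-wise apply, in the spirit of the question's version.
    return df_a.apply(lambda row: row.shift(-int(indicator[row.name])), axis=1)


def my_func():
    # One DataFrame.shift per distinct indicator value.
    pieces = [g.shift(periods=-int(k), axis=1) for k, g in df_a.groupby(indicator)]
    return pd.concat(pieces).sort_index()


# Five independent runs, each executing the function once.
timings_apply = timeit.repeat(dummy, repeat=5, number=1)
timings_group = timeit.repeat(my_func, repeat=5, number=1)
print(timings_apply)
print(timings_group)
```

Using number=1 with repeat=5 gives five separate wall-clock measurements, so a slow first run (for example from imports or caching) is easy to spot.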

I tried different approaches, and the best one was to use a list comprehension + shift + dataframe.where():

import numpy as np
import pandas as pd

var_names = ['a','b']
df_example = pd.DataFrame(np.random.rand(10000,20),index=range(0,10000))
df_example.columns = [var_name +"_"+str(j) for var_name in var_names for j in range(1, 11)]

# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(np.random.randint(1,5, size = (10000,1)),index=range(0,10000), columns = ['moving_indicator'])

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)

    list_shifted_variables = [df_vars.shift(-(indicator)+1, axis = 1).where(indicator == df_vars['moving_indicator']).dropna( how = 'all') for indicator in np.unique(df_vars['moving_indicator'])]
    df_vars_shifted = pd.concat(list_shifted_variables).sort_index().drop(columns=['moving_indicator'])
    list_vars_shifted.append(df_vars_shifted)

df_all_vars_shifted_6 = pd.concat(list_vars_shifted, axis=1)
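
For comparison, here is a sketch of a different technique that is not among the approaches tried in this thread: computing the source column for every cell with NumPy and gathering the values with np.take_along_axis, so there is no Python-level loop over rows or indicator values. The helper name shift_rows_left is hypothetical, and the sketch assumes all data columns are numeric and every indicator is at least 1.

```python
import numpy as np
import pandas as pd


def shift_rows_left(df_data, indicator):
    """Shift row i of df_data left by (indicator[i] - 1) columns, NaN-padded."""
    values = df_data.to_numpy(dtype=float)
    n_cols = values.shape[1]
    offsets = np.asarray(indicator, dtype=int) - 1
    # For each output cell, the column it should read from in the same row.
    src_cols = np.arange(n_cols) + offsets[:, None]
    valid = src_cols < n_cols
    # Out-of-range cells read column 0 as a placeholder and are blanked below.
    src_cols = np.where(valid, src_cols, 0)
    shifted = np.take_along_axis(values, src_cols, axis=1)
    shifted[~valid] = np.nan
    return pd.DataFrame(shifted, index=df_data.index, columns=df_data.columns)


df_a = pd.DataFrame(
    [[0.854592, 0.677819, 0.071725, 0.29312, 0.948375]],
    columns=[f"a_{j}" for j in range(1, 6)],
)
result = shift_rows_left(df_a, [3])
print(result)
```

This would be called once per variable (for example on df_example.filter(like=var)); whether it is actually faster on a given dataset should be measured rather than assumed.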

Full code with all the different approaches:

import numpy as np
import pandas as pd
import swifter

import time

t1 = time.process_time()

# main data to be shifted
var_names = ['a','b']
df_example = pd.DataFrame(np.random.rand(10000,20),index=range(0,10000))
df_example.columns = [var_name +"_"+str(j) for var_name in var_names for j in range(1, 11)]

# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(np.random.randint(1,5, size = (10000,1)),index=range(0,10000), columns = ['moving_indicator'])

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'],
                                              df_example.filter(like=var)], axis = 1)
    
    # Shift according to the month indicator (hence +1) - SLOW
    df_vars_shifted = (df_vars.apply(lambda x : x.shift(-(int(x['moving_indicator']))+1) , axis=1)
                                .drop(columns=['moving_indicator']))
    
    list_vars_shifted.append(df_vars_shifted)

# Convert to dataframe
df_all_vars_shifted = pd.concat(list_vars_shifted, axis=1)


elapsed_time1 = time.process_time() - t1
print(elapsed_time1)




t2 = time.process_time()

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'],
                                              df_example.filter(like=var)], axis = 1)
    
    # Shift according to the month indicator (hence +1) - SLOW
    df_vars_shifted = (df_vars.swifter.apply(lambda x : x.shift(-(int(x['moving_indicator']))+1) , axis=1)
                                .drop(columns=['moving_indicator']))
    
    list_vars_shifted.append(df_vars_shifted)

# Convert to dataframe
df_all_vars_shifted_2 = pd.concat(list_vars_shifted, axis=1)


elapsed_time2 = time.process_time() - t2
print(elapsed_time2)



t3 = time.process_time()

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)
    
    # Shift according to the month indicator (hence +1) - SLOW
    df_vars_shifted = pd.DataFrame([df_vars.iloc[i].shift(-(int(df_vars.iloc[i,0]))+1) for i in range(len(df_vars))]).drop(columns=['moving_indicator'])
    
    list_vars_shifted.append(df_vars_shifted)

# Convert to dataframe
df_all_vars_shifted_3 = pd.concat(list_vars_shifted, axis=1)


elapsed_time3 = time.process_time() - t3
print(elapsed_time3)



t4 = time.process_time()

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)
    
    # Shift according to the month indicator (hence +1) - SLOW
    df_vars_shifted = pd.DataFrame(row[1].shift(-(int(row[1]['moving_indicator']))+1) for row in df_vars.iterrows()).drop(columns=['moving_indicator'])
    
    list_vars_shifted.append(df_vars_shifted)

# Convert to dataframe
df_all_vars_shifted_4 = pd.concat(list_vars_shifted, axis=1)


elapsed_time4 = time.process_time() - t4
print(elapsed_time4)


t5 = time.process_time()

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)

    list_test = []
    for indicator in np.unique(df_vars['moving_indicator']):
        df_test10 = df_vars.shift(-(indicator)+1, axis = 1).where(indicator == df_vars['moving_indicator']).dropna( how = 'all')
        list_test.append(df_test10)
        
    # Combine the groups once per variable, after the inner loop has finished
    df_vars_shifted = pd.concat(list_test).sort_index().drop(columns=['moving_indicator'])
    list_vars_shifted.append(df_vars_shifted)

df_all_vars_shifted_5 = pd.concat(list_vars_shifted, axis=1)

elapsed_time5 = time.process_time() - t5
print(elapsed_time5)



t6 = time.process_time()

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)

    list_shifted_variables = [df_vars.shift(-(indicator)+1, axis = 1).where(indicator == df_vars['moving_indicator']).dropna( how = 'all') for indicator in np.unique(df_vars['moving_indicator'])]
    df_vars_shifted = pd.concat(list_shifted_variables).sort_index().drop(columns=['moving_indicator'])
    list_vars_shifted.append(df_vars_shifted)

df_all_vars_shifted_6 = pd.concat(list_vars_shifted, axis=1)

elapsed_time6 = time.process_time() - t6
print(elapsed_time6)
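
The script above only prints elapsed times and never checks that the approaches produce the same result. A minimal sketch of such a check is shown below, under some assumptions: it uses a small random frame, a condensed version of the per-indicator shift + where approach applied to the data columns only, and the row-wise apply as the reference. The names data, expected and candidate are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 200
data = pd.DataFrame(
    rng.random((n_rows, 5)), columns=[f"a_{j}" for j in range(1, 6)]
)
moving_indicator = pd.Series(rng.integers(1, 5, size=n_rows), index=data.index)

# Reference result: one Series.shift per row (slow but easy to reason about).
expected = data.apply(
    lambda row: row.shift(-(int(moving_indicator[row.name]) - 1)), axis=1
)

# Candidate result: one DataFrame.shift per distinct indicator value, keeping
# only the rows whose indicator matches and dropping the all-NaN remainder.
pieces = [
    data.shift(-(int(k) - 1), axis=1)
    .where(moving_indicator == k)
    .dropna(how="all")
    for k in np.unique(moving_indicator)
]
candidate = pd.concat(pieces).sort_index()

# Raises AssertionError if values, NaN positions, or labels differ.
pd.testing.assert_frame_equal(candidate, expected)
print("results match")
```

Running a check like this on a small sample before timing the 100k-row case helps catch mistakes such as a misplaced concat inside the inner loop.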
