Pandas Dataframe: to_dict() poor performance
I work with APIs that return large pandas dataframes. I'm not aware of a fast way to iterate through the dataframe directly, so I cast it to a dictionary with to_dict().
Once my data is in dictionary form, the performance is fine. However, the to_dict() operation itself tends to be a performance bottleneck.
I often group columns of the dataframe together to form a multi-index and use the 'index' orientation for to_dict(). I'm not sure whether the large multi-index is what drives the poor performance.
Is there a faster way to cast a pandas dataframe? Or is there a better way to iterate directly over the dataframe without any cast? I'm not sure whether vectorization could be applied here.
Below is sample code, with timings, that mimics the issue:
import pandas as pd
import random as rd
import time
#Given a dataframe from api (model as random numbers)
df_columns = ['A','B','C','D','F','G','H','I']
dict_origin = {col:[rd.randint(0,10) for x in range(0,1000)] for col in df_columns}
dict_origin = pd.DataFrame(dict_origin)
#Transform to pivot table
t0 = time.time()
df_pivot = pd.pivot_table(dict_origin,values=df_columns[-3:],index=df_columns[:-3])
t1 = time.time()
print('Pivot Construction takes: ' + str(t1-t0))
#Iterate over all elements in pivot table
t0 = time.time()
for column in df_pivot.columns:
    for row in df_pivot[column].index:
        test = df_pivot[column].loc[row]
t1 = time.time()
print('Dataframe iteration takes: ' + str(t1-t0))
#Iteration over dataframe too slow. Cast to dictionary (bottleneck)
t0 = time.time()
df_pivot = df_pivot.to_dict('index')
t1 = time.time()
print('Cast to dictionary takes: ' + str(t1-t0))
#Iteration over dictionary is much faster
t0 = time.time()
for row in df_pivot.keys():
    for column in df_pivot[row]:
        test = df_pivot[row][column]
t1 = time.time()
print('Iteration over dictionary takes: ' + str(t1-t0))
Thank you!
The common guidance is: don't iterate. Use functions that operate on all rows and columns, or on grouped rows/columns. If you do need to iterate, the third timed section in the code below shows how to loop over the NumPy array exposed by the .values attribute. The results are:
Pivot Construction takes: 0.012315988540649414
Dataframe iteration takes: 0.32346272468566895
Iteration over values takes: 0.004369020462036133
Cast to dictionary takes: 0.023524761199951172
Iteration over dictionary takes: 0.0010480880737304688
import pandas as pd
import random as rd
import time
#Given a dataframe from api (model as random numbers)
df_columns = ['A','B','C','D','F','G','H','I']
dict_origin = {col:[rd.randint(0,10) for x in range(0,1000)] for col in df_columns}
dict_origin = pd.DataFrame(dict_origin)
#Transform to pivot table
t0 = time.time()
df_pivot = pd.pivot_table(dict_origin,values=df_columns[-3:],index=df_columns[:-3])
t1 = time.time()
print('Pivot Construction takes: ' + str(t1-t0))
#Iterate over all elements in pivot table
t0 = time.time()
for column in df_pivot.columns:
    for row in df_pivot[column].index:
        test = df_pivot[column].loc[row]
t1 = time.time()
print('Dataframe iteration takes: ' + str(t1-t0))
#Iterate over all values in pivot table
t0 = time.time()
v = df_pivot.values
for row in range(df_pivot.shape[0]):
    for column in range(df_pivot.shape[1]):
        test = v[row, column]
t1 = time.time()
print('Iteration over values takes: ' + str(t1-t0))
#Iteration over dataframe too slow. Cast to dictionary (bottleneck)
t0 = time.time()
df_pivot = df_pivot.to_dict('index')
t1 = time.time()
print('Cast to dictionary takes: ' + str(t1-t0))
#Iteration over dictionary is much faster
t0 = time.time()
for row in df_pivot.keys():
    for column in df_pivot[row]:
        test = df_pivot[row][column]
t1 = time.time()
print('Iteration over dictionary takes: ' + str(t1-t0))
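To make the "don't iterate" advice concrete, here is a minimal sketch using the same kind of random test data. The operations in the first part (column sums, scaling, per-row maximum) are only illustrative, since the question doesn't say what is done with each element. The second part shows one possible way to build the same 'index'-oriented dictionary from the underlying NumPy array if a dict is still required. Whether it beats to_dict('index') depends on your pandas version and data, so it is worth timing on your own dataframe.

```python
import random as rd

import pandas as pd

# Same shape of test data as above (random integers standing in for the API result)
df_columns = ['A', 'B', 'C', 'D', 'F', 'G', 'H', 'I']
df = pd.DataFrame({col: [rd.randint(0, 10) for _ in range(1000)] for col in df_columns})
df_pivot = pd.pivot_table(df, values=df_columns[-3:], index=df_columns[:-3])

# 1) Vectorized: operate on whole columns/rows at once instead of visiting each cell.
#    These three operations are illustrative placeholders, not taken from the question.
column_sums = df_pivot.sum()        # one value per column
scaled = df_pivot * 2               # elementwise, no Python-level loop
row_max = df_pivot.max(axis=1)      # one value per row

# 2) If a dict is still needed, build it from the NumPy array.
#    Keys are the multi-index tuples, matching the 'index' orientation.
columns = list(df_pivot.columns)
values = df_pivot.to_numpy()
as_dict = {
    key: dict(zip(columns, row))
    for key, row in zip(df_pivot.index, values.tolist())
}
```

Calling values.tolist() converts the whole array to plain Python floats in one step, so the comprehension only deals with built-in types rather than pandas objects.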