Optimization of a for loop over a huge list of tuples
I have a list of tuples named permuted_trucks of size 38,320,568, where each tuple holds 7 values, and I am trying to append the sum of the values of every tuple to another list.
In the code below, cargo_list is a list containing cargo names (a NumPy array of size 7), and distances_df is a (44, 7) pandas DataFrame.
truck_list is a NumPy array of size 49 holding the same values that appear in the tuples. The tuples represent the combinations of all 49 trucks picking up 7 different products.
I am running this loop:
    for i in range(0, len(permuted_trucks)):
        total_distance = 0.0
        for j in range(0, len(cargo_list)):
            truck_index = np.where(truck_list == permuted_trucks[i][j])[0][0]
            total_distance += distances_df.iloc[truck_index][j]
        if total_distance < 0 or total_distance < best_distance:
            best_distance = total_distance
            best_distance_index = i
        all_distances_index.append(i)
        all_distances.append(total_distance)
The problem is that it is extremely slow, and I am looking for ways to optimize it. Can anyone help?
Example tuples:
    [('Hartford', 'Bey', 'Empire', 'James', 'Ibrahim', 'John', 'Macomb'),
     ('Home', 'Bey', 'Empire', 'James', 'Ibrahim', 'John', 'Robert'),
     ('Horse', 'Bey', 'Empire', 'James', 'Ibrahim', 'John', 'Viking')]
The line below accumulates the distances contained in the DataFrame distances_df:

    total_distance += distances_df.iloc[truck_index][j]

The output all_distances will be an array of size 38,320,568 holding the total distances:

    [34125, 21252, 13232, 512313, ..... 31231]
You can get an immediate speedup by using NumPy arrays instead of the DataFrame; element access on a DataFrame is slow.
Edit: adding code to show the timing differences between the different approaches.
Import the required modules:

    import time
    import pandas as pd
    import numpy as np
Since we don't have the full data, this is the synthetic setup I used to show the differences:
    ## All arrays
    permuted_trucks = np.random.randint(7, size=(100000, 7))
    cargo_list = np.random.randint(7, size=(1, 7))
    truck_list = np.random.randint(7, size=(49, 1))

    ## Converting the arrays to lists to show the difference between lists and arrays
    permuted_trucks_list = permuted_trucks.tolist()
    cargo_list_list = cargo_list.tolist()
    truck_list_list = truck_list.tolist()

    ## Array
    distances_df_array = np.random.randint(7, size=(44, 7))

    ## DataFrame
    distances_df = pd.DataFrame(distances_df_array)
First, your original approach: lists plus a DataFrame.
    # Time taken for lists and dataframe
    start_time = time.time()
    all_distances = []
    all_distances_index = []
    best_distance = 10
    for i in range(0, len(permuted_trucks_list)):
        total_distance = 0.0
        for j in range(0, len(cargo_list_list)):
            truck_index = np.where(truck_list_list == permuted_trucks[i][j])[0][0]
            total_distance += distances_df.iloc[truck_index][j]  # using dataframe
        if total_distance < 0 or total_distance < best_distance:
            best_distance = total_distance
            best_distance_index = i
        all_distances_index.append(i)
        all_distances.append(total_distance)
    end_time = time.time()
    print("time taken for list and data frame : {}".format(end_time - start_time))
Output: time taken for list and data frame: 20.7517249584198
Now let's see what happens when we use lists and an array (avoiding the DataFrame):
    # Time taken for lists and array
    start_time = time.time()
    all_distances = []
    all_distances_index = []
    best_distance = 10
    for i in range(0, len(permuted_trucks_list)):
        total_distance = 0.0
        for j in range(0, len(cargo_list_list)):
            truck_index = np.where(truck_list_list == permuted_trucks[i][j])[0][0]
            total_distance += distances_df_array[truck_index, j]  # using array here
        if total_distance < 0 or total_distance < best_distance:
            best_distance = total_distance
            best_distance_index = i
        all_distances_index.append(i)
        all_distances.append(total_distance)
    end_time = time.time()
    print("time taken for list and array : {}".format(end_time - start_time))
Output: time taken for list and array: 3.075411319732666
You can see a significant improvement in execution time.
Finally, let's also check using NumPy arrays throughout:
    # Time taken for numpy arrays without vectorization
    start_time = time.time()
    all_distances_array = np.zeros((100000, 1))
    all_distances_index_array = np.zeros((100000, 1))
    best_distance = 10
    for i in range(0, len(permuted_trucks)):
        total_distance = 0.0
        for j in range(0, len(cargo_list)):
            truck_index = np.where(truck_list == permuted_trucks[i][j])[0][0]
            total_distance += distances_df_array[truck_index, j]  # using array here
        if total_distance < 0 or total_distance < best_distance:
            best_distance = total_distance
            best_distance_index = i
        all_distances_index_array[i] = i
        all_distances_array[i] = total_distance
    end_time = time.time()
    print("time taken for numpy arrays : {}".format(end_time - start_time))
Output: time taken for numpy arrays: 1.1893165111541748
Now you can see the difference and how slow the DataFrame is. NumPy can be much faster still if you manage to vectorize the computation, but that can only be verified against the original data.
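As a rough sketch of what vectorization could look like here (the shapes and random data below are stand-ins for the real truck/cargo data, and I assume truck ids are unique): build the value-to-row lookup once with np.searchsorted instead of calling np.where per cell, then compute every total in one pass with fancy indexing.

```python
import numpy as np

# Hypothetical stand-ins for the real data: 100000 permutations,
# 49 unique truck ids, 7 cargo slots.
rng = np.random.default_rng(0)
n_perm, n_cargo, n_trucks = 100_000, 7, 49

truck_list = rng.permutation(n_trucks)  # 49 unique truck ids
permuted_trucks = truck_list[rng.integers(0, n_trucks, size=(n_perm, n_cargo))]
distances = rng.random((n_trucks, n_cargo))

# One searchsorted over all cells replaces n_perm * n_cargo np.where calls:
# `order` sorts truck_list, searchsorted finds each value's sorted position,
# and order[...] maps that position back to the original row index.
order = np.argsort(truck_list)
truck_idx = order[np.searchsorted(truck_list[order], permuted_trucks)]

# Fancy indexing picks distances[truck_idx[i, j], j] for every (i, j) at once;
# sum(axis=1) then yields all n_perm totals without a Python loop.
all_distances = distances[truck_idx, np.arange(n_cargo)].sum(axis=1)

best_distance_index = int(np.argmin(all_distances))
best_distance = all_distances[best_distance_index]
```

Per-element work moves entirely into compiled NumPy code, so this scales to tens of millions of rows far better than the loops above (memory permitting).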