Python performance comparison: converting an iterator to a list and assigning elements by index (after preallocating) vs. appending while iterating
Long story short: which is better?
Given an iterator (e.g. after reading a CSV, or getting query results from a DB), which of the following would give better performance, and why?
First approach: iterate using the iterator and append to the required lists. Something like:
element1_list = []
element2_list = []
for row in rows:
    element1_list.append(row[element1_index])
    element2_list.append(row[element2_index])
Second approach: convert the iterator to a list, preallocate the output lists, and assign elements by index:
row_list = list(rows)
length = len(row_list)
element1_list = [None] * length
element2_list = [None] * length
for i in range(length):
    element1_list[i] = row_list[i][element1_index]
    element2_list[i] = row_list[i][element2_index]
Preallocation has its own benefits. But the conversion to a list is itself an iteration. So which approach should I choose, and why? It would also be interesting to know what happens under the hood.
EDIT: Emphasizing again, I would like to know about the fundamental differences between these approaches, NOT merely run timeit and do an empirical analysis. I want the timing to back up the theory, not the other way around.
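One way to peek under the hood (a CPython-specific sketch, not part of the original question) is to watch `sys.getsizeof` while appending: CPython's `list` over-allocates geometrically, so an append-built list is resized only a handful of times (amortized O(1) per append), while a preallocated list is sized exactly once up front.

```python
import sys

# Build a list by appending and record every distinct allocation size.
# CPython over-allocates, so the size jumps in steps, not once per element.
appended = []
sizes = set()
for i in range(100):
    appended.append(i)
    sizes.add(sys.getsizeof(appended))

# A preallocated list of the same length is allocated exactly once.
preallocated = [None] * 100

print(sorted(sizes))                # only a handful of distinct sizes, not 100
print(sys.getsizeof(preallocated))  # a single fixed size
```

The exact sizes are implementation details of CPython's `listobject` growth pattern, but the point stands: appending does not reallocate on every element.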
Some of the performance criteria may be:
Since everyone recommends timeit, here are the results and the accompanying code.
The code I used for testing is as follows:
import timeit
import matplotlib.pyplot as plt
import csv

index_element_1 = 0
index_element_2 = 2

def create_csv(num_records):
    '''Creates a test CSV'''
    with open('test.csv', 'w') as f:
        f.write("10,20,30,40,50\n" * num_records)

def read_csv(filename):
    '''Returns an iterator over the CSV rows'''
    return csv.reader(open(filename, newline=''))

def convert_to_list_method():
    csv_list = list(csv_iterator)
    length_list = len(csv_list)
    x = [None] * length_list
    y = [None] * length_list
    for i in range(length_list):
        x[i] = csv_list[i][index_element_1]
        y[i] = csv_list[i][index_element_2]
    return [x, y]

def iterate_and_append_method():
    x = []
    y = []
    for row in csv_iterator:
        x.append(row[index_element_1])
        y.append(row[index_element_2])
    return [x, y]
CSV_SIZE = range(10000, 1010000, 10000)
time_convert_to_list = [0] * len(CSV_SIZE)
time_iterate = [0] * len(CSV_SIZE)
count = 0
for csv_size in CSV_SIZE:
    create_csv(csv_size)
    csv_iterator = read_csv('test.csv')
    time_convert_to_list[count] = timeit.timeit("convert_to_list_method()", setup="from __main__ import *", number=1)
    csv_iterator = read_csv('test.csv')
    time_iterate[count] = timeit.timeit("iterate_and_append_method()", setup="from __main__ import *", number=1)
    count = count + 1

plt.xlabel('CSV Size')
plt.ylabel('Time (s)')
plt.plot(CSV_SIZE, time_convert_to_list, label='Convert to List')
plt.plot(CSV_SIZE, time_iterate, label='Iterate')
plt.legend()
plt.show()
The results don't vary much. I think all the comments were right: it does not really make much of a difference.
NB: I used only one repetition (number=1) for each function in timeit, since otherwise the iterator would have to be recreated between runs, as it gets consumed by the previous one.
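To illustrate the exhaustion issue (a minimal standalone sketch, independent of the CSV benchmark above): once an iterator has been consumed, a second pass over it yields nothing, which is why each timed run needs a freshly created iterator.

```python
# A plain iterator over some rows
rows = iter([(10, 20, 30), (40, 50, 60)])

first_pass = list(rows)   # consumes the iterator
second_pass = list(rows)  # already exhausted, so this is empty

print(first_pass)   # [(10, 20, 30), (40, 50, 60)]
print(second_pass)  # []
```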