Return iterator vs Return whole list in Python?

I tested some code to find out which is more efficient: returning an iterator or returning a whole list.

The program reads every line of a .txt file (a really big one) and builds a word-count dictionary (Python 3.4).

1. Returning an iterator

from collections import defaultdict
import time

def create_word_cnt_dict(line_iter):
    doc_vector = defaultdict(int)
    for line in line_iter:
        for word in line.split():
            doc_vector[word] += 1
    return dict(doc_vector)

def read_doc(doc_file):
    # Yield the file's lines one at a time.
    with open(doc_file) as f:
        while True:
            line = f.readline()
            if not line:
                break
            yield line

t0 = time.time()
line_iter = read_doc("./doc1.txt")
doc_vector = create_word_cnt_dict(line_iter)
t1 = time.time()
print(t1-t0)

It takes 3.765739917755127 seconds.
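As an aside, the hand-rolled readline loop isn't required: a file object already iterates over its lines, so an equivalent generator can simply delegate to it. A minimal sketch with the same behavior:

def read_doc(doc_file):
    # A file object is already an iterator over its lines,
    # so the generator can delegate to it directly.
    with open(doc_file) as f:
        yield from f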

2. Returning the whole list

from collections import defaultdict
import time

def create_word_cnt_dict(line_list):
    doc_vector = defaultdict(int)
    for line in line_list:
        for word in line.split():
            doc_vector[word] += 1
    return dict(doc_vector)

def read_doc1(doc_file):
    # Read every line into memory at once.
    with open(doc_file) as f:
        return f.readlines()

t0 = time.time()
lines = read_doc1("./doc1.txt")
doc_vector = create_word_cnt_dict(lines)
t1 = time.time()
print(t1-t0)

It takes 3.6890149116516113 seconds.
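Incidentally, the tallying in create_word_cnt_dict can also be done with collections.Counter; this doesn't change the comparison between the two versions, it's just a more compact sketch of the same counting logic:

from collections import Counter

def create_word_cnt_dict(lines):
    # Counter.update tallies every word in the iterable it's given.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)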

As you can see, returning the whole list is much faster.

But with respect to memory usage, returning an iterator is much more efficient than returning the whole list.
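To put a number on that, tracemalloc (available since Python 3.4) can report the peak memory of each version. The following is only a sketch, and the absolute figures depend entirely on the file:

import tracemalloc

tracemalloc.start()
doc_vector = create_word_cnt_dict(read_doc1("./doc1.txt"))  # whole list
_, peak_list = tracemalloc.get_traced_memory()
tracemalloc.stop()

tracemalloc.start()
doc_vector = create_word_cnt_dict(read_doc("./doc1.txt"))   # iterator
_, peak_iter = tracemalloc.get_traced_memory()
tracemalloc.stop()

# peak_list includes every line object at once; peak_iter holds only
# one line at a time (plus the resulting dictionary in both cases).
print(peak_list, peak_iter)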

The book Effective Python recommends returning an iterator for efficient memory usage. But I think time complexity matters more than space complexity these days, because today's computers have plenty of memory.

Please give me some advice.

In this case, I think your interpretation of "much faster" is different from mine. The timing differences are on the order of a few percent, which isn't very much (likely not noticeable to a user unless your program runs for hours, and even then the difference is insignificant).
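A single time.time() measurement is also fairly noisy. If you want to check that the gap really is a few percent, a timeit sketch like the following gives steadier numbers (it assumes the counting function and both read functions are defined in the running script; note, too, that the OS caches the file after the first read, which itself skews repeated runs):

import timeit

setup = "from __main__ import create_word_cnt_dict, read_doc, read_doc1"
# Take the best of several runs to dampen noise.
t_iter = min(timeit.repeat('create_word_cnt_dict(read_doc("./doc1.txt"))',
                           setup=setup, number=1, repeat=5))
t_list = min(timeit.repeat('create_word_cnt_dict(read_doc1("./doc1.txt"))',
                           setup=setup, number=1, repeat=5))
print(t_iter, t_list)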

Couple that with the fact that iterators give you more flexibility. What if you want to stop reading lines once you've processed a certain one? In that case, the iterator could be a factor of 2 or more faster, because you've gained the ability to "short-circuit".
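For instance, with the generator the caller can stop after however many lines it wants and the rest of the file is never read, whereas readlines has already loaded everything before the first line is inspected. A sketch (the 1000-line cutoff is made up for illustration):

from itertools import islice

# Only the first 1000 lines are ever read from disk; read_doc
# stops as soon as islice stops asking for more lines.
doc_vector = create_word_cnt_dict(islice(read_doc("./doc1.txt"), 1000))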

For the short-circuiting reason and for memory, I'd prefer the generator function here.

Also note that your timings might be biased by the fact that you're reading a file. readlines is probably going to be more efficient, because Python can read the file in even larger chunks than it normally would, which means fewer calls to the OS. Many other applications won't have this subtlety...

It depends.

If we are talking about a relatively small amount of data, then the time complexity won't differ either.

Think about a huge amount of data, and I am not talking about GBs or TBs but the much larger data sets that huge companies like Google and Facebook need to handle every day. Do you think that space complexity doesn't count the way time complexity does?

By "space" we are obviously not talking about storage but about RAM.

So your question is quite broad, and it depends on the application, the amount of data you are going to use, and your requirements. For a relatively small dataset, I don't think either time complexity or space complexity will be a huge deal.

The performance difference is actually very slight.

In light of that, a good programmer would choose the generator version because it is robust.

If you slurp the whole file, you are setting a trap. At some point in the future someone (maybe you) will try to pass in 1GB or 10GB, and they will get screwed over and run around cursing "WHY??????"
