Is List badly performing in Python?

I was trying to read data from some huge file and write it back, but I realised that the main cost came from assigning data to a list rather than reading or writing data from/to the file...

    rows = [None] * 1446311
    begin = datetime.datetime.now()
    for i in range( 1446311 ):
       row = csvReader.next()
       rows[i] = row
    print datetime.datetime.now() - begin

The above code takes 18 seconds, but only 5 seconds if I comment out line 5 (rows[i] = row). I have built the list in advance (i.e. reserved the memory), so why is it still so slow? Is there anything I could do to make it faster? I tried row for row in csvReader but it performs worse...

Regards, John

I get similar results, but not quite so dramatic as yours. (Note the use of the timeit module for timing code execution, and note that I've factored out the list creation since it's common to both test cases.)

import csv
from timeit import Timer

def write_csv(f, n):
    """Write n records to the file named f."""
    w = csv.writer(open(f, 'wb'))
    for i in xrange(n):
        w.writerow((i, "squared", "equals", i**2))

def test1(rows, f, n):
    for i, r in enumerate(csv.reader(open(f))):
        rows[i] = r

def test2(rows, f, n):
    for i, r in enumerate(csv.reader(open(f))):
        pass

def test(t): 
    return (Timer('test%d(rows, F, N)' % t,
                  'from __main__ import test%d, F, N; rows = [None] * N' % t)
            .timeit(number=1))

>>> N = 1446311
>>> F = "test.csv"
>>> write_csv(F, N)
>>> test(1)
2.2321770191192627
>>> test(2)
1.7048690319061279

Here's my guess as to what is going on. In both tests, the CSV reader reads a record from the file and creates a data structure in memory representing that record.

In test2, where the record is not stored, the data structure gets deleted more or less immediately (on the next iteration of the loop, the row variable is updated, so the reference count of the previous record is decremented and the memory is reclaimed). This makes the memory used for the previous record available for reuse: that memory is already in the computer's virtual memory tables, and probably still in the cache, so it's (relatively) fast.
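
As a small illustration of that reference-counting behaviour (this sketch is mine, not part of the original answer; the exact counts printed can vary by interpreter):

import sys

row = ["a", "b", "c"]       # the "record" created on one iteration
print sys.getrefcount(row)  # typically 2: the name row, plus getrefcount's own argument

rows = [None] * 3
rows[0] = row               # storing it adds a reference, keeping the object alive
print sys.getrefcount(row)  # typically 3

row = ["d", "e", "f"]       # rebinding row drops one reference; without the
                            # rows[0] reference the old record would be freed here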

In test1, where the record is stored, each record has to be allocated in a new region of memory, which has to be provided by the operating system and copied into the cache, so it's (relatively) slow.

So the time is not taken up by list assignment, but by memory allocation.


Here are another couple of tests that illustrate what's going on, without the complicating factor of the csv module. In test3 we create a new 100-element list for each row and store it. In test4 we create a new 100-element list for each row, but we don't store it; we throw it away so that the memory can be reused the next time round the loop.

def test3(rows, f, n):
    for i in xrange(n):
        rows[i] = [i] * 100

def test4(rows, f, n):
    for i in xrange(n):
        temp = [i] * 100
        rows[i] = None

>>> test(3)
9.2103338241577148
>>> test(4)
1.5666921138763428
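
For contrast, here is one more sketch of my own (not part of the answer above; the name test_assign_only is made up): every slot receives a reference to the same pre-built list, so the loop performs only list assignment with no per-row allocation. On a comparable machine it should land much closer to test4's time than test3's.

from timeit import Timer

N = 1446311

def test_assign_only(rows, n):
    # Reuse one pre-built 100-element list: each slot just stores another
    # reference to it, so no new object is allocated inside the loop.
    shared = [0] * 100
    for i in xrange(n):
        rows[i] = shared

print Timer('test_assign_only(rows, N)',
            'from __main__ import test_assign_only, N; rows = [None] * N'
            ).timeit(number=1)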

So I think the lesson is that if you do not need to store all the rows in memory at the same time, don't do that. If you can, read them in one at a time, process them one at a time, and then forget about them so that Python can deallocate them.
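
A rough sketch of that one-row-at-a-time pattern (mine, not from the answer; process_row is a placeholder for whatever work each record actually needs):

import csv

def process_row(row):
    # Placeholder: do whatever per-record work is needed, then return.
    pass

for row in csv.reader(open("test.csv", "rb")):
    process_row(row)  # each row is used and then forgotten, so its memory
                      # can be reclaimed before the next one is read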

EDIT: this first part is not so valid (see comments below)

Did you try something like this:

rows = [None] * 1446311
for i in range( 1446311 ):
   rows[i] = csvReader.next()

Because from what I see in your code, you're copying the data twice: once from the file to memory with row = ..., and once from row to rows[i]. As you have immutable things here (strings), we really are talking about a copy of data, not a copy of references.

Moreover, even if you created an empty list beforehand, you are still putting big pieces of data into it; as you only put None in it at the beginning, no real memory space has been reserved. So maybe you could just as well directly write something very simple like this:

rows = []
for i in range( 1446311 ):
   rows.append(csvReader.next())

or maybe even use the generator syntax directly!

rows = list(csvReader)

EDIT: After reading Gareth's answer, I did some time-testing on my proposals. By the way, take care to put some protection in when reading from an iterator, in order to stop nicely if the iterator is shorter than expected:

>>> from timeit import Timer
>>> import csv
>>> # building some timing framework:
>>> def test(n):
    return min(Timer('test%d(F, N)' % n,
                  'from __main__ import test%d, F, N' % n)
            .repeat(repeat=10, number=1))

>>> F = r"some\big\csvfile.csv"
>>> N = 200000
>>> def test1(file_in, number_of_lines):
    csvReader = csv.reader(open(file_in, 'rb'))
    rows = [None] * number_of_lines
    for i, c in enumerate(csvReader):  # using iterator syntax
        if i >= number_of_lines:  # and limiting the number of lines
            break
        row = c
        rows[i] = row
    return rows

>>> test(1)
0.31833305864660133

>>> def test2(file_in, number_of_lines):
    csvReader = csv.reader(open(file_in, 'rb'))
    rows = [None] * number_of_lines
    for i, c in enumerate(csvReader):
        if i >= number_of_lines:
            break
        row = c
    return rows

>>> test(2)
0.25134269758603978  # remember that only last line is stored!

>>> def test3(file_in, number_of_lines):
    csvReader = csv.reader(open(file_in, 'rb'))
    rows = [None] * number_of_lines
    for i, c in enumerate(csvReader):
        if i >= number_of_lines:
            break
        rows[i] = c
    return rows

>>> test(3)
0.30860502255637812

>>> def test4(file_in, number_of_lines):
    csvReader = csv.reader(open(file_in, 'rb'))
    rows = []
    for i, c in enumerate(csvReader):
        if i >= number_of_lines:
            break
        rows.append(c)
    return rows

>>> test(4)
0.32001576256431008

>>> def test5(file_in, number_of_lines):
    csvReader = csv.reader(open(file_in, 'rb'))
    rows = list(csvReader)  
    # problem: there's no way to limit the number of lines to parse!
    return rows

>>> test(5)
0.30347613834584308

What we can see is that, for an N greater than the number of lines in the document, there is no great difference in timing. test2 is, on my machine, unexpectedly only a little different. test5 is more elegant, but cannot limit the number of lines parsed, which can be annoying.
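
If the only problem with the test5 style is the missing line limit, one option (my suggestion, not benchmarked here) is itertools.islice, which stops the reader after a given number of rows while keeping the compact list(...) form:

import csv
from itertools import islice

def read_some(file_in, number_of_lines):
    csvReader = csv.reader(open(file_in, 'rb'))
    # islice stops the iterator after number_of_lines rows, so the file is
    # only parsed up to that point.
    return list(islice(csvReader, number_of_lines))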

So, if you need all the lines at once, my advice would be to go for the most elegant solution, even if it is a bit longer: test4. But maybe, as Gareth asked, you do not need everything at once, and that is the best way to gain speed and memory.
