64-bit system, 8 GB of RAM, a bit more than 800 MB of CSV, and reading it with Python gives a MemoryError
import csv
from itertools import islice
import numpy as np

f = open("data.csv")
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
The above is the code I am using to read a CSV file. The file is only about 800 MB, and I am using a 64-bit system with 8 GB of RAM. The file contains 100 million lines. However, never mind reading the entire file: even reading just the first 10 million lines gives me a 'MemoryError:' <- this really is the entire error message.

Could someone tell me why, please? Also, as a side question, could someone tell me how to start reading from, say, the 20-millionth row? I know I need to use f.seek(some number), but since my data is a CSV file I don't know exactly which number to pass to f.seek() so that it starts reading exactly at that row.

Thank you very much.
could someone tell me how to read from, say the 20th million row please? I know I need to use f.seek(some number)
No, you can't (and mustn't) use f.seek() in this situation. Rather, you must read each of the first 20 million rows somehow.
The Python documentation has this recipe:
# Requires: import collections; from itertools import islice
def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)
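To see what the recipe does on a small iterator (self-contained copy of consume() included), you can skip the first few items and then keep reading where it left off:

```python
import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)

it = iter(range(10))
consume(it, 7)          # throw away items 0..6 without storing them
print(list(it))         # -> [7, 8, 9]
```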
Using that, you would skip the first 20,000,000 rows like this:
#UNTESTED
import csv
from itertools import islice
import numpy as np

f = open("data.csv")
f_reader = csv.reader(f)
consume(f_reader, 20000000)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
or perhaps this might go faster, since it skips the CSV parsing for the first 20 million lines:
#UNTESTED
import csv
from itertools import islice
import numpy as np

f = open("data.csv")
consume(f, 20000000)
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
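As a sketch of a lower-overhead alternative (this assumes pandas is available and that the CSV holds only integers): pandas parses in C and goes straight to a NumPy array, avoiding the huge intermediate Python list of lists that is the likely cause of the MemoryError. The file below is a small hypothetical stand-in for data.csv so the example is self-contained; scale skiprows/nrows up to 20000000 and 10000000 for the real file.

```python
import csv
import tempfile
import pandas as pd

# Build a tiny stand-in for data.csv: 100 rows of two integers each.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    writer = csv.writer(tmp)
    for i in range(100):
        writer.writerow([i, i * 2])
    path = tmp.name

# Skip the first 20 rows, then read the next 10, directly into an array.
chunk = pd.read_csv(path, header=None, skiprows=20, nrows=10, dtype=int)
raw_data = chunk.to_numpy()
print(raw_data[0])  # -> [20 40]
```

Even reading the full file this way may fit in 8 GB, since 100 million integer rows land in a compact numeric array rather than per-row Python objects.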