64-bit system, 8 GB of RAM, a bit more than 800 MB of CSV, and reading it with Python gives a MemoryError
import csv
from itertools import islice
import numpy as np

f = open("data.csv")
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
The above is the code I am using to read a CSV file. The file is only about 800 MB, and I am using a 64-bit system with 8 GB of RAM. The file contains 100 million lines. However, never mind reading the entire file: even reading just the first 10 million lines gives me a 'MemoryError:' <- this really is the entire error message.

Could someone tell me why, please? Also, as a side question, could someone tell me how to start reading from, say, the 20-millionth row? I know I need to use f.seek(some number), but since my data is a CSV file I don't know exactly which number to pass to f.seek() so that it starts reading exactly at that row.

Thank you very much.
could someone tell me how to read from, say the 20th million row please? I know I need to use f.seek(some number)
No, you can't (and mustn't) use f.seek() in this situation. Rather, you must read each of the first 20 million rows somehow.
The Python documentation has this recipe:
# Requires: import collections; from itertools import islice
def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)
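To see what the recipe does on a small iterator (self-contained copy of consume() included), you can skip the first few items and then keep reading where it left off:

```python
import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)

it = iter(range(10))
consume(it, 7)          # throw away items 0..6 without storing them
print(list(it))         # -> [7, 8, 9]
```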
Using that, you would skip the first 20,000,000 rows like this:
#UNTESTED
import csv
from itertools import islice
import numpy as np

f = open("data.csv")
f_reader = csv.reader(f)
consume(f_reader, 20000000)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
or perhaps this might go faster, since it skips the CSV parsing for the first 20 million lines:
#UNTESTED
import csv
from itertools import islice
import numpy as np

f = open("data.csv")
consume(f, 20000000)
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
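As a sketch of a lower-overhead alternative (this assumes pandas is available and that the CSV holds only integers): pandas parses in C and goes straight to a NumPy array, avoiding the huge intermediate Python list of lists that is the likely cause of the MemoryError. The file below is a small hypothetical stand-in for data.csv so the example is self-contained; scale skiprows/nrows up to 20000000 and 10000000 for the real file.

```python
import csv
import tempfile
import pandas as pd

# Build a tiny stand-in for data.csv: 100 rows of two integers each.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    writer = csv.writer(tmp)
    for i in range(100):
        writer.writerow([i, i * 2])
    path = tmp.name

# Skip the first 20 rows, then read the next 10, directly into an array.
chunk = pd.read_csv(path, header=None, skiprows=20, nrows=10, dtype=int)
raw_data = chunk.to_numpy()
print(raw_data[0])  # -> [20 40]
```

Even reading the full file this way may fit in 8 GB, since 100 million integer rows land in a compact numeric array rather than per-row Python objects.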