Why can't a 6 GB CSV file be read entirely into memory (64 GB) with NumPy?
I have a CSV file whose size is 6.8 GB, and I am not able to read it into a NumPy array in memory, although I have 64 GB of RAM.
The CSV file has 10 million lines, and each line has 131 records (a mix of int and float).
I tried to read it into a float NumPy array:
import numpy as np
data = np.genfromtxt('./data.csv', delimiter=';')
It failed due to memory.
When I read just one line and check its size:
data = np.genfromtxt('./data.csv', delimiter=';', max_rows=1)
data.nbytes
I get 1048 bytes. So I would expect that 10,000,000 × 1048 ≈ 10.48 GB, which should fit in memory without any problem. Why doesn't it work?
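For reference, the 1048 bytes per row are exactly what the default dtype predicts: genfromtxt parses every field as float64 (8 bytes) unless told otherwise, so a row of 131 columns occupies 131 × 8 bytes:

    import numpy as np

    # 131 columns, each parsed as float64 (8 bytes each) by default
    row_bytes = 131 * np.dtype(np.float64).itemsize
    print(row_bytes)  # 1048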
Finally, I tried to optimize the array in memory by defining the types:
data = np.genfromtxt('./data.csv', delimiter=';', max_rows=1,
dtype="i1,i1,f4,f4,....,i2,f4,f4,f4")
data.nbytes
This way I get only 464 bytes per line, so the whole array would be only about 4.64 GB, but it is still not possible to load it into memory.
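The per-row size of a structured dtype can also be checked without reading the file at all, via its itemsize (a short hypothetical dtype is used here for illustration; the real file has 131 columns):

    import numpy as np

    # Illustrative compact dtype, not the actual 131-column one.
    # NumPy packs structured dtypes by default, so itemsize is the
    # plain sum of the field sizes: 1 + 1 + 4 + 4 + 2 + 4 = 16 bytes.
    dt = np.dtype("i1,i1,f4,f4,i2,f4")
    print(dt.itemsize)  # 16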
Do you have any idea? I need to use this array in Keras. Thank you.
genfromtxt
is regular Python code that converts the data to a NumPy array only as a final step. During this last step, the RAM needs to hold a giant Python list as well as the resulting NumPy array, both at the same time. Maybe you could try numpy.fromfile
, or the Pandas CSV reader. Since you know the type of data per column and the number of lines, you can also preallocate a NumPy array yourself and fill it using a simple for-loop.
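A minimal sketch of the preallocation approach, using the Pandas reader in chunks so the full data never exists as a Python list (assumes a semicolon-separated file with no header; float32 is used here since Keras works with single precision anyway, halving the footprint versus float64):

    import numpy as np
    import pandas as pd

    def load_csv(path, n_rows, n_cols, chunksize=1_000_000):
        """Preallocate the full array, then fill it chunk by chunk."""
        data = np.empty((n_rows, n_cols), dtype=np.float32)
        start = 0
        for chunk in pd.read_csv(path, sep=';', header=None,
                                 dtype=np.float32, chunksize=chunksize):
            data[start:start + len(chunk)] = chunk.to_numpy()
            start += len(chunk)
        return data

For the file in the question this would be called as load_csv('./data.csv', 10_000_000, 131), giving a single ~5.2 GB float32 array that fits comfortably in 64 GB of RAM.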