I have a CSV file that is 6.8 GB in size, and I am not able to read it into memory as a NumPy array even though I have 64 GB of RAM.
The CSV file has 10 million lines, and each line has 131 fields (a mix of int and float).
I tried to read it into a float NumPy array:
import numpy as np
data = np.genfromtxt('./data.csv', delimiter=';')
It failed because it ran out of memory.
When I read just one line and check its size:
data = np.genfromtxt('./data.csv', delimiter=';', max_rows=1)
data.nbytes
I get 1048 bytes, so I would expect 10,000,000 * 1048 bytes = 10.48 GB, which should fit in memory without any problem. Why doesn't it work?
Finally, I tried to reduce the array's memory footprint by specifying the column types:
data = np.genfromtxt('./data.csv', delimiter=';', max_rows=1,
dtype="i1,i1,f4,f4,....,i2,f4,f4,f4")
data.nbytes
That gives only 464 bytes per line, so about 4.64 GB in total, but it still cannot be loaded into memory.
Do you have any idea? I need to use this array in Keras. Thank you.
genfromtxt is regular Python code that converts the data to a NumPy array only as a final step. During this last step, RAM has to hold a giant Python list as well as the resulting NumPy array, both at the same time. You could try numpy.fromfile or the pandas CSV reader. Since you know the data type of each column and the number of lines, you could also preallocate a NumPy array yourself and fill it using a simple for-loop.
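The preallocation idea can be sketched roughly as follows. This is only a minimal illustration, not a tuned loader: the helper name load_csv_preallocated is hypothetical, and it assumes the file has no header row, uses ';' as the delimiter, and that every field can be parsed as a float. With float32, 10,000,000 x 131 values take about 5.24 GB, and only one text line is held in memory next to the array at any time.

```python
import numpy as np


def load_csv_preallocated(path, n_rows, n_cols, delimiter=";", dtype=np.float32):
    """Hypothetical helper: fill a preallocated array one line at a time."""
    # Allocate the full result once, up front.
    data = np.empty((n_rows, n_cols), dtype=dtype)
    row = 0
    with open(path, "r") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            if row >= n_rows:
                break  # ignore anything beyond the expected row count
            # Assumption: no header row and every field parses as a float.
            data[row] = [float(x) for x in line.split(delimiter)]
            row += 1
    # Return only the rows that were actually filled.
    return data[:row]
```

For the file in the question, the call would look like load_csv_preallocated('./data.csv', 10_000_000, 131). A pure Python loop like this is slow, so it trades speed for a predictable memory footprint.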