
Why is it not possible to read a 6 GB CSV file whole into memory (64 GB) with numpy?

I have a CSV file whose size is 6.8 GB, and I am not able to read it into a numpy array in memory, although I have 64 GB of RAM.

The CSV file has 10 million lines, and each line has 131 records (a mix of int and float).

I tried to read it into a float numpy array:

import numpy as np
data = np.genfromtxt('./data.csv', delimiter=';')

It failed due to running out of memory.

When I read just one line and check its size:

data = np.genfromtxt('./data.csv', delimiter=';', max_rows=1)
data.nbytes

I get 1048 bytes. So I would expect 10,000,000 × 1048 bytes = 10.48 GB, which should fit in memory without any problem. Why doesn't it work?
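For reference, 1048 bytes per row is exactly what genfromtxt's default float64 dtype gives: 131 columns × 8 bytes = 1048 bytes, so the raw array alone would indeed be about 10.48 GB. A minimal check of that arithmetic:

n_rows, n_cols = 10_000_000, 131
bytes_per_row = n_cols * 8              # default dtype is float64, 8 bytes per value
print(bytes_per_row)                    # 1048
print(n_rows * bytes_per_row / 1e9)     # ~10.48 (GB)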

Finally, I tried to optimize the array in memory by specifying the column types:

data = np.genfromtxt('./data.csv', delimiter=';', max_rows=1,
                     dtype="i1,i1,f4,f4,....,i2,f4,f4,f4")
data.nbytes

This gives only 464 bytes per line, so the whole array would be about 4.64 GB, but it is still not possible to load it into memory.
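As a side note, the per-row size for a given column spec can be checked without touching the file at all, by building the structured dtype directly (the field list below is abbreviated, as in the question; the real one has 131 fields):

import numpy as np

# Abbreviated field spec for illustration only; fill in all 131 columns.
row_dtype = np.dtype("i1,i1,f4,f4,i2,f4,f4,f4")
print(row_dtype.itemsize)   # bytes per row for this spec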

Do you have any idea? I need to use this array in Keras. Thank you

genfromtxt is regular Python code that converts the data to a numpy array only as a final step. During that last step, the RAM has to hold a giant Python list as well as the resulting numpy array, both at the same time. Maybe you could try numpy.fromfile, or the pandas CSV reader. Since you know the type of data per column and the number of lines, you could also preallocate a numpy array yourself and fill it with a simple for-loop (see the sketches below).
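Below are two hedged sketches of those suggestions, assuming the layout from the question (10,000,000 rows, 131 ';'-separated numeric columns, no header row). First, reading with the pandas CSV reader and forcing a compact dtype:

import numpy as np
import pandas as pd

# pandas' C parser is far more memory-friendly than genfromtxt;
# float32 halves the footprint compared to the default float64.
df = pd.read_csv('./data.csv', sep=';', header=None, dtype=np.float32)
data = df.to_numpy()        # ~5.2 GB for 10,000,000 x 131 float32 values

And second, preallocating the array yourself and filling it row by row:

import numpy as np

n_rows, n_cols = 10_000_000, 131
data = np.empty((n_rows, n_cols), dtype=np.float32)

with open('./data.csv') as f:
    for i, line in enumerate(f):
        # parse one ';'-separated line and cast it into the preallocated row
        data[i] = np.array(line.split(';'), dtype=np.float32)

Either way, the result is a plain float32 ndarray that can be passed to Keras directly.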
