Loading CSV file to NumPy memmap array uses too much memory

I am trying to load a 4.47GB CSV file into a memory-mapped NumPy array. On a GCP machine with 85GB of RAM, this takes approximately 500 seconds and results in a 1.03GB array.

The problem is that the process consumes up to 26GB of RAM while loading the file into the array. Is there a way to modify the following code so that it consumes less RAM (and, if possible, less time) during loading?

import tempfile, numpy as np

def create_memmap_ndarray_from_csv(csv_file): # load int8 csv file to int8 memory-mapped numpy array

    # count rows without holding the whole file in memory
    # (f.readlines() would build a list of all 4.47GB of lines)
    with open(csv_file, "r") as f:
        rows = sum(1 for _ in f)
    # count columns from the first line
    with open(csv_file, "r") as f:
        cols = len(f.readline().split(','))

    memmap_file = tempfile.NamedTemporaryFile(prefix='ndarray', suffix='.memmap')
    arr_int8_mm = np.memmap(memmap_file, dtype=np.int8, mode='w+', shape=(rows, cols))

    # copy into the memmap rather than rebinding the name; note that loadtxt
    # still materialises the full array in RAM before the copy happens
    arr_int8_mm[:] = np.loadtxt(csv_file, dtype=np.int8, delimiter=',')
    return arr_int8_mm
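
For comparison, here is a minimal sketch of a chunked variant (not from the original post: it assumes pandas is available, and the 100,000-row chunk size is an arbitrary choice). Parsing the CSV in pieces and copying each piece straight into the memmap should keep only one chunk in RAM at a time:

import tempfile, numpy as np, pandas as pd

def create_memmap_ndarray_from_csv_chunked(csv_file, chunk_rows=100_000):
    # count rows and columns as before
    with open(csv_file, "r") as f:
        rows = sum(1 for _ in f)
    with open(csv_file, "r") as f:
        cols = len(f.readline().split(','))

    memmap_file = tempfile.NamedTemporaryFile(prefix='ndarray', suffix='.memmap')
    arr = np.memmap(memmap_file, dtype=np.int8, mode='w+', shape=(rows, cols))

    # let pandas parse the file chunk by chunk; only one chunk is in RAM at a time
    offset = 0
    for chunk in pd.read_csv(csv_file, dtype=np.int8, header=None, chunksize=chunk_rows):
        arr[offset:offset + len(chunk)] = chunk.values
        offset += len(chunk)
    arr.flush()
    return arr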

I have modified the code in light of the comments on the original question. The updated code below uses less memory: 8GB instead of 26GB. The loadtxt or readline-and-split approach reduces memory use even further, but is just too slow (a sketch of the line-by-line variant follows the updated code).

import tempfile, numpy as np, pandas as pd

def create_ndarray_from_csv(csv_file): # load csv file to int8 normal/memmap ndarray

    # pandas parses the CSV much faster than loadtxt, but it still
    # materialises the full array in RAM
    df_int8 = pd.read_csv(csv_file, dtype=np.int8, header=None)
    arr_int8 = df_int8.values
    del df_int8

    # dump the array to a temporary .npy file, then drop the in-memory copy
    memmap_file = tempfile.NamedTemporaryFile(prefix='ndarray-memmap', suffix='.npy')
    np.save(memmap_file.name, arr_int8)
    del arr_int8

    # re-open the saved file as a read-only memory-mapped array
    arr_mm_int8 = np.load(memmap_file.name, mmap_mode='r')
    return arr_mm_int8
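
For reference, a sketch of the readline-and-split variant mentioned above (the names and structure here are my own, reusing the same row/column counting as the first version). It keeps only one parsed row in RAM at a time, but the pure-Python parsing loop is what makes it slow:

import tempfile, numpy as np

def create_memmap_ndarray_from_csv_by_line(csv_file):
    # count rows and columns without holding the file in memory
    with open(csv_file, "r") as f:
        rows = sum(1 for _ in f)
    with open(csv_file, "r") as f:
        cols = len(f.readline().split(','))

    memmap_file = tempfile.NamedTemporaryFile(prefix='ndarray', suffix='.memmap')
    arr = np.memmap(memmap_file, dtype=np.int8, mode='w+', shape=(rows, cols))

    # parse and write one row at a time; int() tolerates the trailing newline
    with open(csv_file, "r") as f:
        for i, line in enumerate(f):
            arr[i] = [int(v) for v in line.split(',')]
    arr.flush()
    return arr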
