Loading a CSV file into a NumPy memmap array uses too much memory

Question

I am trying to load a 4.47GB CSV file into a memory-mapped NumPy array. On a GCP machine with 85GB of RAM, this takes about 500 seconds and produces a 1.03GB array.

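The problem is that loading the file into the array consumes up to 26GB of RAM. Is there a way to modify the following code so that it uses less RAM (and, if possible, less time) during loading?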

import tempfile, numpy as np

def create_memmap_ndarray_from_csv(csv_file): # load int8 csv file to int8 memory-mapped numpy array

    with open(csv_file, "r") as f:
        rows = len(f.readlines())
    with open(csv_file, "r") as f:
        cols = len(f.readline().split(','))

    memmap_file = tempfile.NamedTemporaryFile(prefix='ndarray', suffix='.memmap')
    arr_int8_mm = np.memmap(memmap_file, dtype=np.int8, mode='w+', shape=(rows,cols))

    arr_int8_mm = np.loadtxt(csv_file, dtype=np.int8, delimiter=',')
    return arr_int8_mm

Answer 1

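I have modified the code based on the comments on the original question. The updated code uses less memory: 8GB instead of 26GB. Approaches based on loadtxt, readline, and split reduced memory use further, but were too slow.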

import tempfile, numpy as np, pandas as pd

def create_ndarray_from_csv(csv_file): # load csv file to int8 normal/memmap ndarray

    df_int8 = pd.read_csv(csv_file, dtype=np.int8, header=None)
    arr_int8 = df_int8.values
    del df_int8

    memmap_file = tempfile.NamedTemporaryFile(prefix='ndarray-memmap', suffix='.npy')
    np.save(memmap_file.name, arr_int8)
    del arr_int8

    arr_mm_int8 = np.load(memmap_file.name, mmap_mode='r')
    return arr_mm_int8
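
As a possible middle ground between the two versions above, here is a minimal sketch that streams the CSV into a preallocated np.memmap in chunks, so the whole table is never held in RAM at once. It is not the code from the question or the answer. The function name csv_to_memmap, the memmap_path argument, and the default chunk_rows value are hypothetical, and the sketch assumes a non-empty, header-less CSV of int8 values with the same number of columns on every row. Note that in the question's code the name arr_int8_mm is rebound to the result of np.loadtxt, so the memmap created on the previous line is never filled; the sketch writes into the memmap by slice assignment instead. Memory use and speed have not been measured here.

```python
import numpy as np
import pandas as pd


def csv_to_memmap(csv_file, memmap_path, chunk_rows=100_000):
    """Stream a header-less int8 CSV into an on-disk np.memmap, chunk by chunk."""
    # First pass: count columns and non-blank rows while iterating lazily,
    # instead of materialising every line with readlines().
    with open(csv_file, "r") as f:
        cols = len(f.readline().split(","))
        f.seek(0)
        rows = sum(1 for line in f if line.strip())

    # Preallocate the on-disk array; mode="w+" creates or overwrites the file.
    arr = np.memmap(memmap_path, dtype=np.int8, mode="w+", shape=(rows, cols))

    # Second pass: parse a bounded number of rows at a time and copy each
    # chunk into its slice of the memmap (assumed: values fit in int8).
    start = 0
    for chunk in pd.read_csv(csv_file, dtype=np.int8, header=None,
                             chunksize=chunk_rows):
        stop = start + len(chunk)
        arr[start:stop] = chunk.to_numpy()
        start = stop

    arr.flush()  # push pending writes to the backing file
    return arr
```

A raw memmap file stores no shape or dtype header, so to reopen it later the caller has to pass the same values again, for example np.memmap(memmap_path, dtype=np.int8, mode='r', shape=(rows, cols)). A smaller chunk_rows should lower peak memory at the cost of more parsing calls.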

Answer 1
0 2019-10-01 17:34:14
