How can I build a NumPy array much larger than my RAM from thousands of CSV files?

Question

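I have 1000 CSV files that I want to append together into one large numpy array. The problem is that the array would be much larger than my RAM. Is there a way to write it to disk a bit at a time without holding the whole array in RAM?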

Also, is there a way to read only a specific part of the array from disk at a time?

Answer 1

When working with numpy and large arrays, there are several approaches, depending on what you need to do with the data.

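The simplest answer is to use less data. If your data has many repeated elements (typically zeros), you can often use a sparse array from scipy, since the two libraries are tightly integrated.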
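To make the sparse-array option concrete, here is a minimal sketch. It assumes the data is mostly zeros and that you can collect the non-zero entries as (row, column, value) triples; the shape and values below are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical non-zero entries of a 1000 x 1000 table that is otherwise all zeros.
rows = np.array([0, 10, 500])
cols = np.array([3, 7, 999])
vals = np.array([1.5, 2.0, -4.0])

# Build the matrix from coordinates, so the dense version never exists in RAM.
csr = sparse.coo_matrix((vals, (rows, cols)), shape=(1000, 1000)).tocsr()

# Compare against what a dense float64 array of the same shape would need.
dense_bytes = 1000 * 1000 * np.dtype(np.float64).itemsize
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(dense_bytes, sparse_bytes)
```

This only pays off when the fraction of non-zero entries is small; for dense numeric data the other approaches are a better fit.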

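Another answer (IMO the correct solution to your problem) is to use a memory mapped array. This lets numpy load only the necessary parts into RAM when needed and leave the rest on disk. The file containing the data can be a simple binary file created with any number of methods, but the built-in python module that can handle this is struct. Appending more data is as simple as opening the file in append mode and writing more bytes. Make sure to re-create any references to the memory-mapped array whenever more data is written to the file, so that the information is up to date.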

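Finally, there is compression. Numpy can compress arrays with savez_compressed, and the result can then be opened with numpy.load. Importantly, compressed numpy files cannot be memory-mapped; each array you read from them must be loaded into memory in full. Loading one column at a time may keep you under the threshold, but the same trick can be applied to the other methods to reduce memory usage. Numpy's built-in compression only saves disk space, not memory. There may be other libraries that do some kind of streaming compression, but that is outside the scope of my answer.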
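As a rough sketch of the "one column at a time" idea with compressed files: an .npz archive opened with numpy.load only decompresses an array when you index it by name, so storing each column under its own key lets you pull them in individually. The file name and the keys col_a and col_b are made up for illustration.

```python
import os
import tempfile
import numpy as np

# Hypothetical archive with one array per column.
path = os.path.join(tempfile.mkdtemp(), "columns.npz")
np.savez_compressed(
    path,
    col_a=np.arange(1000, dtype=np.float64),
    col_b=np.ones(1000, dtype=np.float64),
)

# Each archive[name] access decompresses that one array in full;
# the other arrays stay compressed on disk until requested.
with np.load(path) as archive:
    names = list(archive.files)
    mean_a = archive["col_a"].mean()
    sum_b = archive["col_b"].sum()

print(names, mean_a, sum_b)
```

Each column still has to fit in RAM on its own, so this helps only when a single column is small enough.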

Here is an example of putting binary data into a file and then opening it as a memory-mapped array:

import numpy as np

# open a file for the data of a single column
with open('column_data.dat', 'wb') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        # represents one column of data
        # (np.float was removed in NumPy 1.24; use np.float64 explicitly)
        csv_data = np.random.rand(1024).astype(np.float64)
        f.write(csv_data.tobytes())

# open the array as a memory-mapped file (default mode 'r+' allows writing)
column_mmap = np.memmap('column_data.dat', dtype=np.float64)

# read some data
print(np.mean(column_mmap[0:1024]))

# write some data
column_mmap[0:512] = .5

# Flush pending changes to disk, then drop the reference. Deleting the object
#   releases the memory map. del isn't strictly needed, as python will garbage
#   collect objects that are no longer accessible. If, for example, you intend
#   to read the entire array, you will need to periodically make sure the array
#   gets deleted and re-created, or the entire thing will end up in memory again.
#   This could be done with a function that loads and operates on part of the
#   array: when the function returns and the memory-mapped array local to it
#   goes out of scope, it will be garbage collected. Calling such a function
#   would not cause a build-up of memory usage.
column_mmap.flush()
del column_mmap

# write some more data to the file (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # one column of data
        f.write(csv_data.tobytes())
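The example above fakes the CSV data with random numbers. As a sketch of how it might look with real files, the following parses each CSV on its own, appends the rows to one binary file, and then maps the result as a 2-D array so that only a slice is read. It assumes every CSV is purely numeric, has no header, and has the same number of columns; N_COLS, append_csv_to_binary and the file names are hypothetical.

```python
import os
import tempfile
import numpy as np

N_COLS = 4  # assumption: every CSV has exactly this many numeric columns


def append_csv_to_binary(csv_path, bin_path, dtype=np.float64):
    """Parse one CSV and append its rows (row-major) to a raw binary file."""
    rows = np.loadtxt(csv_path, delimiter=",", dtype=dtype, ndmin=2)
    with open(bin_path, "ab") as f:
        f.write(rows.tobytes())
    return rows.shape[0]


workdir = tempfile.mkdtemp()
bin_path = os.path.join(workdir, "all_rows.dat")

# Stand-in for the real CSV files: three small files filled with 0.0, 1.0, 2.0.
total_rows = 0
for i in range(3):
    csv_path = os.path.join(workdir, f"part_{i}.csv")
    np.savetxt(csv_path, np.full((5, N_COLS), float(i)), delimiter=",")
    # Only one CSV is held in RAM at a time.
    total_rows += append_csv_to_binary(csv_path, bin_path)

# Map the whole file read-only, then copy just the rows that are needed.
data = np.memmap(bin_path, dtype=np.float64, mode="r", shape=(total_rows, N_COLS))
block = np.array(data[5:10])  # rows that came from the second CSV
print(block.shape, block.mean())
del data
```

The raw file stores no shape or dtype, so you have to keep track of the row count and column count yourself, as total_rows and N_COLS do here.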
