
Running sklearn PCA on MNIST data gives memory allocation error

I am trying to run PCA on the MNIST data (just messing around with it, trying to learn some ML), but I get a memory allocation error that seems far too small for my machine. I have tried two slightly different pieces of code; the following is copied from this site: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60 (I managed to run PCA on the Iris dataset with absolutely no problem).

However, when I run the following:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')

from sklearn.model_selection import train_test_split
# test_size: what proportion of original data is used for test set
train_img, test_img, train_lbl, test_lbl = train_test_split( mnist.data, mnist.target, test_size=1/7.0, random_state=0)


from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(train_img)
# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)


from sklearn.decomposition import PCA
# Make an instance of the Model
pca = PCA(.95)


pca.fit(train_img)

I get this error:

Traceback (most recent call last):
  File "C:\...\Python\pca_mnist_new.py", line 12, in <module>
    scaler.fit(train_img)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\preprocessing\_data.py", line 667, in fit
    return self.partial_fit(X, y)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\preprocessing\_data.py", line 762, in partial_fit
    _incremental_mean_and_var(X, self.mean_, self.var_,
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\extmath.py", line 765, in _incremental_mean_and_var
    new_sum = _safe_accumulator_op(np.nansum, X, axis=0)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\extmath.py", line 711, in _safe_accumulator_op
    result = op(x, *args, **kwargs)
  File "<__array_function__ internals>", line 5, in nansum
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\numpy\lib\nanfunctions.py", line 649, in nansum
    a, mask = _replace_nan(a, 0)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\numpy\lib\nanfunctions.py", line 109, in _replace_nan
    a = np.array(a, subok=True, copy=True)
MemoryError: Unable to allocate 359. MiB for an array with shape (60000, 784) and data type float64
[Finished in 29.868s]

(When I run the code I made earlier with the already-loaded data, I get a similar error with a slightly different preamble:

Traceback (most recent call last):
  File "C:\...\Python\pca_MNIST.py", line 36, in <module>
    pca.fit(x)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\decomposition\_pca.py", line 351, in fit
    self._fit(X)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\decomposition\_pca.py", line 423, in _fit
    return self._fit_full(X, n_components)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\decomposition\_pca.py", line 454, in _fit_full
    U, S, V = linalg.svd(X, full_matrices=False)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\scipy\linalg\decomp_svd.py", line 128, in svd
    u, s, v, info = gesXd(a1, compute_uv=compute_uv, lwork=lwork,
MemoryError: Unable to allocate 359. MiB for an array with shape (60000, 784) and data type float64
[Finished in 2.792s]

but both end with exactly the same error at the bottom.)

I am on Windows 10, running this code in Atom, but I get the same error when I run it from the command line with everything else closed. I have 16 GB of RAM.

I know a MiB is a mebibyte, and 359 of them seems far too small to trigger an allocation error on a machine with 16 GB of RAM, but that is where my limited expertise and frustrated googling leave me stuck.
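(For what it is worth, 359 MiB is exactly the size of one float64 copy of the training matrix: 60000 × 784 × 8 bytes ≈ 358.9 MiB, which suggests the error is raised when NumPy tries to allocate a single additional copy of the array, not when total usage approaches 16 GB.)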

I can see from here: https://stackoverflow.com/questions/44508254/increasing-memory-limit-in-python that Python simply allocates as much memory as it can until there is none left.

Is it possible that the PCA function is already using all of that memory, and this error is just the array that broke the camel's back? My gut says no, but at this point I am well out of my depth.

Is there any way I can get this working so I can play around with some lower-dimensional data? Or will I have to take a detour and write something by hand to do this?

One easy workaround you should definitely try is lowering the floating-point precision. float64 seems like overkill; even neural networks do not use that much precision.

import numpy as np

train_img = train_img.astype(np.float32)  # or even np.float16

Do the same for test_img.
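
For completeness, here is a minimal sketch of how this workaround could slot into the pipeline from the question. It reuses the train_img/test_img variables produced by train_test_split above; how much memory you actually save depends on your scikit-learn version and whether it keeps the data in float32 internally.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Cast to single precision before any further processing,
# halving the memory footprint of each copy of the data.
train_img = train_img.astype(np.float32)
test_img = test_img.astype(np.float32)

# Standardise using statistics computed on the training set only.
scaler = StandardScaler()
train_img = scaler.fit_transform(train_img)
test_img = scaler.transform(test_img)

# Keep enough components to explain 95% of the variance.
pca = PCA(0.95)
train_img = pca.fit_transform(train_img)
test_img = pca.transform(test_img)

print(pca.n_components_)  # how many of the 784 dimensions were kept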
