Python MemoryError when fitting a scikit-learn model

Question

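For a research project, I am analysing correlations using various machine learning algorithms. To do so, I run the following code (simplified for demonstration):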

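import pandas as pd
from scipy.stats import pearsonr
from sklearn.model_selection import cross_val_score
from tqdm import tqdm

# Make a custom scorer for Pearson's r (from scipy)
scorer = lambda regressor, X, y: pearsonr(regressor.predict(X), y)[0]

# Create a progress bar (288 datasets x 50 pipelines = 14400 iterations)
progress_bar = tqdm(total=14400)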

# Initialize a dataframe to store scores
df = pd.DataFrame(columns=["data", "pipeline", "r"])

# Loop over datasets
for data in datasets: #288 datasets
    X_train = data.X_train
    X_test = data.X_test
    y_train = data.y_train
    y_test = data.y_test
    
    # Loop over pipelines
    for pipeline in pipelines: #50 pipelines
        scores = cross_val_score(pipeline, X_train, y_train, cv=int(len(X_train)/3), scoring=scorer)
        r = scores.mean()
        # Create a new row to save data
        df.loc[(df.last_valid_index() or 0) + 1] = {"data": data.name, "pipeline": pipeline, "r": r}
        progress_bar.update(1)

progress_bar.close()

X_train is a pandas DataFrame of shape (20, 34)

X_test is a pandas DataFrame of shape (9, 34)

y_train is a pandas Series of length 20

y_test is a pandas Series of length 9

An example of a pipeline is:

Pipeline(steps=[('scaler', StandardScaler()),
                ('poly', PolynomialFeatures(degree=9)),
                ('regressor', LinearRegression())])

However, after about 8700 iterations (in total), I get the following MemoryError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-9ff48105b8ff> in <module>
     40                 y = targets[label]
     41                 #Finally, we can test the correlation
---> 42                 scores = cross_val_score(regressor, X_train, y.loc[train_indices], cv=int(len(X_train)/3), scoring=lambda regressor, X, y: pearsonr(regressor.predict(X), y)[0]) #Three samples per test set, as that seems like the logical minimum for Pearson
     43                 r = scores.mean()
     44 #                     print(f"{regressor} was able to predict {label} based on the {band} band of the {network} network with a Pearson's r of {r} of the data that could be explained.\n")

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
    513     scorer = check_scoring(estimator, scoring=scoring)
    514 
--> 515     cv_results = cross_validate(
    516         estimator=estimator,
    517         X=X,

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    283     )
    284 
--> 285     _warn_or_raise_about_fit_failures(results, error_score)
    286 
    287     # For callabe scoring, the return type is only know after calling. If the

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _warn_or_raise_about_fit_failures(results, error_score)
    365                 f"Below are more details about the failures:\n{fit_errors_summary}"
    366             )
--> 367             raise ValueError(all_fits_failed_message)
    368 
    369         else:

ValueError: 
All the 6 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 382, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 692, in fit
    X, y, X_offset, y_offset, X_scale = _preprocess_data(
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 262, in _preprocess_data
    X = check_array(X, copy=copy, accept_sparse=["csr", "csc"], dtype=FLOAT_DTYPES)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 925, in check_array
    array = np.array(array, dtype=dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 41.8 GiB for an array with shape (16, 350343565) and data type float64

--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 382, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 692, in fit
    X, y, X_offset, y_offset, X_scale = _preprocess_data(
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 262, in _preprocess_data
    X = check_array(X, copy=copy, accept_sparse=["csr", "csc"], dtype=FLOAT_DTYPES)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 925, in check_array
    array = np.array(array, dtype=dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 44.4 GiB for an array with shape (17, 350343565) and data type float64

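What can I do to prevent this error, and how does it come about? I tried using sklearn's clone function on the pipeline I had in memory and then calling fit, but I got the same error. However, when I create a new pipeline (still in the same session) and call fit on it, it does work.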

Answer 1

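The issue is the huge basis expansion you are doing. Adding polynomial features of degree 9 to 34 features results in C(34 + 9, 9) = 563,921,995 features. It's no wonder you run out of memory, even though you only have a small number of samples.

Just look at what degree-2 PolynomialFeatures gives you for 4 features: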

>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures

>>> arr = np.random.random(size=(10, 4))
>>> poly = PolynomialFeatures(degree=2).fit(arr)
>>> # get_feature_names() was removed in newer scikit-learn; use get_feature_names_out()
>>> list(poly.get_feature_names_out())

This results in:

['1',
 'x0',
 'x1',
 'x2',
 'x3',
 'x0^2',
 'x0 x1',
 'x0 x2',
 'x0 x3',
 'x1^2',
 'x1 x2',
 'x1 x3',
 'x2^2',
 'x2 x3',
 'x3^2']

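You are likely well into overfitting territory if you use 52 features on 20 data instances. Even degree-2 polynomials on your data give you 630 features, which is far too many. I would use inspection (e.g. a pair plot), feature importance, and maybe PCA to reduce the dimensionality, and leave the basis expansion until you know where things are heading.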
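As a rough illustration of the "reduce first, expand later" idea, here is a minimal sketch that puts a PCA step in front of a low-degree expansion. The random data, the choice of 3 components, and degree 2 are placeholder assumptions for illustration, not values taken from the question:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Placeholder data with the same shape as X_train / y_train in the question
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 34))
y = rng.normal(size=20)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=3)),            # assumption: keep 3 components
    ("poly", PolynomialFeatures(degree=2)),  # assumption: a modest degree
    ("regressor", LinearRegression()),
])
pipe.fit(X, y)

# Number of columns that actually reach the regressor
n_expanded = pipe.named_steps["poly"].n_output_features_
print(n_expanded)  # 10
```

With 3 components and degree 2, the regressor sees 10 columns instead of 630, which is at least below the number of training rows. Whether 3 components keep enough signal is something to check on the real data.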

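For a large number of features and high-degree polynomials, you may not be able to ask sklearn for the list, e.g. in order to count them. You can instead compute the count with scipy's binomial coefficient function. With n input features and degree d, the expansion (including the bias column) has C(n + d, d) columns:

>>> from scipy.special import binom
>>> binom(34 + 9, 9)
563921995.0

If you don't want to include powers of X, only products, you can specify interaction_only=True. This will result in fewer features, but not that many fewer.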
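To relate these counts to the traceback, the sketch below computes the number of expanded columns and the size of the resulting dense float64 array. The helper names `n_poly_features` and `dense_gib` are made up for this example. Under the C(n + d, d) formula, the 350,343,565 columns in the error message correspond to 32 input columns, which suggests that the failing dataset had 32 features rather than 34:

```python
from math import comb


def n_poly_features(n_features, degree, interaction_only=False):
    """Column count of a PolynomialFeatures expansion, bias column included."""
    if interaction_only:
        # Only products of distinct features: choose k of them for k = 0..degree
        return sum(comb(n_features, k) for k in range(degree + 1))
    return comb(n_features + degree, degree)


def dense_gib(n_rows, n_cols, itemsize=8):
    """Size in GiB of a dense array (itemsize=8 bytes for float64)."""
    return n_rows * n_cols * itemsize / 2**30


print(n_poly_features(4, 2))    # 15, the list shown above
print(n_poly_features(34, 2))   # 630
print(n_poly_features(34, 9))   # 563921995
print(n_poly_features(32, 9))   # 350343565, as in the traceback
print(n_poly_features(34, 9, interaction_only=True))  # 77663192

# 16 training rows x 350343565 columns of float64
print(round(dense_gib(16, n_poly_features(32, 9)), 1))  # 41.8
```

Running such a check on each pipeline before fitting makes it possible to skip the ones whose expansion cannot fit in RAM.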

Answer 2

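A MemoryError means that the Python interpreter has run out of RAM and swap space to allocate new memory. The usual solutions are: 1) use a smaller dataset, 2) get a computer with more RAM, 3) check that your code does not leak memory.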
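For point 3, one possible starting point is to measure how much memory a single step allocates, using the standard library's tracemalloc module. This is only a sketch: `peak_memory_mib` is a hypothetical helper, and the list allocation stands in for a real call such as fitting one pipeline:

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


# Stand-in workload: a list holding one million references
result, peak = peak_memory_mib(lambda: [0.0] * 1_000_000)
print(f"peak: {peak:.1f} MiB")
```

If the peak grows from one iteration to the next for the same workload, something is holding on to memory. If a single call already needs tens of GiB, as here, the cause is the size of that one allocation rather than a leak.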
