Trying to parallelize parameter search in scikit-learn leads to "SystemError: NULL result without error in PyObject_Call"

I'm using the sklearn.grid_search.RandomizedSearchCV class from scikit-learn 0.14.1, and I get an error when running the following code:

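import scipy.stats
from sklearn import grid_search, preprocessing, svm
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file(inputfile)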

min_max_scaler = preprocessing.MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X.toarray())

parameters = {'kernel':'rbf', 'C':scipy.stats.expon(scale=100), 'gamma':scipy.stats.expon(scale=.1)}

svr = svm.SVC()

classifier = grid_search.RandomizedSearchCV(svr, parameters, n_jobs=8)
classifier.fit(X_scaled, y)

When I set the n_jobs parameter to more than 1, I get the following error output:

Traceback (most recent call last):
  File "./svm_training.py", line 185, in <module>
    main(sys.argv[1:])
  File "./svm_training.py", line 63, in main
    gridsearch(inputfile, kerneltype, parameterfile)
  File "./svm_training.py", line 85, in gridsearch
    classifier.fit(X_scaled, y)
  File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/grid_search.py", line 860, in fit
    return self._fit(X, y, sampled_params)
  File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/grid_search.py", line 493, in _fit
    for parameters in parameter_iterable
  File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/externals/joblib/parallel.py", line 519, in __call__
    self.retrieve()
  File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/externals/joblib/parallel.py", line 419, in retrieve
    self._output.append(job.get())
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
SystemError: NULL result without error in PyObject_Call

It seems to be related to Python's multiprocessing functionality, but I'm not sure how to work around it other than implementing the parallelization of the parameter search by hand. Has anyone run into a similar issue when parallelizing the randomized parameter search and managed to solve it?

It turns out the problem was the use of MinMaxScaler. Since MinMaxScaler only accepts dense arrays, I was converting the sparse representation of the feature vectors to a dense array before scaling. Since each feature vector has thousands of elements, my assumption is that the dense arrays caused a memory error when trying to parallelize the parameter search. Instead, I switched to StandardScaler, which accepts sparse matrices as input and should be a better fit for my problem space anyway.
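
For anyone landing here later, a minimal sketch of the sparse-friendly version follows. It rests on a few assumptions that are not in my original script: it targets a recent scikit-learn, where RandomizedSearchCV lives in sklearn.model_selection rather than sklearn.grid_search; it uses a synthetic random sparse matrix as a stand-in for load_svmlight_file(inputfile); and run_search, n_iter=5, cv=3 and n_jobs=2 are illustrative choices only. Note that StandardScaler has to be created with with_mean=False for sparse input, because centering would make the matrix dense again. The kernel value is also wrapped in a list, since non-distribution entries are sampled by indexing into them.

```python
import numpy as np
import scipy.sparse as sp
import scipy.stats
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def run_search(n_jobs=2):
    """Scale sparse features without densifying them, then search in parallel."""
    # Synthetic stand-in for load_svmlight_file(inputfile) (assumption):
    # 120 samples, 500 features, about 5% non-zero entries, CSR format.
    rng = np.random.RandomState(0)
    X = sp.random(120, 500, density=0.05, format="csr", random_state=rng)
    y = rng.randint(0, 2, size=120)

    # with_mean=False skips centering, so the result stays sparse.
    scaler = StandardScaler(with_mean=False)
    X_scaled = scaler.fit_transform(X)

    param_distributions = {
        "kernel": ["rbf"],  # a list, so the sampler picks the whole string
        "C": scipy.stats.expon(scale=100),
        "gamma": scipy.stats.expon(scale=0.1),
    }

    search = RandomizedSearchCV(
        SVC(),
        param_distributions,
        n_iter=5,        # illustrative; kept small for a quick run
        cv=3,
        n_jobs=n_jobs,   # worker processes receive the sparse matrix
        random_state=0,
    )
    search.fit(X_scaled, y)
    return X_scaled, search


if __name__ == "__main__":
    X_scaled, search = run_search()
    print(sp.issparse(X_scaled), search.best_params_)
```

Because the scaled matrix stays sparse, each worker process gets a much smaller payload than with X.toarray(). Whether that alone avoids the SystemError on a given dataset is still my assumption, not something I have verified in general.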

