Code enters infinite loop when trying to select features
I am trying to use scikit-learn's recursive feature elimination with cross-validation (RFECV) on a (5000, 37) dataset with a binary class problem, and whenever I fit the model the algorithm enters an infinite loop. Currently, I am following this example on how to employ the algorithm: https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html
My data is:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

# 5000 samples, 37 features with very large values
X = np.random.randint(0, 363175645.191632, size=(5000, 37))
# binary target, one label per sample
Y = np.random.choice([0, 1], size=(5000,))
What I tried doing to select the features:
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
rfecv.fit(X, Y)
The code hangs and appears to enter an infinite loop. However, when I try another algorithm such as ExtraTreesClassifier it works just fine. What is going on? Please help.
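For reference, the ExtraTreesClassifier version that does finish looks roughly like this (the exact parameters below are illustrative, not my original code):
from sklearn.ensemble import ExtraTreesClassifier
# Tree-based estimators expose feature_importances_, which RFECV can use
# in place of SVM coefficients, and they are insensitive to feature scale.
trees = ExtraTreesClassifier(n_estimators=100, random_state=42)
rfecv_trees = RFECV(estimator=trees, step=1, cv=StratifiedKFold(2),
                    scoring='accuracy')
rfecv_trees.fit(X, Y)  # finishes without hanging on the same data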
When you perform SVM, because it is distance based, it makes sense to scale your feature variables, especially in your case where they are huge. You can also check out this intro to SVM. Using an example dataset:
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

scaler = StandardScaler()

# 2 informative columns from make_blobs plus 35 columns of huge random noise
X, y = make_blobs(n_samples=5000, centers=3, shuffle=False, random_state=42)
X = np.concatenate((X, np.random.randint(0, 363175645.191632, size=(5000, 35))), axis=1)
y = (y == 1).astype('int')

# standardize every column to zero mean and unit variance
X_scaled = scaler.fit_transform(X)
This dataset has only 2 useful variables, in the first two columns, as you can see from the plot:
plt.scatter(x=X_scaled[:, 0], y=X_scaled[:, 1], c=['k' if i else 'b' for i in y])
Now we run RFE on the scaled data and we can see it returns the first two columns as the top variables:
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),scoring='accuracy')
rfecv.fit(X_scaled, y)
rfecv.ranking_
array([ 1, 2, 17, 28, 33, 22, 23, 26, 6, 19, 20, 4, 10, 25, 3, 27, 11,
8, 18, 5, 29, 14, 7, 21, 9, 13, 24, 30, 35, 31, 32, 34, 16, 36,
37, 12, 15])
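As a short follow-up, once the fit has finished, the selected columns can be read off with the standard RFECV attributes (a sketch continuing the example above):
# how many features RFECV decided to keep
print(rfecv.n_features_)
# boolean mask of the selected columns; the first two should be True
print(rfecv.support_)
# reduce the data to just the selected features
X_selected = rfecv.transform(X_scaled)
print(X_selected.shape)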