Cross-validation in sklearn 0.20+?

Question

I am trying to do cross-validation and I am running into an error that says: 'Found input variables with inconsistent numbers of samples: [18, 1]'

I am using different columns in a pandas DataFrame (df) as the features, with the last column as the label. The data comes from the UC Irvine Machine Learning Repository. When importing the cross-validation package that I have used in the past, I get an error saying that it may have been deprecated. I am going to be running a decision tree, an SVM, and K-NN.

My code is as follows:

feature = [df['age'], df['job'], df['marital'], df['education'], df['default'], df['housing'], df['loan'], df['contact'],
       df['month'], df['day_of_week'], df['campaign'], df['pdays'], df['previous'], df['emp.var.rate'], df['cons.price.idx'],
       df['cons.conf.idx'], df['euribor3m'], df['nr.employed']]
label = [df['y']]

from sklearn.cross_validation import train_test_split
from sklearn.model_selection import cross_val_score
# Model Training 
x = feature[:]
y = label
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

Any help would be great!

Answer 1

The cross_validation module is deprecated (it was removed in sklearn 0.20). The model_selection module has taken its place, so everything you used from cross_validation is now available in model_selection. Your code above then becomes:

feature = [df['age'], df['job'], df['marital'], df['education'], df['default'], df['housing'], df['loan'], df['contact'],
       df['month'], df['day_of_week'], df['campaign'], df['pdays'], df['previous'], df['emp.var.rate'], df['cons.price.idx'],
       df['cons.conf.idx'], df['euribor3m'], df['nr.employed']]
label = [df['y']]

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

As for declaring X and y, why are you wrapping them in a list? sklearn counts the 18 Series in feature as 18 samples and the single Series in label as 1 sample, which is where the [18, 1] in the error message comes from. Just use them like this:

feature = df[['age', 'job', 'marital', 'education', 'default', 'housing', 
              'loan', 'contact', 'month', 'day_of_week', 'campaign', 
              'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 
              'cons.conf.idx', 'euribor3m', 'nr.employed']]
label = df['y']

You can then use the rest of your code without changing anything:

# Model Training 
x = feature[:]
y = label
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
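To make the difference concrete, here is a minimal sketch that reproduces the error and then the fix. The toy DataFrame and its values are made up for illustration (only a few column names are borrowed from the question), and `random_state=0` is added just to make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the bank data:
# 6 rows, 3 feature columns, 1 label column.
df = pd.DataFrame({
    "age": [30, 41, 25, 52, 38, 47],
    "campaign": [1, 2, 1, 3, 2, 1],
    "previous": [0, 1, 0, 2, 0, 1],
    "y": ["no", "yes", "no", "yes", "no", "no"],
})

# Wrapping columns in a list: sklearn treats each list item as one sample.
bad_x = [df["age"], df["campaign"], df["previous"]]  # list of length 3
bad_y = [df["y"]]                                    # list of length 1
try:
    train_test_split(bad_x, bad_y, test_size=0.5)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Selecting columns from the DataFrame keeps one row per sample.
x = df[["age", "campaign", "previous"]]
y = df["y"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0
)
print(x_train.shape, x_test.shape)
```

With three feature Series in the list, the message reports `[3, 1]`, the same pattern as the `[18, 1]` in the question.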

As for your last question about folds in cross-validation, sklearn has several classes that do this (which one to use depends on the task). Please have a look at:

http://scikit-learn.org/stable/modules/classes.html#splitter-classes

That section lists the fold iterators. And remember, all of this lives in the model_selection package.
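As a rough sketch of how one of those splitters could be combined with cross_val_score for the three models mentioned in the question: the data below is synthetic (generated with make_classification), and the choice of StratifiedKFold with 5 folds and the default model settings are assumptions for illustration, not something taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 samples, 8 numeric features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# One splitter shared by all models, so each is scored on the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    # For classifiers, cross_val_score returns one accuracy value per fold.
    scores[name] = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy {scores[name].mean():.3f}")
```

Note that the real bank data contains string columns (job, marital, and so on), which would need to be encoded as numbers before these models can be fitted; that step is not shown here.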

Answer 2

The items in your feature list are pandas Series. You don't need to put each feature into a list as you have done; you just need to pass them all as a single "table".

For example, this looks like the bank dataset, so:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('bank.csv', sep=';')
#df.shape
#(4521, 17)
#df.columns
#Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
#       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
#       'previous', 'poutcome', 'y'],
#      dtype='object')

x = df.iloc[:, :-1]
y = df.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

This should work. The only thing to note here is that x is a DataFrame with 16 columns, but its underlying data is a numpy ndarray: not a list of Series but a single "matrix".
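A quick way to check what sklearn will see is to look at the shapes. The sketch below uses a small made-up frame rather than bank.csv, so the column names and values are placeholders.

```python
import pandas as pd

# Hypothetical toy frame: 4 rows, 2 feature columns, label in the last column.
df = pd.DataFrame({
    "age": [30, 41, 25, 52],
    "campaign": [1, 2, 1, 3],
    "y": [0, 1, 0, 1],
})

x = df.iloc[:, :-1]  # every column except the last
y = df.iloc[:, -1]   # the last column only

print(x.shape)            # rows are samples, columns are features
print(y.shape)            # one label per row
print(x.to_numpy().ndim)  # the DataFrame converts to a single 2-D array

# A list of Series, by contrast, has one entry per feature, not per sample.
as_list = [df["age"], df["campaign"]]
print(len(as_list), len(x))
```

The length of x matches the number of rows, while the length of the list matches the number of features, which is why the list form confuses train_test_split.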

Answer 1: score 5, accepted, 2017-11-14 02:10:28

Answer 2: score 1, 2017-11-13 21:04:09
