
Logistic regression sklearn - train and apply model

I'm new to machine learning and trying Sklearn for the first time. I have two dataframes, one with data to train a logistic regression model (with 10-fold cross-validation) and another one to predict classes ('0,1') using that model. Here's my code so far, put together from bits of tutorials I found in the Sklearn docs and on the web:

import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics


# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)

# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')

# Scores
df_data = df.iloc[:, :-1].values

# Target
df_target = df.iloc[:, -1].values

# Values to predict
df_test = df_pred.iloc[:, :-1].values

# Scores' names
df_data_names = cols.values

# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target

# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator

# Logistic regression normalizing variables
LogReg = LogisticRegression()

# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print(scores)

# Predict new
novel = LogReg.predict(X_pred)

Is this the correct way to implement a logistic regression? I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension, I don't really know whether my model was "fitted" and can be used to make predictions.

In general things are okay, but there are some problems.

  1. Scaling

     X, X_pred, y = scale(df_data), scale(df_test), df_target 

You scale the training and test data independently, which isn't correct. Both datasets must be scaled with the same scaler. scale is a simple function, but it is better to use something else, for example StandardScaler:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df_data)                 # learn mean and std from the training data only
X = scaler.transform(df_data)       # transform the training data
X_pred = scaler.transform(df_test)  # transform the new data with the same statistics
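
If you want to go one step further, a Pipeline keeps the scaler and the classifier together, so the scaler is only ever fitted on the data you train on. A minimal sketch, assuming the df_data, df_target and df_test arrays defined in the question:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# make_pipeline chains the scaler and the classifier; fit() scales the
# training data and fits the model, predict() reuses the same scaling.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(df_data, df_target)
novel = pipe.predict(df_test)
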
  2. Cross-validation and predicting. How does your code work? You split the data 10 times into a train set and a hold-out set; 10 times you fit the model on the train set and compute the score on the hold-out set. This way you get cross-validation scores, but the model is fitted only on a part of the data. So it would be better to fit the model on the whole dataset and then make the prediction (see also the sketch below):

     LogReg.fit(X, y)
     novel = LogReg.predict(X_pred)
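
If you also want the cross-validation scores, cross_val_score (already imported in the question) computes them in one call, and you can then refit on all of the data for the final model. A rough sketch, assuming the scaled X, y and X_pred from above:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogReg, X, y, cv=10)  # accuracy on each of the 10 folds
print(scores.mean())                           # average cross-validated accuracy
LogReg.fit(X, y)                               # refit on the full training data
novel = LogReg.predict(X_pred)                 # predict classes for the new data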

I want to note that there are more advanced techniques like stacking and boosting, but if you are learning sklearn, it is better to stick to the basics.

