
How to merge binary class Logistic Regression with Python

I have a video-games dataset with many categorical columns, all of which I binarized. Now I want to predict one column (called Rating) with Logistic Regression, but that column has itself been binarized into four columns (Rating_Everyone, Rating_Everyone10+, Rating_Teen and Rating_Mature). So I applied Logistic Regression four times; here is my code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df2 = pd.read_csv('../MQPI/docs/Video_Games_Sales_as_at_22_Dec_2016.csv', encoding="utf-8")
y = df2['Rating_Everyone'].values
# Drop all four binarized rating columns so they cannot leak into the features
df2 = df2.drop(['Rating_Everyone', 'Rating_Everyone10', 'Rating_Teen', 'Rating_Mature'], axis=1)
X = df2.values
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)

log_reg = LogisticRegression(penalty='l1', dual=False, C=1.0, fit_intercept=False, intercept_scaling=1,
                             class_weight=None, random_state=None, solver='liblinear', max_iter=100,
                             multi_class='ovr',
                             verbose=0, warm_start=False, n_jobs=-1)
log_reg.fit(Xtrain, ytrain)
y_val_l = log_reg.predict(Xtest)
ris = accuracy_score(ytest, y_val_l)

print("Logistic Regression Rating_Everyone accuracy: ", ris)

And again:

# Reload, since the rating columns were dropped from df2 above
df2 = pd.read_csv('../MQPI/docs/Video_Games_Sales_as_at_22_Dec_2016.csv', encoding="utf-8")
y = df2['Rating_Everyone10'].values
df2 = df2.drop(['Rating_Everyone', 'Rating_Everyone10', 'Rating_Teen', 'Rating_Mature'], axis=1)
X = df2.values
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)

log_reg = LogisticRegression(penalty='l1', dual=False, C=1.0, fit_intercept=False, intercept_scaling=1,
                             class_weight=None, random_state=None, solver='liblinear', max_iter=100,
                             multi_class='ovr',
                             verbose=0, warm_start=False, n_jobs=-1)
log_reg.fit(Xtrain, ytrain)
y_val_l = log_reg.predict(Xtest)
ris = accuracy_score(ytest, y_val_l)

print("Logistic Regression Rating_Everyone10 accuracy: ", ris)

And so on for Rating_Teen and Rating_Mature. Can you tell me how to merge these four results into one, or how I can approach this multiclass Logistic Regression problem better?

The LogisticRegression model can inherently handle multiclass problems:

Below is a summary of the classifiers supported by scikit-learn grouped by strategy; you don't need the meta-estimators in this class if you're using one of these, unless you want custom multiclass behavior:

Inherently multiclass: Naive Bayes, LDA and QDA, Decision Trees, Random Forests, Nearest Neighbors, setting multi_class='multinomial' in sklearn.linear_model.LogisticRegression.

As a basic model, without class weighting (which you may need to add, since the samples may not be balanced over the ratings), set multi_class='multinomial' and change the solver to 'lbfgs' or one of the other solvers that support multiclass problems:

For multiclass problems, only 'newton-cg', 'sag' and 'lbfgs' handle multinomial loss; 'liblinear' is limited to one-versus-rest schemes.

So you don't have to split your dataset up the way you have. Instead, provide the original ratings column as the labels.
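If the original Rating column has already been thrown away, the single label can be recovered from the four binarized columns. Here is a sketch with toy data (column names are taken from the question; it assumes each row has exactly one of the four columns set to 1):

```python
import pandas as pd

# Toy frame standing in for the binarized dataset; the column names come
# from the question, the rows themselves are made up
df2 = pd.DataFrame({
    'Rating_Everyone':   [1, 0, 0],
    'Rating_Everyone10': [0, 1, 0],
    'Rating_Teen':       [0, 0, 0],
    'Rating_Mature':     [0, 0, 1],
})

rating_cols = ['Rating_Everyone', 'Rating_Everyone10', 'Rating_Teen', 'Rating_Mature']

# idxmax picks, per row, the column holding the 1 -> a single label column
y = df2[rating_cols].idxmax(axis=1).str.replace('Rating_', '', regex=False)
print(y.tolist())  # ['Everyone', 'Everyone10', 'Mature']
```

That reconstructed Series can then be passed directly as `y` to a single multinomial model.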

Here is a minimal example:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(10, 10)
y = np.random.randint(1, 4, size=10)  # 3 classes simulating ratings
lg = LogisticRegression(multi_class='multinomial', solver='lbfgs')
lg.fit(X, y)
lg.predict(X)
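The fitted model also exposes per-class probabilities via predict_proba, which is often what "merging the four results" amounts to in practice. A sketch on synthetic data (labels are fixed so all three classes appear; note that recent scikit-learn versions use the multinomial formulation by default with solver='lbfgs', so multi_class is omitted here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(12, 10)
y = np.array([1, 2, 3] * 4)  # three classes, deterministic so each appears

lg = LogisticRegression(solver='lbfgs', max_iter=1000)
lg.fit(X, y)

proba = lg.predict_proba(X)   # one column per class, each row sums to 1
print(proba.shape)            # (12, 3)
print(lg.classes_.tolist())   # [1, 2, 3]
```

Taking the argmax of each probability row reproduces predict, so there is no need to stitch together four separate binary models.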

Edit: responding to a comment.

tl;dr: I expect that the model will learn that interaction on its own. If not, you might encode that information as a feature. So there is no obvious need to binarize your classes.

The way I understand it, you have features of a movie and the MPAA rating for the movie as the label (which you're trying to predict). This is then a multiclass problem, which you can start modeling with logistic regression (this you knew). This is the model I proposed above.

Now, you recognized that there is an implicit distance between classes. The way I would use this information is as a feature for the model. However, I'd first be inclined to see if the model will learn this on its own.
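If the model does not pick the ordering up on its own, one way to make it explicit is an ordered categorical, whose integer codes respect the Everyone < Everyone10 < Teen < Mature scale. A sketch (the category names come from the question; the ordering chosen here is an assumption):

```python
import pandas as pd

ratings = pd.Series(['Everyone', 'Mature', 'Teen', 'Everyone10'])

# An ordered categorical makes the implicit distance between ratings
# explicit; the integer codes can then serve as an ordinal feature
cat = pd.Categorical(ratings,
                     categories=['Everyone', 'Everyone10', 'Teen', 'Mature'],
                     ordered=True)
print(cat.codes.tolist())  # [0, 3, 2, 1]
```

Whether the codes help as a feature is something to check empirically against the plain multinomial baseline.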
