
How to merge binary class Logistic Regression with Python

I have a video-games dataset with many categorical columns, all of which I binarized. Now I want to predict one column (called Rating) with Logistic Regression, but that column has itself been binarized into four columns (Rating_Everyone, Rating_Everyone10+, Rating_Teen and Rating_Mature). So I applied Logistic Regression four times; here is my code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df2 = pd.read_csv('../MQPI/docs/Video_Games_Sales_as_at_22_Dec_2016.csv', encoding="utf-8")
y = df2['Rating_Everyone'].values
# Drop all four binarized rating columns so they cannot leak into the features
df2 = df2.drop(['Rating_Everyone', 'Rating_Everyone10', 'Rating_Teen', 'Rating_Mature'], axis=1)
X = df2.values
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)

log_reg = LogisticRegression(penalty='l1', dual=False, C=1.0, fit_intercept=False, intercept_scaling=1,
                             class_weight=None, random_state=None, solver='liblinear', max_iter=100,
                             multi_class='ovr',
                             verbose=0, warm_start=False, n_jobs=-1)
log_reg.fit(Xtrain, ytrain)
y_val_l = log_reg.predict(Xtest)
ris = accuracy_score(ytest, y_val_l)

print("Logistic Regression Rating_Everyone accuracy: ", ris)

And again:

# Reload, since the rating columns were dropped from df2 above
df2 = pd.read_csv('../MQPI/docs/Video_Games_Sales_as_at_22_Dec_2016.csv', encoding="utf-8")
y = df2['Rating_Everyone10'].values
df2 = df2.drop(['Rating_Everyone', 'Rating_Everyone10', 'Rating_Teen', 'Rating_Mature'], axis=1)
X = df2.values
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)

log_reg = LogisticRegression(penalty='l1', dual=False, C=1.0, fit_intercept=False, intercept_scaling=1,
                             class_weight=None, random_state=None, solver='liblinear', max_iter=100,
                             multi_class='ovr',
                             verbose=0, warm_start=False, n_jobs=-1)
log_reg.fit(Xtrain, ytrain)
y_val_l = log_reg.predict(Xtest)
ris = accuracy_score(ytest, y_val_l)

print("Logistic Regression Rating_Everyone10 accuracy: ", ris)

And so on for Rating_Teen and Rating_Mature. Can you tell me how to merge these four results into one, or how I can approach this multiclass Logistic Regression problem better?

The LogisticRegression model can inherently handle multiclass problems:

Below is a summary of the classifiers supported by scikit-learn grouped by strategy; you don't need the meta-estimators in this class if you're using one of these, unless you want custom multiclass behavior:

Inherently multiclass: Naive Bayes, LDA and QDA, Decision Trees, Random Forests, Nearest Neighbors, setting multi_class='multinomial' in sklearn.linear_model.LogisticRegression.

As a basic model, without class weighting (which you may need to add, since the samples may not be balanced over the ratings), set multi_class='multinomial' and change the solver to 'lbfgs' or one of the other solvers that support multiclass problems:

For multiclass problems, only 'newton-cg', 'sag' and 'lbfgs' handle multinomial loss; 'liblinear' is limited to one-versus-rest schemes.

So you don't have to split your dataset up the way you have. Instead, provide the original ratings column as the labels.
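If the original Rating column has already been thrown away, the single label can be recovered from the four binarized columns. Here is a sketch with toy data (column names are taken from the question; it assumes each row has exactly one of the four columns set to 1):

```python
import pandas as pd

# Toy frame standing in for the binarized dataset; the column names come
# from the question, the rows themselves are made up
df2 = pd.DataFrame({
    'Rating_Everyone':   [1, 0, 0],
    'Rating_Everyone10': [0, 1, 0],
    'Rating_Teen':       [0, 0, 0],
    'Rating_Mature':     [0, 0, 1],
})

rating_cols = ['Rating_Everyone', 'Rating_Everyone10', 'Rating_Teen', 'Rating_Mature']

# idxmax picks, per row, the column holding the 1 -> a single label column
y = df2[rating_cols].idxmax(axis=1).str.replace('Rating_', '', regex=False)
print(y.tolist())  # ['Everyone', 'Everyone10', 'Mature']
```

That reconstructed Series can then be passed directly as `y` to a single multinomial model.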

Here is a minimal example:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(10, 10)
y = np.random.randint(1, 4, size=10)  # 3 classes simulating ratings
lg = LogisticRegression(multi_class='multinomial', solver='lbfgs')
lg.fit(X, y)
lg.predict(X)
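The fitted model also exposes per-class probabilities via predict_proba, which is often what "merging the four results" amounts to in practice. A sketch on synthetic data (labels are fixed so all three classes appear; note that recent scikit-learn versions use the multinomial formulation by default with solver='lbfgs', so multi_class is omitted here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(12, 10)
y = np.array([1, 2, 3] * 4)  # three classes, deterministic so each appears

lg = LogisticRegression(solver='lbfgs', max_iter=1000)
lg.fit(X, y)

proba = lg.predict_proba(X)   # one column per class, each row sums to 1
print(proba.shape)            # (12, 3)
print(lg.classes_.tolist())   # [1, 2, 3]
```

Taking the argmax of each probability row reproduces predict, so there is no need to stitch together four separate binary models.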

Edit: responding to a comment.

tl;dr: I expect that the model will learn that interaction on its own. If not, you might encode that information as a feature. So there is no obvious need to binarize your classes.

The way I understand it, you have features of a movie and the MPAA rating for the movie as the label (which you're trying to predict). This is then a multiclass problem, which you can start modeling with logistic regression (this you knew). This is the model I proposed above.

Now, you recognized that there is an implicit distance between classes. The way I would use this information is as a feature for the model. However, I'd first be inclined to see if the model will learn this on its own.
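If the model does not pick the ordering up on its own, one way to make it explicit is an ordered categorical, whose integer codes respect the Everyone < Everyone10 < Teen < Mature scale. A sketch (the category names come from the question; the ordering chosen here is an assumption):

```python
import pandas as pd

ratings = pd.Series(['Everyone', 'Mature', 'Teen', 'Everyone10'])

# An ordered categorical makes the implicit distance between ratings
# explicit; the integer codes can then serve as an ordinal feature
cat = pd.Categorical(ratings,
                     categories=['Everyone', 'Everyone10', 'Teen', 'Mature'],
                     ordered=True)
print(cat.codes.tolist())  # [0, 3, 2, 1]
```

Whether the codes help as a feature is something to check empirically against the plain multinomial baseline.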
