Logistic Regression in scikit-learn

How do you handle graphs like this:

[image]

using scikit-learn's LogisticRegression model? Is there a way to handle these sorts of problems easily using scikit-learn and a standard X, y input that maps to a graph like this?

If you really want to use logistic regression for this particular setting, a promising approach is to transform your coordinates from the Cartesian system to the polar system. From the visualization, it seems that in that system your data will be (almost) linearly separable.

This can be done as described in "Python conversion between coordinates".
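As a minimal sketch of the polar idea, the snippet below converts each point to (r, theta) and fits logistic regression on the result. The `to_polar` helper and the `make_circles` data are assumptions standing in for the asker's actual data, which is not shown, and the circles are assumed to be centred on the origin.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection


def to_polar(X):
    """Map Cartesian (x, y) rows to polar (r, theta) rows."""
    r = np.hypot(X[:, 0], X[:, 1])        # distance from the origin
    theta = np.arctan2(X[:, 1], X[:, 0])  # angle in radians
    return np.column_stack([r, theta])


# Assumed stand-in for the data in the question: two concentric circles.
X, y = datasets.make_circles(n_samples=200, factor=0.5, noise=0.05,
                             random_state=0)

X_polar = to_polar(X)
clf = linear_model.LogisticRegression()
polar_scores = model_selection.cross_val_score(clf, X_polar, y)
print(polar_scores.mean())
```

In polar coordinates the two rings differ mainly in r, so a single threshold on the radius, which a linear model can express, should be enough to tell them apart. If the real data is not centred on the origin, subtracting the mean first would probably be needed.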

As others said, logistic regression can't handle this kind of data well because it is a linear classifier. You can either transform the data to make it linearly separable, or choose another classifier that is better suited to this kind of data.
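One possible way to "transform the data" without leaving logistic regression is to add squared terms, since a circular boundary is linear in x² and y². The sketch below is an illustration under assumptions, not something from the answers here: it uses `PolynomialFeatures` in a pipeline and `make_circles` as stand-in data.

```python
from sklearn import datasets, linear_model, model_selection
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumed stand-in for the data in the question: two concentric circles.
X, y = datasets.make_circles(n_samples=200, factor=0.5, noise=0.05,
                             random_state=0)

# Degree-2 expansion adds x^2, x*y and y^2 to the original x and y columns.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    linear_model.LogisticRegression(),
)
poly_scores = model_selection.cross_val_score(model, X, y)
print(poly_scores.mean())
```

Putting the expansion inside the pipeline means it is applied consistently within each cross-validation fold.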

There is a nice visualisation of how various classifiers handle this problem in the scikit-learn docs: see http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html . The second row is for your task:

[image]
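For a rough numeric version of that comparison, a few classifiers can be cross-validated on similar data. This is only a sketch: the choice of classifiers, their default settings, and the `make_circles` data are assumptions made for illustration.

```python
from sklearn import datasets, model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumed stand-in for the data in the question: two concentric circles.
X, y = datasets.make_circles(n_samples=200, factor=0.5, noise=0.05,
                             random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(),
    "RBF SVM": SVC(kernel="rbf"),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
}

# Mean cross-validated accuracy per classifier.
results = {}
for name, clf in classifiers.items():
    results[name] = model_selection.cross_val_score(clf, X, y).mean()
    print(name, results[name])
```

The non-linear models should score well above plain logistic regression on this kind of data.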

There have been a couple of answers already, but neither of them has mentioned any preprocessing of the data. So I will show both ways of looking at your problem.

First up, I'll look at some manifold learning to transform your data into another space.

# Do some imports that I'll be using
import numpy as np
from sklearn import datasets, manifold, linear_model
from sklearn import model_selection, ensemble, metrics
from matplotlib import pyplot as plt

# Jupyter/IPython magic; drop this line when running as a plain script
%matplotlib inline

# Make some data that looks like yours
X, y = datasets.make_circles(n_samples=200, factor=.5,
                             noise=.05)

First of all, let's look at your current problem:

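# Plot the raw data, coloured by class
plt.scatter(X[:, 0], X[:, 1], c=y)
# Cross-validate plain logistic regression on the untransformed data
clf = linear_model.LogisticRegression()
scores = model_selection.cross_val_score(clf, X, y)
print(scores.mean())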

Outputs:

[image: scatter plot of your data]

0.440433749257

So you can see this data looks like yours, and we get a terrible cross-validated accuracy with logistic regression. If you're really attached to logistic regression, what we can do is project your data into a different space using some sort of manifold learning, for example:

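# Embed the data into a new 2-D space with locally linear embedding
Xd = manifold.LocallyLinearEmbedding().fit_transform(X)
plt.scatter(Xd[:, 0], Xd[:, 1], c=y)
# Cross-validate logistic regression on the embedded data
clf = linear_model.LogisticRegression()
scores = model_selection.cross_val_score(clf, Xd, y)
print(scores.mean())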

Outputs:

[image]

1.0

So you can see that your data is now perfectly linearly separable after the LocallyLinearEmbedding, and we get a much better classifier accuracy!

The other option available to you, which other people have mentioned, is using a different model. While there are many options available to you, I'm just going to show an example using RandomForestClassifier. I'm only going to train on half the data so we can evaluate the accuracy on an unbiased, held-out set. I only used CV previously because it's quick and easy!

# Train on the first half of the data, evaluate on the second half
clf = ensemble.RandomForestClassifier().fit(X[:100], y[:100])
print(metrics.accuracy_score(y[100:], clf.predict(X[100:])))

Outputs:

0.97

So we're getting a good accuracy! If you're interested to see what's going on, we can lift some code from one of the awesome scikit-learn tutorials.

# Build a fine grid covering the data, with a small margin
plot_step = 0.02
x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

# Predict a class for every grid point and draw the regions
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, alpha=0.5)
plt.scatter(X[:, 0], X[:, 1], c=y)

Outputs:

[image: decision boundary of the RF classifier]

So this shows the areas of your space that are being classified into each class using the Random Forest model.

Two ways to solve the same problem. I leave working out which is best as an exercise to the reader...
