scikit-learn - multinomial logistic regression with probabilities as a target variable

I'm implementing a multinomial logistic regression model in Python using scikit-learn. The thing is, however, that I'd like to use a probability distribution over the classes as my target variable. As an example, let's say this is a 3-class target which looks as follows:

    class_1 class_2 class_3
0   0.0     0.0     1.0
1   1.0     0.0     0.0
2   0.0     0.5     0.5
3   0.2     0.3     0.5
4   0.5     0.1     0.4

So the sum of values in every row equals 1.
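(For reference, such a target could be held in a pandas DataFrame; this construction is illustrative, not part of the question:)

import pandas as pd

# Hypothetical construction of the example target above:
# one probability distribution over 3 classes per row.
probabilities = pd.DataFrame(
    [[0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.5, 0.5],
     [0.2, 0.3, 0.5],
     [0.5, 0.1, 0.4]],
    columns=['class_1', 'class_2', 'class_3'],
)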

How could I fit a model like this? When I try:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='saga', multi_class='multinomial')
model.fit(X, probabilities)

I get an error saying:

ValueError: bad input shape (10000, 3)

I know this is related to the fact that this method expects a vector, not a matrix. But here I can't collapse the probabilities matrix into a vector of labels, since the classes are not exclusive.

You can't have cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported in the API. It is a limitation of scikit-learn.

For logistic regression you can approximate it by upsampling instances according to the probabilities of their labels. For example, you can up-sample every instance 10x: if for a training instance class 1 has probability 0.2 and class 2 has probability 0.8, generate 10 training instances: 8 with class 2 and 2 with class 1. It won't be as efficient as it could be, but in the limit you'll be optimizing the same objective function.

You can do something like this:

from sklearn.utils import check_random_state
import numpy as np

def expand_dataset(X, y_proba, factor=10, random_state=None):
    """
    Convert a dataset with float multiclass probabilities to a dataset
    with indicator probabilities by duplicating X rows and sampling
    true labels.
    """
    rng = check_random_state(random_state)
    n_classes = y_proba.shape[1]
    classes = np.arange(n_classes, dtype=int)
    for x, probs in zip(X, y_proba):
        for label in rng.choice(classes, size=factor, p=probs):
            yield x, label
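For instance, assuming X is a 2-D numpy array with one row per instance (a sketch; the names X and probabilities come from the question), you could expand and fit like this. Pass the targets as a plain numpy array, e.g. probabilities.values for a DataFrame, since the generator iterates over rows:

from sklearn.linear_model import LogisticRegression

# Expand into (row, sampled label) pairs, then stack back into arrays.
X_exp, y_exp = zip(*expand_dataset(X, probabilities.values,
                                   factor=10, random_state=0))
X_exp, y_exp = np.asarray(X_exp), np.asarray(y_exp)

model = LogisticRegression(solver='saga', multi_class='multinomial')
model.fit(X_exp, y_exp)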

See a more complete example here: https://github.com/TeamHG-Memex/eli5/blob/8cde96878f14c8f46e10627190abd9eb9e705ed4/eli5/lime/utils.py#L16

Alternatively, you can implement your logistic regression using libraries like TensorFlow or PyTorch; unlike scikit-learn, these frameworks make it easy to define any loss, and cross-entropy is available out of the box.
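As a sketch of that route (names and hyperparameters here are illustrative, not from the answer), a multinomial logistic regression trained directly on soft targets in PyTorch could look like this:

import torch

def fit_soft_logreg(X, y_proba, lr=0.1, epochs=200):
    # A single linear layer followed by softmax is exactly
    # multinomial logistic regression.
    X = torch.as_tensor(X, dtype=torch.float32)
    y_proba = torch.as_tensor(y_proba, dtype=torch.float32)
    model = torch.nn.Linear(X.shape[1], y_proba.shape[1])
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        log_probs = torch.log_softmax(model(X), dim=1)
        # Cross-entropy against a full probability distribution,
        # not just an indicator label.
        loss = -(y_proba * log_probs).sum(dim=1).mean()
        loss.backward()
        opt.step()
    return model

Recent versions of torch.nn.CrossEntropyLoss also accept class probabilities as targets directly, so the manual loss above could be replaced with the built-in one.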

You need to input the correct labels with the training data; the logistic regression model will then give you probabilities in return when you use predict_proba(X), which returns a matrix of shape [n_samples, n_classes]. If you use just predict(X), it gives you an array of the most probable class for each sample, of shape [n_samples].
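For example (a hypothetical snippet; y stands for hard class labels of shape [n_samples]):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='saga', multi_class='multinomial')
model.fit(X, y)                 # y: hard class labels
proba = model.predict_proba(X)  # shape [n_samples, n_classes]
labels = model.predict(X)       # shape [n_samples]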
