How to build a baseline model with scikit-learn to predict a Y with multiple values

Question

I have a sample data frame that looks like the one below. I would like to build a baseline model to predict y_combined using X=df.filter(regex='x_'):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x_1':[0.1,0.2,0.1,0],
    'x_2':[0.5,0.1,0.3,0.4],
    'x_3':[0.2,0.1,0.6,0.1],
    'x_4':[0,0.5,0.2,0.3],
    'y_1': [0, 1, 1, 0],
    'y_2': [0, 0, 1, 0],
    'y_3': [0, 1, 0, 1],
    'y_combined': [np.array([0, 0, 0]), np.array([1, 0, 1]),
                   np.array([1, 1, 0]), np.array([0, 0, 1])]
})

I am new to building baseline models. To obtain y_predicted, how should I specify the DummyClassifier() model with strategy="constant"? Or is there a different strategy I should be using?

For example, if y_predicted = [1,1,1], then I will see how well the prediction model performs by getting the average centroid between y_combined and y_predicted.

Answer 1

I will answer your question under the premise that you want to use DummyClassifier with the strategy='constant' setting to build a baseline model for a multilabel classification problem, where the target equals the y_combined column of df. In this case, the following code will work:

from sklearn.dummy import DummyClassifier
import numpy as np
import pandas as pd


X = pd.DataFrame({
    'x_1': [0.1,0.2,0.1,0],
    'x_2': [0.5,0.1,0.3,0.4],
    'x_3': [0.2,0.1,0.6,0.1],
    'x_4': [0,0.5,0.2,0.3]
})
y = np.array([[0, 0, 0], [1, 0, 1], [1, 1, 0], [0, 0, 1]])

clf = DummyClassifier(strategy='constant', constant=np.array([1, 1, 1]))
clf.fit(X, y)
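
The code above hard-codes X and y. Starting from the df in the question, the same arrays could be derived roughly as follows. This is only a sketch: stacking the y_combined column with np.stack is one possible approach, not something the question prescribes, and it assumes every entry of y_combined has the same length.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'x_1': [0.1, 0.2, 0.1, 0],
    'x_2': [0.5, 0.1, 0.3, 0.4],
    'x_3': [0.2, 0.1, 0.6, 0.1],
    'x_4': [0, 0.5, 0.2, 0.3],
    'y_1': [0, 1, 1, 0],
    'y_2': [0, 0, 1, 0],
    'y_3': [0, 1, 0, 1],
    'y_combined': [np.array([0, 0, 0]), np.array([1, 0, 1]),
                   np.array([1, 1, 0]), np.array([0, 0, 1])]
})

# Feature matrix: all columns whose name contains 'x_'.
X = df.filter(regex='x_')

# Target matrix: stack the per-row arrays into shape (n_samples, n_labels).
y = np.stack(df['y_combined'].tolist())

# Alternative: build the same matrix from the individual label columns.
y_from_columns = df[['y_1', 'y_2', 'y_3']].to_numpy()

print(X.shape, y.shape)
```

Either form of y can be passed to clf.fit(X, y) in place of the hard-coded array.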

Notice that when you use strategy='constant', you also have to specify the constant value to be predicted via the constant=... parameter of DummyClassifier. You will see that the baseline model now always predicts the specified constant value, no matter the input:

y_pred = clf.predict(X)
print(y_pred)

# output
[[1 1 1]
 [1 1 1]
 [1 1 1]
 [1 1 1]]
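
To see how well this baseline does, the predictions can be compared with the true labels. The question mentions an "average centroid" measure without defining it, so the sketch below swaps in two standard multilabel metrics from sklearn.metrics instead: Hamming loss (fraction of individual labels that are wrong) and subset accuracy (fraction of rows predicted exactly). Treat the choice of metric as an assumption.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, hamming_loss

# Same features and multilabel targets as in the answer above.
X = np.array([[0.1, 0.5, 0.2, 0.0],
              [0.2, 0.1, 0.1, 0.5],
              [0.1, 0.3, 0.6, 0.2],
              [0.0, 0.4, 0.1, 0.3]])
y = np.array([[0, 0, 0], [1, 0, 1], [1, 1, 0], [0, 0, 1]])

# Constant baseline that always predicts [1, 1, 1].
clf = DummyClassifier(strategy='constant', constant=np.array([1, 1, 1]))
clf.fit(X, y)
y_pred = clf.predict(X)

# Fraction of label positions where prediction and truth disagree.
baseline_hamming = hamming_loss(y, y_pred)

# Fraction of rows where all three labels match at once.
baseline_subset_accuracy = accuracy_score(y, y_pred)

print(baseline_hamming, baseline_subset_accuracy)
```

Here 7 of the 12 label positions differ from [1, 1, 1] and no row matches exactly, so any real model should at least beat those two numbers.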

Since you also asked about other strategies: you can basically choose any of the strategies mentioned in the documentation of DummyClassifier. All share a common behaviour, as mentioned in the user guide:

Note that with all these strategies, the predict method completely ignores the input data!
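
As an illustration of one of the other strategies, here is a minimal sketch using strategy='most_frequent', which predicts the most common class separately for each label column. The second feature matrix X_other is made up purely to show that the input is ignored.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.array([[0.1, 0.5, 0.2, 0.0],
              [0.2, 0.1, 0.1, 0.5],
              [0.1, 0.3, 0.6, 0.2],
              [0.0, 0.4, 0.1, 0.3]])
y = np.array([[0, 0, 0], [1, 0, 1], [1, 1, 0], [0, 0, 1]])

# No constant is needed here: the prediction is learned from y alone.
clf = DummyClassifier(strategy='most_frequent')
clf.fit(X, y)

# Hypothetical, unrelated inputs with the same number of features.
X_other = np.random.default_rng(0).random((4, 4))

pred_same_input = clf.predict(X)
pred_other_input = clf.predict(X_other)

# Both calls return the same rows because predict ignores the features.
print(pred_same_input)
print(np.array_equal(pred_same_input, pred_other_input))
```

With so few rows, two of the three label columns are tied between 0 and 1, so the result depends on how ties are broken; on a realistic data set the per-column majority is usually unambiguous.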

Solution 1
1 Accepted 2021-05-17 20:41:31
