Multilabel classification in scikit-learn with hyperparameter search: specifying averaging

I am working on a simple multioutput classification problem and noticed this error showing up whenever I run the code below:

ValueError: Target is multilabel-indicator but average='binary'. Please 
choose another average setting, one of [None, 'micro', 'macro', 'weighted', 'samples'].

I understand the problem it is referencing, i.e., that when evaluating multilabel models one needs to explicitly set the type of averaging. Nevertheless, I am unable to figure out where this average argument should go, since only built-in metric functions such as precision_score and recall_score have this argument, and I do not call them explicitly in my code. Moreover, since I am doing a RandomizedSearchCV, I cannot just pass precision_score(average='micro') to the scoring or refit arguments either, since precision_score() requires the true and predicted y labels to be passed. This is why this former SO question and this one here, both with a similar issue, didn't help. For reference, I can reproduce the error with the metric function alone, as shown after the code below.

My code with example data generation is as follows:

from sklearn.datasets import make_multilabel_classification
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, Y = make_multilabel_classification(
    n_samples=1000,
    n_features=2,
    n_classes=5,
    n_labels=2
)

pipe = Pipeline(
    steps = [
        ('scaler', MinMaxScaler()),
        ('model', MultiOutputClassifier(MultinomialNB()))
    ]
)

search = RandomizedSearchCV(
    estimator = pipe,
    param_distributions={'model__estimator__alpha': (0.01,1)},
    scoring = ['accuracy', 'precision', 'recall'],
    refit = 'precision',
    cv = 5
).fit(X, Y)
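As far as I can tell, the built-in 'precision' scorer resolves to precision_score with its default average='binary', which is where the error comes from (a minimal sketch reusing Y from above):

from sklearn.metrics import precision_score

# precision_score defaults to average='binary', which is invalid for
# multilabel indicator targets and raises the ValueError quoted above:
precision_score(Y, Y)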

What am I missing?

From the scikit-learn docs, I see that you can pass scoring a callable that returns a dictionary where the keys are the metric names and the values are the metric scores. This means you can write your own scoring function, which has to take the estimator, X_test, and y_test as inputs. It must then compute y_pred and use that to compute the scores you want. You can do this with the built-in metric functions, where you can specify which keyword arguments should be used to compute the scores. In code, that would look like:

from sklearn.metrics import accuracy_score, precision_score, recall_score

def my_scorer(estimator, X_test, y_test) -> dict[str, float]:
    y_pred = estimator.predict(X_test)
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, average='micro'),
        'recall': recall_score(y_test, y_pred, average='micro'),
    }

search = RandomizedSearchCV(
    estimator = pipe,
    param_distributions={'model__estimator__alpha': (0.01,1)},
    scoring = my_scorer,
    refit = 'precision',
    cv = 5
).fit(X, Y)
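The per-metric cross-validation results are then available under the keys returned by the scorer (a quick usage sketch, assuming the fit above succeeded):

# with multi-metric scoring, results appear under 'mean_test_<name>' keys,
# and refit='precision' means best_params_ is selected by precision:
print(search.best_params_)
print(search.cv_results_['mean_test_precision'])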

From the table of scoring metrics, note f1_micro, f1_macro, etc., and the note "suffixes apply as with 'f1'" given for precision and recall. So, e.g.:

search = RandomizedSearchCV(
    ...
    scoring = ['accuracy', 'precision_micro', 'recall_macro'],
    ...
)
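Note that refit must then also match one of the listed names exactly, e.g. 'precision_micro' rather than 'precision'. A full version of the call (a sketch reusing pipe, X, and Y from the question) would be:

search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions={'model__estimator__alpha': (0.01, 1)},
    # each string resolves to the corresponding metric with the averaging
    # encoded in its suffix, so no custom scorer is needed
    scoring=['accuracy', 'precision_micro', 'recall_macro'],
    refit='precision_micro',  # must be one of the names in scoring
    cv=5
).fit(X, Y)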
