How do I create a custom evaluation metric for catboost?

Question

Similar questions:

Python Catboost: custom metric for multiclass F1 score

Catboost tutorial

https://catboost.ai/docs/concepts/python-usages-examples.html#user-defined-loss-function

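Question

I have a binary classification problem. After modelling, we have the model's test predictions y_pred and the true test labels y_true.

I want a custom evaluation metric defined by the following equation:

profit = 400 * truePositive - 200 * falseNegative - 100 * falsePositive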

Also, since higher profit is better, I want to maximize the function rather than minimize it.
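For a concrete sense of the formula, here is a minimal sketch in plain Python that counts the confusion-matrix cells by hand and applies the weights given above. The function name `profit_from_labels` and the sample labels are made up for illustration.

```python
def profit_from_labels(y_true, y_pred):
    """Profit = 400*TP - 200*FN - 100*FP for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return 400 * tp - 200 * fn - 100 * fp

# Illustrative labels: 2 true positives, 1 false negative, 1 false positive
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
example_profit = profit_from_labels(y_true, y_pred)  # 400*2 - 200*1 - 100*1 = 500
```

True negatives contribute nothing, so a model that predicts 0 everywhere is only penalised for the positives it misses.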

How do I get this eval_metric in catboost?

Using sklearn

import sklearn.metrics

def get_profit(y_true, y_pred):
    # labels=[0, 1] keeps the matrix 2x2 even if one class is absent
    tn, fp, fn, tp = sklearn.metrics.confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    profit = 400*tp - 200*fn - 100*fp
    return profit

scoring = sklearn.metrics.make_scorer(get_profit, greater_is_better=True)

Using catboost

class ProfitMetric(object):
    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)

    def is_max_optimal(self):
        return True

    def evaluate(self, approxes, target, weight):
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])

        approx = approxes[0]

        error_sum = 0.0
        weight_sum = 0.0

        # ** I don't know what to put here **

        return error_sum, weight_sum

Question

How do I complete the custom eval metric in catboost?

Update

My attempt so far:

import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

def get_profit(y_true, y_pred):
    tn, fp, fn, tp = sklearn.metrics.confusion_matrix(y_true,y_pred).ravel()
    profit = 400*tp - 200*fn - 100*fp
    return profit


class ProfitMetric:
    def is_max_optimal(self):
        return True # greater is better

    def evaluate(self, approxes, target, weight):
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])

        approx = approxes[0]

        y_pred = np.rint(approx)
        y_true = np.array(target).astype(int)

        output_weight = 1 # weight is not used

        score = get_profit(y_true, y_pred)
 
        return score, output_weight

    def get_final_error(self, error, weight):
        return error


df = sns.load_dataset('titanic')
X = df[['survived','pclass','age','sibsp','fare']]
y = X.pop('survived')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)


model = CatBoostClassifier(metric_period=50,
  n_estimators=200,
  eval_metric=ProfitMetric()
)

model.fit(X, y, eval_set=(X_test, y_test)) # this fails

Answer 1

The main difference from your code is:

@staticmethod
def get_profit(y_true, y_pred):
    # expit maps raw log-odds to probabilities; threshold at 0.5 to get 0/1 labels
    y_pred = (expit(y_pred) > 0.5).astype(int)
    y_true = y_true.astype(int)
    #print("ACCURACY:",(y_pred==y_true).mean())
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    profit = 400*tp - 200*fn - 100*fp
    return profit

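It is not obvious from the example you linked what the predictions are, but on inspection it turns out that catboost internally treats predictions as raw log-odds (hat tip @Ben). So, to use confusion_matrix correctly, you need to make sure that both y_true and y_pred are integer class labels. This is done as follows: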

y_pred = (scipy.special.expit(y_pred) > 0.5).astype(int)
y_true = y_true.astype(int)
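To make the conversion concrete, here is a small sketch using only the standard library. It assumes a decision threshold of 0.5 on the probability, which is equivalent to thresholding the raw log-odds at 0. The helper name `log_odds_to_labels` is hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function: maps raw log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def log_odds_to_labels(raw_scores, threshold=0.5):
    """Turn raw log-odds into 0/1 class labels (the threshold is an assumption)."""
    return [1 if sigmoid(s) > threshold else 0 for s in raw_scores]

raw = [-2.0, -0.1, 0.3, 4.0]
labels = log_odds_to_labels(raw)                  # [0, 0, 1, 1]
same_as_sign = [1 if s > 0 else 0 for s in raw]   # thresholding at 0 gives the same labels
```

A plain integer cast of the probability would truncate every value below 1 to 0, which is why an explicit threshold is used here.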

So the complete working code is:

import numpy as np
import seaborn as sns
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from scipy.special import expit

df = sns.load_dataset('titanic')
X = df[['survived','pclass','age','sibsp','fare']]
y = X.pop('survived')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

class ProfitMetric:
    
    @staticmethod
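    def get_profit(y_true, y_pred):
        # expit maps raw log-odds to probabilities; threshold at 0.5 to get 0/1 labels
        y_pred = (expit(y_pred) > 0.5).astype(int)
        y_true = y_true.astype(int)
        #print("ACCURACY:",(y_pred==y_true).mean())
        # labels=[0, 1] keeps the matrix 2x2 even if one class is absent
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        profit = 400*tp - 200*fn - 100*fp
        return profit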
    
    def is_max_optimal(self):
        return True # greater is better

    def evaluate(self, approxes, target, weight):            
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])
        y_true = np.array(target).astype(int)
        approx = approxes[0]
        score = self.get_profit(y_true, approx)
        return score, 1

    def get_final_error(self, error, weight):
        return error

model = CatBoostClassifier(metric_period=50,
  n_estimators=200,
  eval_metric=ProfitMetric()
)

# fit on the training split so the eval set stays unseen
model.fit(X_train, y_train, eval_set=(X_test, y_test))
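The metric above ignores the `weight` argument. If sample weights matter, one possible variant is sketched below in plain Python. It assumes, as in the CatBoost tutorial linked in the question, that `approxes` holds a single container of raw log-odds for binary classification and that `weight` may be `None`. The class name `WeightedProfitMetric` is made up for illustration.

```python
class WeightedProfitMetric:
    """Sketch: profit metric that honours per-sample weights (hypothetical name)."""

    def is_max_optimal(self):
        return True  # greater is better

    def evaluate(self, approxes, target, weight):
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])
        approx = approxes[0]

        profit = 0.0
        for i in range(len(approx)):
            w = 1.0 if weight is None else weight[i]
            predicted = 1 if approx[i] > 0 else 0  # log-odds > 0 means p > 0.5
            actual = int(target[i])
            if actual == 1 and predicted == 1:
                profit += 400 * w   # true positive
            elif actual == 1 and predicted == 0:
                profit -= 200 * w   # false negative
            elif actual == 0 and predicted == 1:
                profit -= 100 * w   # false positive
        return profit, 1.0

    def get_final_error(self, error, weight):
        return error


# The class can be exercised without CatBoost by calling evaluate directly
metric = WeightedProfitMetric()
unweighted, _ = metric.evaluate([[2.0, -1.0, 0.5, -3.0]], [1, 1, 0, 0], None)
weighted, _ = metric.evaluate([[2.0, -1.0, 0.5, -3.0]], [1, 1, 0, 0], [1.0, 2.0, 0.5, 1.0])
```

Calling `evaluate` by hand like this is also a cheap way to check a custom metric before passing it to `eval_metric`.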

Answer 2

As an example, I implemented a very simple metric.

It counts the number of times y_pred != y_true in a multiclass classifier.

class CountErrors:
    '''Count of wrong predictions'''
    
    def is_max_optimal(self):
        return False # Lower is better

    def evaluate(self, approxes, target, weight):  
        
        y_pred = np.array(approxes).argmax(0)
        y_true = np.array(target)
                                    
        return sum(y_pred!=y_true), 1

    def get_final_error(self, error, weight):
        return error

If you run this code, you can see that the metric is being used:

import numpy as np
import pandas as pd

from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

class CountErrors:
    '''Count number of wrong predictions'''
    
    def is_max_optimal(self):
        return False # Lower is better

    def evaluate(self, approxes, target, weight):  
        
        y_pred = np.array(approxes).argmax(0)
        y_true = np.array(target)
                                    
        return sum(y_pred!=y_true), 1

    def get_final_error(self, error, weight):
        return error
    

df = pd.read_csv('https://raw.githubusercontent.com/mkleinbort/resource-datasets/master/abalone/abalone.csv')
y = df['sex']
X = df.drop(columns=['sex'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

model = CatBoostClassifier(metric_period=50, n_estimators=200, eval_metric=CountErrors())

# fit on the training split so the eval set stays unseen
model.fit(X_train, y_train, eval_set=(X_test, y_test))

Hopefully you can adapt this to your use case.
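The `argmax(0)` step in the metric relies on `approxes` holding one sequence of raw scores per class, so stacking them gives an array of shape (n_classes, n_samples). A small sketch with made-up scores, assuming that layout and integer class indices in `target`:

```python
import numpy as np

# Hypothetical raw scores for 3 classes and 4 samples: one row per class
approxes = [
    [2.0, 0.1, -1.0, 0.3],   # class 0
    [0.5, 1.5, -0.5, 0.2],   # class 1
    [-1.0, 0.0, 3.0, 0.9],   # class 2
]
target = [0, 1, 2, 0]

scores = np.array(approxes)        # shape (3, 4)
y_pred = scores.argmax(axis=0)     # predicted class index per sample
y_true = np.array(target)
n_errors = int((y_pred != y_true).sum())
```

Only the last sample is misclassified here (predicted class 2, true class 0), so the count is 1.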


2 solutions

Solution 1
4 Accepted 2020-12-29 22:12:04

Solution 2
1 2020-12-29 16:24:23
