
Multi-class, multi-label, ordinal classification with sklearn

I was wondering how to run a multi-class, multi-label, ordinal classification with sklearn. I want to predict a ranking of target groups, ranging from the one that is most prevalent at a certain location (1) to the one that is least prevalent (7). I don't seem to be able to get it right. Could you please help me out?


# Random Forest Classification

# Import
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.metrics import make_scorer, accuracy_score, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Import dataset
dataset = pd.read_excel('alle_probs_edit.v2.xlsx')
X = dataset.iloc[:,4:-1].values
Y = dataset.iloc[:,-1].values

# Split in Train and Test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42 )

# Feature scaling (bring all variables onto the same scale); whether this is necessary depends on the chosen method
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# Create classifier
classifier = RandomForestClassifier(criterion = 'entropy')

# Choose some parameter combinations to try
parameters = {'bootstrap': [True, False],
              'max_depth': [50],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [1, 2, 3, 4],
              'min_samples_split': [9, 10, 11, 12, 13],
              'n_estimators': [500, 1000, 1500]}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)

# Run the grid search
grid_obj = GridSearchCV(classifier, parameters, scoring=acc_scorer, cv = 3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, Y_train)

# Set the classifier to the best combination of parameters
classifier = grid_obj.best_estimator_

# Fit the best algorithm to the data
classifier.fit(X_train, Y_train)

# Predict on the test data
Y_pred = classifier.predict(X_test)

#Confusion Matrix
cm = pd.DataFrame(confusion_matrix(Y_test, Y_pred))

#Accuracy
accuracy1 = accuracy_score(Y_test, Y_pred)
print("Accuracy1: %.2f%%" % (accuracy1 * 100.0))

# k-Fold Cross-Validation
accuracies = cross_val_score(estimator = classifier, X = X_train, y = Y_train, cv = 10)
print("10-fold CV accuracy: %.2f%% (std: %.2f%%)" % (accuracies.mean() * 100.0, accuracies.std() * 100.0))

This may not be the precise answer you're looking for, but this article outlines a technique as follows:

We can take advantage of the ordered class value by transforming a k-class ordinal regression problem into k-1 binary classification problems: we convert an ordinal attribute A* with ordinal values V1, V2, V3, ..., Vk into k-1 binary attributes, one for each of the original attribute's first k-1 values. The ith binary attribute represents the test A* > Vi.

Essentially, it aggregates multiple binary classifiers (predict target > 1, target > 2, target > 3, target > 4) to predict whether a target is 1, 2, 3, 4, or 5. The author creates an OrdinalClassifier class that stores these binary classifiers in a Python dictionary.
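For instance, the threshold targets for the asker's seven ranks could be built like this (a minimal sketch; the labels are illustrative):

import numpy as np

y = np.array([1, 3, 7, 2, 5, 4, 6])    # illustrative ordinal labels
classes = np.sort(np.unique(y))         # V1 < V2 < ... < Vk
# one binary problem per threshold V1..Vk-1: "is the label greater than Vi?"
binary_targets = {v: (y > v).astype(np.uint8) for v in classes[:-1]}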

import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

class OrdinalClassifier():

    def __init__(self, clf):
        self.clf = clf
        self.clfs = {}

    def fit(self, X, y):
        self.unique_class = np.sort(np.unique(y))
        if self.unique_class.shape[0] > 2:
            for i in range(self.unique_class.shape[0]-1):
                # for each k - 1 ordinal value we fit a binary classification problem
                binary_y = (y > self.unique_class[i]).astype(np.uint8)
                clf = clone(self.clf)
                clf.fit(X, binary_y)
                self.clfs[i] = clf

    def predict_proba(self, X):
        clfs_predict = {k: self.clfs[k].predict_proba(X) for k in self.clfs}
        predicted = []
        for i, y in enumerate(self.unique_class):
            if i == 0:
                # V1 = 1 - Pr(y > V1)
                predicted.append(1 - clfs_predict[i][:,1])
            elif i in clfs_predict:
                # Vi = Pr(y > Vi-1) - Pr(y > Vi)
                predicted.append(clfs_predict[i-1][:,1] - clfs_predict[i][:,1])
            else:
                # Vk = Pr(y > Vk-1)
                predicted.append(clfs_predict[i-1][:,1])
        return np.vstack(predicted).T

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)

    def score(self, X, y, sample_weight=None):
        _, indexed_y = np.unique(y, return_inverse=True)
        return accuracy_score(indexed_y, self.predict(X), sample_weight=sample_weight)

The technique originates in the paper A Simple Approach to Ordinal Classification.
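For example, applying it to the random forest from the question might look like the following minimal sketch (it assumes the X_train, Y_train, X_test and Y_test variables from the script above; the hyperparameters are illustrative):

from sklearn.ensemble import RandomForestClassifier

# wrap any classifier that supports predict_proba
ordinal_clf = OrdinalClassifier(RandomForestClassifier(criterion='entropy', n_estimators=500, random_state=42))
ordinal_clf.fit(X_train, Y_train)

# predict() returns indices into the sorted unique classes,
# so map them back to the original rank labels
Y_pred = ordinal_clf.unique_class[ordinal_clf.predict(X_test)]
print("Ordinal accuracy: %.2f%%" % (accuracy_score(Y_test, Y_pred) * 100.0))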

Here is an example using KNN that should be tunable in an sklearn pipeline or grid search.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.base import clone, BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_is_fitted, check_array
from sklearn.utils.multiclass import check_classification_targets

class KNeighborsOrdinalClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, n_neighbors=5, *, weights='uniform', 
                 algorithm='auto', leaf_size=30, p=2, 
                 metric='minkowski', metric_params=None, n_jobs=None):
        
        self.n_neighbors = n_neighbors
        self.weights = weights
        self.algorithm = algorithm
        self.leaf_size = leaf_size
        self.p = p
        self.metric = metric
        self.metric_params = metric_params
        self.n_jobs = n_jobs
        
    def fit(self, X, y):
        X, y = check_X_y(X, y)
        check_classification_targets(y)
        
        self.clf_ = KNeighborsClassifier(**self.get_params())
        self.clfs_ = {}
        self.classes_ = np.sort(np.unique(y))
        if self.classes_.shape[0] > 2:
            for i in range(self.classes_.shape[0]-1):
                # for each k - 1 ordinal value we fit a binary classification problem
                binary_y = (y > self.classes_[i]).astype(np.uint8)
                clf = clone(self.clf_)
                clf.fit(X, binary_y)
                self.clfs_[i] = clf
        return self
    
    def predict_proba(self, X):
        X = check_array(X)
        check_is_fitted(self, ['classes_', 'clf_', 'clfs_'])
        
        # self.clfs_ is keyed by threshold index, so index by i (not by the
        # class label y), which also works for labels that are not 0-based
        clfs_predict = {i: self.clfs_[i].predict_proba(X) for i in self.clfs_}
        predicted = []
        for i, y in enumerate(self.classes_):
            if i == 0:
                # V1 = 1 - Pr(y > V1)
                predicted.append(1 - clfs_predict[i][:,1])
            elif i in clfs_predict:
                # Vi = Pr(y > Vi-1) - Pr(y > Vi)
                predicted.append(clfs_predict[i-1][:,1] - clfs_predict[i][:,1])
            else:
                # Vk = Pr(y > Vk-1)
                predicted.append(clfs_predict[i-1][:,1])
        return np.vstack(predicted).T
    
    def predict(self, X):
        X = check_array(X)
        check_is_fitted(self, ['classes_', 'clf_', 'clfs_'])
        
        # map the argmax index back to the original labels so that
        # ClassifierMixin.score works for labels that are not 0-based
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
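As a sketch of that tuning (assuming the X_train and Y_train from the question), the estimator can be dropped into a pipeline and grid-searched like any other sklearn classifier:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scale', StandardScaler()),
                 ('ord_knn', KNeighborsOrdinalClassifier())])

# an illustrative grid; any KNeighborsClassifier hyperparameter can be searched
param_grid = {'ord_knn__n_neighbors': [3, 5, 7, 9],
              'ord_knn__weights': ['uniform', 'distance']}

search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
search.fit(X_train, Y_train)
print(search.best_params_, search.best_score_)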

Building off David Diaz, the white paper, and Kartik above, along with others linked on Medium and attributed in the readme, I'm working on an OrdinalClassifier that is built on the sklearn framework and works well with sklearn pipelines, scoring, and cross-validation.

The OC performs very well vs. standard non-ordinal multiclass classification and gives greater control over optimizing for precision/recall on the positive class (e.g. "high" in a diabetes disease progression of low < medium < high classes). It supports any sklearn classifier that supports predict_proba. Cross-validation scores are shown on the repo.

OrdinalClassifier based on sklearn:

https://github.com/leeprevost/OrdinalClassifier

At this time, I would not call it multi-label.
