VotingClassifier: Different Feature Sets

I have two different feature sets (with the same number of rows and the same labels), in my case DataFrames:

df1:

| A | B | C |
-------------
| 1 | 4 | 2 |
| 1 | 4 | 8 |
| 2 | 1 | 1 |
| 2 | 3 | 0 |
| 3 | 2 | 5 |

df2:

| E | F |
---------
| 6 | 1 |
| 1 | 3 |
| 8 | 1 |
| 2 | 8 |
| 5 | 2 |

labels:

| labels |
----------
|    5   |
|    5   |
|    1   |
|    7   |
|    3   |

I want to use them to train a VotingClassifier, but the fitting step only accepts a single feature set. The goal is to fit clf1 with df1 and clf2 with df2.

eclf = VotingClassifier(estimators=[('df1-clf', clf1), ('df2-clf', clf2)], voting='soft')
eclf.fit(...)

How should I handle this kind of situation? Is there an easy solution?

It's pretty easy to write custom functions that do what you want to achieve.

Import the prerequisites and define the helper functions:

import numpy as np
from sklearn.preprocessing import LabelEncoder

def fit_multiple_estimators(classifiers, X_list, y, sample_weights=None):

    # Encode the labels `y` as integers 0..n_classes-1 with LabelEncoder, because
    # prediction works on class indices, which are converted back to the original labels later.
    le_ = LabelEncoder()
    le_.fit(y)
    transformed_y = le_.transform(y)

    # Fit each estimator on its own feature array, using the encoded labels
    estimators_ = []
    for (_, clf), X in zip(classifiers, X_list):
        if sample_weights is None:
            estimators_.append(clf.fit(X, transformed_y))
        else:
            estimators_.append(clf.fit(X, transformed_y, sample_weight=sample_weights))

    return estimators_, le_


def predict_from_multiple_estimator(estimators, label_encoder, X_list, weights = None):

    # Predict 'soft' voting with probabilities

    pred1 = np.asarray([clf.predict_proba(X) for clf, X in zip(estimators, X_list)])
    pred2 = np.average(pred1, axis=0, weights=weights)
    pred = np.argmax(pred2, axis=1)

    # Convert integer predictions to original labels:
    return label_encoder.inverse_transform(pred)

The logic is taken from the VotingClassifier source.
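
To make the soft-voting step concrete, here is a minimal sketch on made-up numbers. The probability arrays are hypothetical (they do not come from fitted models); they only show how np.average, np.argmax and LabelEncoder.inverse_transform combine, and how the weights argument can change the winning class.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical class probabilities from two estimators
# for 2 samples and 3 classes (each row sums to 1).
proba_a = np.array([[0.6, 0.3, 0.1],
                    [0.2, 0.5, 0.3]])
proba_b = np.array([[0.1, 0.2, 0.7],
                    [0.1, 0.2, 0.7]])

# Shape: (n_estimators, n_samples, n_classes)
stacked = np.asarray([proba_a, proba_b])

# Unweighted soft vote: plain mean over the estimator axis.
mean_proba = np.average(stacked, axis=0)
# Weighted soft vote: the first estimator counts three times as much.
weighted_proba = np.average(stacked, axis=0, weights=[3, 1])

# The column index of the highest averaged probability is the encoded class;
# the LabelEncoder maps it back to the original label.
le = LabelEncoder().fit(["setosa", "versicolor", "virginica"])
unweighted_labels = le.inverse_transform(np.argmax(mean_proba, axis=1))
weighted_labels = le.inverse_transform(np.argmax(weighted_proba, axis=1))

print(unweighted_labels)  # ['virginica' 'virginica']
print(weighted_labels)    # ['setosa' 'versicolor']
```

This only works if every estimator orders its probability columns the same way, which holds when they are all fitted on the same set of labels.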

Now test the functions above. First, get some data:

from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = []

#Convert int classes to string labels
for x in data.target:
    if x==0:
        y.append('setosa')
    elif x==1:
        y.append('versicolor')
    else:
        y.append('virginica')

Split the data into train and test sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

Split X into separate feature subsets:

X_train1, X_train2 = X_train[:,:2], X_train[:,2:]
X_test1, X_test2 = X_test[:,:2], X_test[:,2:]

X_train_list = [X_train1, X_train2]
X_test_list = [X_test1, X_test2]
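
With two separate DataFrames as in the question, the feature lists can be built directly from them instead of slicing one array. The sketch below rebuilds df1, df2 and labels from the question's tables; the 40% test size and random_state=0 are my own assumptions for illustration. It relies on train_test_split accepting several arrays at once and splitting them on the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The two feature sets and the labels from the question.
df1 = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [4, 4, 1, 3, 2], "C": [2, 8, 1, 0, 5]})
df2 = pd.DataFrame({"E": [6, 1, 8, 2, 5], "F": [1, 3, 1, 8, 2]})
labels = pd.Series([5, 5, 1, 7, 3], name="labels")

# Every feature set must describe the same samples in the same order.
assert len(df1) == len(df2) == len(labels)
assert df1.index.equals(df2.index)

# Passing all three objects in one call keeps the rows aligned across the splits.
df1_train, df1_test, df2_train, df2_test, y_train, y_test = train_test_split(
    df1, df2, labels, test_size=0.4, random_state=0
)

# One array per estimator, in the same order as the classifiers list.
X_train_list = [df1_train.to_numpy(), df2_train.to_numpy()]
X_test_list = [df1_test.to_numpy(), df2_test.to_numpy()]

print([X.shape for X in X_train_list])  # [(3, 3), (3, 2)]
print([X.shape for X in X_test_list])   # [(2, 3), (2, 2)]
```

The five-row example is far too small to train on; it is only meant to show the shapes and the row alignment.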

Get the list of classifiers:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Make sure the number of estimators here equals the number of feature subsets
classifiers = [('knn',  KNeighborsClassifier(3)),
    ('svc', SVC(kernel="linear", C=0.025, probability=True))]

Fit the classifiers with the data:

fitted_estimators, label_encoder = fit_multiple_estimators(classifiers, X_train_list, y_train)

Predict using the test data:

y_pred = predict_from_multiple_estimator(fitted_estimators, label_encoder, X_test_list)

Get the accuracy of the predictions:

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
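
To judge whether the ensemble actually helps, it can be compared with each estimator on its own. The following is a self-contained sketch under my own assumptions (integer iris labels, random_state=0, a stratified split), not part of the functions above; the scores dictionary and loop are just one way to organise the comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# Same setup as above: each model gets its own pair of columns.
names = ["knn", "svc"]
feature_slices = [slice(0, 2), slice(2, 4)]
models = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025, probability=True, random_state=0),
]

scores = {}
probas = []
for name, model, cols in zip(names, models, feature_slices):
    model.fit(X_train[:, cols], y_train)
    # Accuracy of this estimator alone, on its own feature subset.
    scores[name] = accuracy_score(y_test, model.predict(X_test[:, cols]))
    probas.append(model.predict_proba(X_test[:, cols]))

# Soft vote: average the probabilities, then take the most likely class.
# Both models saw the same labels, so their classes_ arrays are identical.
ensemble_pred = models[0].classes_[np.argmax(np.mean(probas, axis=0), axis=1)]
scores["ensemble"] = accuracy_score(y_test, ensemble_pred)

print(scores)
```

If one estimator clearly dominates, passing weights to the averaging step is the natural next thing to try.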

Feel free to ask if anything is unclear.

To use as many sklearn tools as possible, I find the following approach more appealing.

from sklearn.base import TransformerMixin, BaseEstimator
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier

######################
# custom transformer for sklearn pipeline
class ColumnExtractor(TransformerMixin, BaseEstimator):
    def __init__(self, cols):
        self.cols = cols

    def transform(self, X):
        col_list = []
        for c in self.cols:
            col_list.append(X[:, c:c+1])
        return np.concatenate(col_list, axis=1)

    def fit(self, X, y=None):
        return self

######################
# processing data
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

######################
# fit clf1 with df1
pipe1 = Pipeline([
    ('col_extract', ColumnExtractor( cols=range(0,2) )), # selecting features 0 and 1 (df1) to be used with LR (clf1)
    ('clf', LogisticRegression())
    ])

pipe1.fit(X_train, y_train) # sanity check
pipe1.score(X_test,y_test) # sanity check
# output: 0.6842105263157895

######################
# fit clf2 with df2
pipe2 = Pipeline([
    ('col_extract', ColumnExtractor( cols=range(2,4) )), # selecting features 2 and 3 (df2) to be used with SVC (clf2)
    ('clf', SVC(probability=True))
    ])

pipe2.fit(X_train, y_train) # sanity check
pipe2.score(X_test,y_test) # sanity check
# output: 0.9736842105263158

######################
# ensemble/voting classifier where clf1 fitted with df1 and clf2 fitted with df2
eclf = VotingClassifier(estimators=[('df1-clf1', pipe1), ('df2-clf2', pipe2)], voting='soft', weights= [1, 0.5])
eclf.fit(X_train, y_train)
eclf.score(X_test,y_test)
# output: 0.9473684210526315
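
When the feature sets are DataFrames, as in the question, the same idea can be sketched with sklearn's built-in ColumnTransformer instead of a custom transformer: join the frames side by side and let each pipeline keep only its own columns by name. This is an alternative I am suggesting, not something from the answers above. The iris columns stand in for df1 and df2, and on_columns is a hypothetical helper name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris(as_frame=True)
df1 = iris.data.iloc[:, :2]  # stand-in for the question's df1
df2 = iris.data.iloc[:, 2:]  # stand-in for the question's df2
y = iris.target

# Join the feature sets side by side so VotingClassifier sees a single X.
X = pd.concat([df1, df2], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)


def on_columns(columns, clf):
    """Hypothetical helper: a pipeline that keeps only `columns`, then fits `clf`."""
    selector = ColumnTransformer(
        [("keep", "passthrough", list(columns))], remainder="drop"
    )
    return Pipeline([("select", selector), ("clf", clf)])


eclf = VotingClassifier(
    estimators=[
        ("df1-clf1", on_columns(df1.columns, LogisticRegression(max_iter=1000))),
        ("df2-clf2", on_columns(df2.columns, SVC(probability=True, random_state=0))),
    ],
    voting="soft",
)
eclf.fit(X_train, y_train)
accuracy = eclf.score(X_test, y_test)
print(accuracy)
```

Selecting by column name rather than position means the pipelines keep working if the column order of the joined frame changes.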
