
Why is the error of my AdaBoost implementation not going down?

I'm trying to implement AdaBoost M1 in Python from this pseudocode (image not reproduced here):
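
(For reference, since the image is missing: the AdaBoost.M1 steps, as given in Hastie et al., *The Elements of Statistical Learning*, and matching the line references used in the answer below, are roughly:)

    1. Initialize the observation weights w_i = 1/N, i = 1, ..., N.
    2. For t = 1 to T:
       (a) Fit a classifier G_t(x) to the training data using weights w_i.
       (b) Compute err_t = sum_i w_i * I(y_i != G_t(x_i)) / sum_i w_i.
       (c) Compute alpha_t = log((1 - err_t) / err_t).
       (d) Set w_i <- w_i * exp(alpha_t * I(y_i != G_t(x_i))), i = 1, ..., N.
    3. Output G(x) = sign( sum_t alpha_t * G_t(x) ).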

I have gotten some way; however, the number of "wrong predictions" is not declining.

I have checked my weight-updating function, and it seems to be updating the weights correctly.

The error might be in the classifier, since the number of "incorrect predictions" is the same integer every other iteration (I have tried 100 iterations). I have no idea why it is not giving fewer incorrect predictions per iteration.

A tip would be greatly appreciated. Thanks :)

from sklearn import tree
import pandas as pd
import numpy as np
import math

df = pd.read_csv("./dataset(3)/adaboost_train.csv")
X_train = df.loc[:,'x1':'x10']
Y_train = df[['y']]



def adaBoost(X_train,Y_train):
    classifiers = []
    # initializing the weights:
    N = len(Y_train)
    w_i = [1 / N] * N

    T = 20
    x_train = (X_train.apply(lambda x: x.tolist(), axis=1))
    clf_errors = []

    for t in range(T):
        print("Iteration:", t)
        # clf = clf2.fit(X_train,Y_train, sample_weight = w_i)

        clf = tree.DecisionTreeClassifier(max_depth=1)
        clf.fit(X_train, Y_train, sample_weight = w_i)

        #Predict all the values:
        y_pred = []
        for sample in x_train:
            p = clf.predict([sample])
            p = p[0]
            y_pred.append(p)
        num_of_incorrect = calculate_error_clf(y_pred, Y_train)


        clf_errors.append(num_of_incorrect)

        error_internal = calc_error(w_i,Y_train,y_pred)

        alpha = np.log((1-error_internal)/ error_internal)
        print(alpha)

        # Add the predictions, error and alpha for later use for every iteration
        classifiers.append((y_pred, error_internal, alpha))

        if t == 2 and y_pred == classifiers[0][0]:
            print("TRUE")


        w_i = update_weights(w_i,y_pred,Y_train,alpha,clf)


def calc_error(weights,Y_train,y_pred):
    err = 0
    for i in range(len(weights)):
        if y_pred[i] != Y_train['y'].iloc[i]:
            err= err + weights[i]
    # Normalizing the error:
    err = err/np.sum(weights)
    return err

# If the prediction is true, return 0. If it is not true, return 1.
def check_pred(y_p, y_t):
    if y_p == y_t:
        return 0
    else:
        return 1

def update_weights(w,y_pred,Y_train,alpha,clf):
    for j in range(len(w)):
        if y_pred[j] != Y_train['y'].iloc[j]:
            w[j] = w[j]* (np.exp( alpha * 1))
    return w

def calculate_error_clf(y_pred, y):
    sum_error = 0
    for i in range(len(y)):
        if y_pred[i] != y.iloc[i]['y']:
            sum_error += 1
        e = (y_pred[i] - y.iloc[i]['y'])**2


        #sum_error += e
    sum_error = sum_error
    return sum_error
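
Note that the loop above stores (y_pred, error_internal, alpha) for every round but never combines them into a final classifier (line 3 of the pseudocode). A minimal, hypothetical sketch of that combination step, assuming adaBoost is changed to return the classifiers list and that the labels are encoded as +1/-1 (the ensemble_predict name is illustrative, not part of the original code):

def ensemble_predict(classifiers, n_samples):
    # classifiers: list of (y_pred, error, alpha) tuples collected per round.
    # The final prediction is the sign of the alpha-weighted vote (line 3).
    scores = np.zeros(n_samples)
    for y_pred, _, alpha in classifiers:
        scores += alpha * np.array(y_pred)
    return np.sign(scores)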


I am expecting the error to go down, but it is not. For example:

iteration 1: num_of_incorrect 4444
iteration 2: num_of_incorrect 4762
iteration 3: num_of_incorrect 4353
iteration 4: num_of_incorrect 4762
iteration 5: num_of_incorrect 4450
iteration 6: num_of_incorrect 4762
...
does not converge



The number of misclassifications will NOT go down with each iteration, since each classifier is a weak classifier. AdaBoost is an ensemble model that gives more weight to the previously misclassified samples. So in the next iteration some of the previously misclassified samples will be classified correctly, but this may also cause previously correctly classified samples to go wrong (hence the iteration-level error does not improve). Even though each classifier is weak, since the final output is the weighted sum of all the classifiers, the final classification converges to a strong learner (see line 3 of the algorithm).

My implementation using numpy:

from sklearn import tree
import pandas as pd
import numpy as np
import math
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report

data = load_breast_cancer()
X_train = data.data
Y_train = np.where(data.target == 0, 1, -1)

def adaBoost(X_train,Y_train):
    classifiers = []
    # initializing the weights:
    N = len(Y_train)
    w_i = np.array([1 / N] * N)

    T = 20
    clf_errors = []

    for t in range(T):
        clf = tree.DecisionTreeClassifier(max_depth=1)
        clf.fit(X_train, Y_train, sample_weight = w_i)

        #Predict all the values:
        y_pred = clf.predict(X_train)   
        #print (confusion_matrix(Y_train, y_pred))

        # Line 2(b) of algorithm 
        error = np.sum(np.where(Y_train != y_pred, w_i, 0))/np.sum(w_i)
        print("Iteration: {0}, Missed: {1}".format(t, np.sum(np.where(Y_train != y_pred, 1, 0))))

        # Line 2(c) of algorithm 
        alpha = np.log((1-error)/ error)
        classifiers.append((alpha, clf))
        # Line 2(d) of algorithm 
        w_i = np.where(Y_train != y_pred, w_i*np.exp(alpha), w_i)
    return classifiers

clfs = adaBoost(X_train, Y_train)

# Line 3 of algorithm 
def predict(clfs, x):
    s = np.zeros(len(x))
    for (alpha, clf) in clfs:
        s += alpha*clf.predict(x)
    return np.sign(s)

print (confusion_matrix(Y_train, predict(clfs,X_train)))
print (classification_report(Y_train, predict(clfs,X_train)))

Output:

Iteration: 0, Missed: 44
Iteration: 1, Missed: 48
Iteration: 2, Missed: 182
Iteration: 3, Missed: 73
Iteration: 4, Missed: 102
Iteration: 5, Missed: 160
Iteration: 6, Missed: 185
Iteration: 7, Missed: 69
Iteration: 8, Missed: 357
Iteration: 9, Missed: 127
Iteration: 10, Missed: 256
Iteration: 11, Missed: 160
Iteration: 12, Missed: 298
Iteration: 13, Missed: 64
Iteration: 14, Missed: 221
Iteration: 15, Missed: 113
Iteration: 16, Missed: 261
Iteration: 17, Missed: 368
Iteration: 18, Missed: 49
Iteration: 19, Missed: 171
[[354   3]
 [  3 209]]

             precision    recall  f1-score   support

         -1       0.99      0.99      0.99       357
          1       0.99      0.99      0.99       212

avg / total       0.99      0.99      0.99       569

As you can see, the number of misses per iteration will not improve, but if you check the confusion matrix (uncomment it in the code) you will see that some of the previously misclassified samples are now classified correctly. Finally, for predictions, since we weight the classifiers by their error, the weighted sum converges to a strong classifier (as seen in the final predictions).
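
To see that convergence directly, here is a short sketch (same data and stump settings as above; the function name and per-round printout are illustrative, not from the original) that tracks the running ensemble's training error after each boosting round instead of only the individual stump's misses:

def adaboost_running_error(X, y, T=20):
    # Same boosting loop as above, but after each round the partial ensemble
    # sign(sum_t alpha_t * G_t(x)) is evaluated on the training set; that is
    # the error which actually decreases as rounds are added.
    N = len(y)
    w = np.full(N, 1 / N)
    scores = np.zeros(N)                                  # running weighted vote (line 3)
    for t in range(T):
        stump = tree.DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)            # line 2(b), assumes 0 < err < 1
        alpha = np.log((1 - err) / err)                   # line 2(c)
        w = np.where(pred != y, w * np.exp(alpha), w)     # line 2(d)
        scores += alpha * pred
        print("Round {0}: stump missed {1}, ensemble missed {2}".format(
            t, np.sum(pred != y), np.sum(np.sign(scores) != y)))

adaboost_running_error(X_train, Y_train)

With the breast cancer data loaded above, the per-stump misses should jump around much like the output shown, while the ensemble misses should trend down toward the handful of errors seen in the final confusion matrix.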


 