
Improving accuracy of multinomial logistic regression model built from scratch

I am currently working on creating a multi-class classifier using numpy, and I finally got a working model using softmax, as follows:

import numpy as np


class MultinomialLogReg:
    def fit(self, X, y, lr=0.00001, epochs=1000):
        self.X = self.norm_x(np.insert(X, 0, 1, axis=1))
        self.y = y
        self.classes = np.unique(y)
        self.theta = np.zeros((len(self.classes), self.X.shape[1]))
        self.o_h_y = self.one_hot(y)
        
        for e in range(epochs):
            preds = self.probs(self.X)

            l, grad = self.get_loss(self.theta, self.X, self.o_h_y, preds)
            
            if e%10000 == 0:
                print("epoch: ", e, "loss: ", l)
            
            self.theta -= (lr*grad)
        
        return self
    
    def norm_x(self, X):
        # min-max scale each row of X into [0, 1]
        for i in range(X.shape[0]):
            mn = np.amin(X[i])
            mx = np.amax(X[i])
            X[i] = (X[i] - mn)/(mx-mn)
        return X
    
    def one_hot(self, y):
        Y = np.zeros((y.shape[0], len(self.classes)))
        for i in range(Y.shape[0]):
            to_put = [0]*len(self.classes)
            to_put[y[i]] = 1
            Y[i] = to_put
        return Y
    
    def probs(self, X):
        return self.softmax(np.dot(X, self.theta.T))
    
    def get_loss(self, w, x, y, preds):
        m = x.shape[0]
        
        # element-wise binary cross-entropy over the one-hot matrix, averaged over samples
        loss = (-1 / m) * np.sum(y * np.log(preds) + (1-y) * np.log(1-preds))
        
        grad = (1 / m) * (np.dot((preds - y).T, x)) #And compute the gradient for that loss
        
        return loss,grad

    def softmax(self, z):
        return np.exp(z) / np.sum(np.exp(z), axis=1).reshape(-1,1)
    
    def predict(self, X):
        X = np.insert(X, 0, 1, axis=1)
        return np.argmax(self.probs(X), axis=1)
        #return np.vectorize(lambda i: self.classes[i])(np.argmax(self.probs(X), axis=1))
        
    def score(self, X, y):
        return np.mean(self.predict(X) == y)
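
A minimal sketch of how the class is meant to be used (the data here is synthetic, purely for illustration, not my real dataset):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))          # 300 samples, 4 features
y = rng.integers(0, 3, size=300)       # 3 integer class labels (0, 1, 2)

split = 240
model = MultinomialLogReg().fit(X[:split], y[:split], lr=0.1, epochs=1000)
print("train accuracy:", model.score(X[:split], y[:split]))
print("test accuracy:", model.score(X[split:], y[split:]))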

And I had several questions:

  1. Is this a correct multinomial logistic regression implementation?

  2. It takes 100,000 epochs with a learning rate of 0.1 for the loss to drop to somewhere between 0.5 and 1, and to get an accuracy of 70-90% on the test set. Would this be considered bad performance?

  3. What are some ways to improve performance or speed up training (so it needs fewer epochs)?

  4. I saw this cost function online which gives better accuracy. It looks like cross-entropy, but it is different from the cross-entropy equations I have seen. Can someone explain how the two differ:

error = preds - self.o_h_y
grad = np.dot(error.T, self.X)
self.theta -= (lr*grad)
  1. This looks right, but I think the preprocessing you perform in the fit function should be done outside of the model.
  2. It's hard to know whether this is good or bad. While the loss landscape is convex, the time it takes to reach a minimum varies between problems. One way to be sure you've reached the optimal solution is to add a threshold that tests the size of the gradient norm, which is small when you're close to the optimum. Something like np.linalg.norm(grad) < 1e-8 (see the first sketch after this list).
  3. You can use a better optimizer, such as Newton's method, or a quasi-Newton method such as LBFGS. I would start with Newton's method as it's easier to implement. LBFGS is a non-trivial algorithm that approximates the Hessian required to perform Newton's method (a sketch using an off-the-shelf LBFGS routine also follows this list).
  4. It's the same; the gradients just aren't being averaged. Since you're performing gradient descent, the 1/m averaging factor is a constant that can be folded into the learning rate, which has to be tuned anyway (dividing the gradient by m is equivalent to dividing the learning rate by m). In general, I think averaging makes it a bit easier to find a stable learning rate across different splits of the same dataset.
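
To make point 2 concrete, here is a rough sketch of a convergence check based on the gradient norm (the function name, tolerance, and learning rate are just examples):

import numpy as np

def fit_until_converged(model, lr=0.1, max_epochs=100_000, tol=1e-8):
    # Sketch: assumes model.fit(...) has been called once so that model.X,
    # model.o_h_y and model.theta already exist; keeps taking gradient steps
    # until the gradient norm drops below tol (an example tolerance).
    for e in range(max_epochs):
        preds = model.probs(model.X)
        loss, grad = model.get_loss(model.theta, model.X, model.o_h_y, preds)
        if np.linalg.norm(grad) < tol:          # close to a stationary point
            print("converged after", e, "extra epochs, loss:", loss)
            break
        model.theta -= lr * grad
    return model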
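
And for point 3, a sketch of what handing the same objective to an off-the-shelf quasi-Newton optimizer could look like, here scipy.optimize.minimize with L-BFGS-B on the averaged softmax cross-entropy (the function name and setup are my own, and it assumes integer labels 0..n_classes-1):

import numpy as np
from scipy.optimize import minimize

def fit_lbfgs(X, y, n_classes):
    Xb = np.insert(X, 0, 1, axis=1)                  # add a bias column
    Y = np.eye(n_classes)[y]                         # one-hot targets
    m, d = Xb.shape

    def loss_and_grad(w_flat):
        W = w_flat.reshape(n_classes, d)
        z = Xb @ W.T
        z -= z.max(axis=1, keepdims=True)            # numerical stability
        P = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        loss = -np.sum(Y * np.log(P + 1e-12)) / m    # averaged cross-entropy
        grad = (P - Y).T @ Xb / m                    # its gradient w.r.t. W
        return loss, grad.ravel()

    res = minimize(loss_and_grad, np.zeros(n_classes * d),
                   jac=True, method="L-BFGS-B")
    return res.x.reshape(n_classes, d)               # fitted weight matrix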

A question for you: when you evaluate your test set, are you preprocessing it the same way you preprocess the training set in your fit function?
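
If not, one common fix is to compute the scaling statistics on the training set only and apply the same transform to both splits, with the scaling moved outside of fit as suggested in point 1. A rough sketch (the names are illustrative, and this is per-feature min-max scaling, whereas your norm_x loops over rows):

import numpy as np

def fit_minmax(X_train):
    # per-feature minimum and maximum, computed on the training set only
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_minmax(X, mn, mx):
    # scale with the training-set statistics; the epsilon avoids divide-by-zero
    return (X - mn) / (mx - mn + 1e-12)

# mn, mx = fit_minmax(X_train)
# model = MultinomialLogReg().fit(apply_minmax(X_train, mn, mx), y_train)
# print(model.score(apply_minmax(X_test, mn, mx), y_test))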
