
Low training accuracy of a neural network with adult income dataset

I built a neural network with TensorFlow. It is a simple 3-layer neural network with the last layer being softmax.

I tried it on the standard adult income dataset (e.g. https://archive.ics.uci.edu/ml/datasets/adult) since it is publicly available, has a good amount of data (roughly 50k examples), and also provides separate test data.

As there are some categorical attributes, I converted them into one-hot encodings. For the neural network I used Xavier initialization and the Adam optimizer. As there are only two output classes (>50K and <=50K), the last softmax layer has only two neurons. After one-hot encoding, the 14 attributes/columns expanded into 108 columns.

I experimented with different numbers of neurons in the first two hidden layers (from 5 to 25). I also experimented with the number of iterations (from 1000 to 20000).

The training accuracy wasn't affected much by the number of neurons. It went up a little with more iterations. However, I could not do any better than 82% :(

Am I missing something basic in my approach? Has anyone tried this (a neural network on this dataset)? If so, what are the expected results? Could the low accuracy be due to missing values? (I am planning to try filtering out all the rows with missing values if there aren't many of them in the dataset.)
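
If I end up filtering, here is a rough sketch of what I have in mind (an assumption on my side: the UCI file marks unknowns with '?' preceded by a space, since the file is comma-space separated, so skipinitialspace is used to match them):

import numpy as np
import pandas as pd

# skipinitialspace strips the space after each comma, so unknowns read as plain '?'
df = pd.read_csv('adult.data', header=None, skipinitialspace=True)
df_clean = df.replace('?', np.nan).dropna()
print(df.shape, '->', df_clean.shape)  # rows containing at least one unknown get dropped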

Any other ideas? Here is my TensorFlow neural network code in case there are any bugs in it:

def create_placeholders(n_x, n_y):
    X = tf.placeholder(tf.float32, [n_x, None], name = "X")
    Y = tf.placeholder(tf.float32, [n_y, None], name = "Y")
    return X, Y

def initialize_parameters(num_features):
    tf.set_random_seed(1)                   # for reproducible parameter initialization
    layer_one_neurons = 5
    layer_two_neurons = 5
    layer_three_neurons = 2
    W1 = tf.get_variable("W1", [layer_one_neurons,num_features], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b1 = tf.get_variable("b1", [layer_one_neurons,1], initializer = tf.zeros_initializer())
    W2 = tf.get_variable("W2", [layer_two_neurons,layer_one_neurons], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b2 = tf.get_variable("b2", [layer_two_neurons,1], initializer = tf.zeros_initializer())
    W3 = tf.get_variable("W3", [layer_three_neurons,layer_two_neurons], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b3 = tf.get_variable("b3", [layer_three_neurons,1], initializer = tf.zeros_initializer())
    parameters = {"W1": W1,
                      "b1": b1,
                      "W2": W2,
                      "b2": b2,
                      "W3": W3,
                      "b3": b3}

    return parameters

def forward_propagation(X, parameters):
    """
    Implements the forward propagation for the model: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX

    Arguments:
    X -- input dataset placeholder, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
                  the shapes are given in initialize_parameters

    Returns:
    Z3 -- the output of the last LINEAR unit
    """

    # Retrieve the parameters from the dictionary "parameters" 
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    W3 = parameters['W3']
    b3 = parameters['b3']

    Z1 = tf.add(tf.matmul(W1, X), b1)                                           
    A1 = tf.nn.relu(Z1)                                             
    Z2 = tf.add(tf.matmul(W2, A1), b2)                                  
    A2 = tf.nn.relu(Z2)                                         
    Z3 = tf.add(tf.matmul(W3, A2), b3)

    return Z3

def compute_cost(Z3, Y):
    """
    Computes the cost

    Arguments:
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (2, number of examples)
    Y -- "true" labels vector placeholder, same shape as Z3

    Returns:
    cost - Tensor of the cost function
    """

    # to fit the tensorflow requirement for tf.nn.softmax_cross_entropy_with_logits(...,...)
    logits = tf.transpose(Z3)
    labels = tf.transpose(Y)

    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = labels))

    return cost

def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001, num_epochs = 1000, print_cost = True):
    """
    Implements a three-layer tensorflow neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SOFTMAX.

    Arguments:
    X_train -- training set, of shape (number of features, number of training examples)
    Y_train -- training labels, of shape (number of classes, number of training examples)
    X_test -- test set, of shape (number of features, number of test examples)
    Y_test -- test labels, of shape (number of classes, number of test examples)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    print_cost -- True to print the cost every 100 epochs

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """

    ops.reset_default_graph()                         # to be able to rerun the model without overwriting tf variables
    tf.set_random_seed(1)                             # to keep consistent results
    seed = 3                                          # to keep consistent results
    (n_x, m) = X_train.shape                          # (n_x: input size, m : number of examples in the train set)
    n_y = Y_train.shape[0]                            # n_y : output size
    costs = []                                        # To keep track of the cost

    # Create Placeholders of shape (n_x, n_y)
    X, Y = create_placeholders(n_x, n_y)

    # Initialize parameters
    parameters = initialize_parameters(X_train.shape[0])

    # Forward propagation: Build the forward propagation in the tensorflow graph
    Z3 = forward_propagation(X, parameters)

    # Cost function: Add cost function to tensorflow graph
    cost = compute_cost(Z3, Y)

    # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer.
    optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)

    # Initialize all the variables
    init = tf.global_variables_initializer()

    # Start the session to compute the tensorflow graph
    with tf.Session() as sess:

        # Run the initialization
        sess.run(init)

        # Do the training loop
        for epoch in range(num_epochs):
            _ , epoch_cost = sess.run([optimizer, cost], feed_dict={X: X_train, Y: Y_train})

            # Print the cost every epoch
            if print_cost == True and epoch % 100 == 0:
                print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
            if print_cost == True and epoch % 5 == 0:
                costs.append(epoch_cost)

        # plot the cost
        plt.plot(np.squeeze(costs))
        plt.ylabel('cost')
        plt.xlabel('epochs (per 5)')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()

        # lets save the parameters in a variable
        parameters = sess.run(parameters)
        print ("Parameters have been trained!")

        # Calculate the correct predictions
        correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y))

        # Calculate accuracy on the test set
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

        print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train}))
        #print ("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test}))

        return parameters

import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.python.framework import ops
import pandas as pd
%matplotlib inline
np.random.seed(1)

df = pd.read_csv('adult.data', header = None)                    # UCI file has no header row; column 14 is the income label
X_train_orig = df.drop(df.columns[[14]], axis=1, inplace=False)
Y_train_orig = df[[14]]
X_train = pd.get_dummies(X_train_orig)                           # one-hot encode the categorical columns (14 -> 108 columns)
Y_train = pd.get_dummies(Y_train_orig)                           # one-hot encode the label (2 columns)
parameters = model(X_train.T, Y_train.T, None, None, num_epochs = 10000)   # model expects (features, examples)

Any suggestions for other publicly available datasets for trying this out?

I tried standard algorithms from scikit-learn on this dataset with default parameters and I got the following accuracies (in %):

Random Forest:    86
SVM:              96
kNN:              83
MLP:              79
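
In case it is useful, a comparison along these lines can be reproduced with scikit-learn defaults roughly as follows (this is a sketch, not the exact notebook code; the 80/20 split and the use of get_dummies here are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

df = pd.read_csv('adult.data', header=None, skipinitialspace=True)
X = pd.get_dummies(df.drop(columns=[14]))   # one-hot encode the categorical columns
y = df[14]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

for clf in [RandomForestClassifier(), SVC(), KNeighborsClassifier(), MLPClassifier()]:
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, round(clf.score(X_te, y_te), 3))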

I have uploaded my iPython notebook for this at: https://github.com/sameermahajan/ClassifiersWithIncomeData/blob/master/Scikit%2BLearn%2BClassifiers.ipynb

The best accuracy is with SVM, which can be expected from the explanation at http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html. Interestingly, SVM also took a lot of time to run, way more than any other method.

Looking at the MLPClassifier accuracy above, this may not be a good problem for a neural network to solve. My neural network wasn't that bad after all! Thanks for all the responses and your interest in this.

I didn't experiment on this dataset, but after looking at some papers and doing some research, it looks like your network is doing OK.

First, is your accuracy calculated from the training set or the test set? Having both will give you a good hint of how your network is performing.

I'm still a bit new to machine learning but I can maybe give some help:

By looking at the data documentation link here: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names

And this paper: https://cseweb.ucsd.edu/classes/wi17/cse258-a/reports/a120.pdf

From those links, 85% accuracy on the training and test set looks like a good score; you are not too far off.

Do you have some kind of cross-validation to look for overfitting of your network?

I don't have your code, so I can't tell whether this is a bug or a programming-related issue; maybe sharing your code would be a good idea.

I think you would gain more accuracy by pre-processing your data a bit: there are a lot of unknowns inside your data, and neural networks are very sensitive to mislabeling and bad data.

  • You should try to find and replace or remove the unknowns.

  • You could also try to identify the most useful features and drop the ones that are nearly useless.

  • Feature scaling / data normalization can also be quite important for neural networks. I didn't look much into the data, but maybe you can try to figure out how to scale your data to the [0, 1] range if it's not done already (see the MinMaxScaler sketch after this list).

  • The document I linked seems to see an improvement in performance by adding layers, up to 5 layers. Did you try adding more layers?

  • You can also add dropout if your network overfits, if you didn't already (a small dropout sketch follows this list as well).

  • I would maybe try other models that are generally good for those tasks, like SVM (Support Vector Machine), Logistic Regression, or even Random Forest, but looking at the results I am not sure those will perform better than the artificial neural network.
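
For the feature scaling point, a minimal example with scikit-learn's MinMaxScaler, assuming X_train is the one-hot encoded DataFrame from the question (the dummy columns are already 0/1, so rescaling them is harmless):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                      # maps every column to the [0, 1] range
X_scaled = pd.DataFrame(scaler.fit_transform(X_train),
                        columns=X_train.columns, index=X_train.index)

And for the dropout point, in TF1-style graph code it would look roughly like this (the 0.8 keep probability is only an example; Z1 here is a stand-in for the first linear layer in your forward_propagation):

import tensorflow as tf

Z1 = tf.placeholder(tf.float32, [5, None])                 # stand-in for the first linear layer output
keep_prob = tf.placeholder_with_default(1.0, shape=())     # feed e.g. 0.8 during training, leave at 1.0 for eval
A1 = tf.nn.dropout(tf.nn.relu(Z1), keep_prob)              # randomly zeroes a fraction of activations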

I would also take a look at those links: https://www.kaggle.com/wenruliu/adult-income-dataset/feed

https://www.kaggle.com/wenruliu/income-prediction

In those links there are some people trying algorithms and giving tips on processing the data and tackling this subject.

Hope it helped.

Good luck, Marc.

I think you are focusing too much on your network structure and forgetting that your results also depend largely on data quality. I tried a quick off-the-shelf random forest and it gave me results similar to yours (acc = 0.8275238).

I suggest you do some feature engineering (the kaggle link provided by @Marc has some nice examples). Decide a strategy for your NA's (look here), group values when you have many factor levels in categorical variables (e.g. countries grouped into continents), or discretise continuous variables (the age variable into levels such as old, mid_aged, young).
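
To illustrate the grouping and discretisation ideas, a pandas sketch (the bin edges and the native-country grouping are made-up examples, not recommendations):

import pandas as pd

df = pd.read_csv('adult.data', header=None, skipinitialspace=True)

# discretise age (column 0) into coarse levels
df['age_group'] = pd.cut(df[0], bins=[0, 30, 55, 120], labels=['young', 'mid_aged', 'old'])

# collapse rare native-country values (column 13) into a single 'other' level
top_countries = df[13].value_counts().nlargest(5).index
df['country_group'] = df[13].where(df[13].isin(top_countries), 'other')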

Play with your data, study your dataset, and try to apply expertise to remove redundant or too-narrow information. Once this is done, start tweaking your model. Additionally, you can consider doing as I did: use ensemble models (which are usually fast and pretty accurate with the default values), like RF or XGB, to check whether the results are consistent across all your models. Once you are sure you are on the right track, you can start tweaking the structure, layers, etc. and see if you can push your results even further.
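
For that consistency check, a quick cross-validated random forest baseline is enough (sketch; 5 folds and 100 trees are arbitrary choices, and X / y are the one-hot encoded features and the raw income label):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv('adult.data', header=None, skipinitialspace=True)
X = pd.get_dummies(df.drop(columns=[14]))
y = df[14]

scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=1), X, y, cv=5)
print('fold accuracies:', scores, 'mean:', scores.mean())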

Hope this helps.

Good luck!
