
Appropriate Deep Learning Structure for multi-class classification

I have the following data:

         feat_1    feat_2 ... feat_n   label
gene_1   100.33     10.2  ... 90.23    great
gene_2   13.32      87.9  ... 77.18    soso
....
gene_m   213.32     63.2  ... 12.23    quitegood

M is large, ~30K rows, and N is much smaller, ~10 columns. My question is: what is an appropriate deep learning structure to learn and test data like the above?

At the end of the day, the user will give a vector of genes with their expression values:

gene_1   989.00
gene_2   77.10
...
gene_N   100.10

And the system will decide which label applies to each gene, e.g. great, soso, etc...

By structure I mean one of these:

  • Convolutional Neural Network (CNN)
  • Autoencoder
  • Deep Belief Network (DBN)
  • Restricted Boltzmann Machine (RBM)

To expand a little on @sung-kim's comment:

  • CNNs are used primarily for problems in computer imaging, such as classifying images. They are modelled on the animal visual cortex: they basically have a connection network such that there are tiles of features with some overlap. Typically they require a lot of data, more than 30k examples.
  • Autoencoders are used for feature generation and dimensionality reduction. They start with lots of neurons on each layer, then this number is reduced, and then increased again. Each object is trained on itself. This results in the middle layers (low number of neurons) providing a meaningful projection of the feature space in a low dimension; a minimal sketch follows this list.
  • While I don't know much about DBNs, they appear to be a supervised extension of the autoencoder, with lots of parameters to train.
  • Again, I don't know much about Boltzmann machines, but they aren't widely used for this sort of problem (to my knowledge).
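
To illustrate the autoencoder idea above, here is a minimal sketch in Keras on made-up data (layer sizes and training settings are placeholders, not something tuned for the gene-expression problem):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Made-up data: 1000 samples with 10 features, purely for illustration
X = np.random.random((1000, 10))

# Encoder squeezes 10 features down to 3, decoder expands back to 10;
# the network is trained to reproduce its own input
autoencoder = Sequential([
    Dense(8, activation='relu', input_dim=10),
    Dense(3, activation='relu'),    # bottleneck: low-dimensional projection
    Dense(8, activation='relu'),
    Dense(10, activation='linear'),
])
autoencoder.compile(loss='mse', optimizer='adam')
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)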

As with all modelling problems though, I would suggest starting from the most basic model to look for signal. Perhaps a good place to start is logistic regression, before you worry about deep learning.
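
For instance, a quick logistic regression baseline with scikit-learn might look like this (a sketch only; the random X and y are just stand-ins for the real expression matrix and labels):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for the real data: m rows (genes), n columns (features), string labels
rng = np.random.RandomState(0)
X = rng.random_sample((30000, 10))
y = rng.choice(['great', 'soso', 'quitegood'], size=30000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000)   # handles multi-class out of the box
clf.fit(X_train, y_train)
print('baseline accuracy: {0:.2f}'.format(clf.score(X_test, y_test)))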

If you have got to the point where you want to try deep learning, for whatever reason, then for this type of data a basic feed-forward network is the best place to start. In terms of deep learning, 30k data points is not a large number, so it is always best to start out with a small network (1-3 hidden layers, 5-10 neurons) and then get bigger. Make sure you have a decent validation set when performing parameter optimisation though. If you're a fan of the scikit-learn API, I suggest that Keras is a good place to start.

One further comment: you will want to use a OneHotEncoder on your class labels before you do any training.

EDIT

I see from the bounty and the comments that you want to see a bit more about how these networks work. Please see the example below of how to build a feed-forward model and do some simple parameter optimisation.

import numpy as np
from sklearn import preprocessing
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

# Create some random data
np.random.seed(42)
X = np.random.random((10, 50))

# Similar labels
labels = ['good', 'bad', 'soso', 'amazeballs', 'good']
labels += labels
labels = np.array(labels)
np.random.shuffle(labels)

# Change the labels to the required format
numericalLabels = preprocessing.LabelEncoder().fit_transform(labels)
numericalLabels = numericalLabels.reshape(-1, 1)
y = preprocessing.OneHotEncoder(sparse_output=False).fit_transform(numericalLabels)  # sparse=False on older scikit-learn

# Simple Keras model builder
def buildModel(nFeatures, nClasses, nLayers=3, nNeurons=10, dropout=0.2):
    model = Sequential()
    model.add(Dense(nNeurons, input_dim=nFeatures))
    model.add(Activation('sigmoid'))
    model.add(Dropout(dropout))
    for i in range(nLayers - 1):  # remaining hidden layers
        model.add(Dense(nNeurons))
        model.add(Activation('sigmoid'))
        model.add(Dropout(dropout))
    model.add(Dense(nClasses))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='sgd')

    return model

# Do an exhaustive search over a given parameter space
for nLayers in range(2, 4):
    for nNeurons in range(5, 8):
        model = buildModel(X.shape[1], y.shape[1], nLayers, nNeurons)
        modelHist = model.fit(X, y, batch_size=32, epochs=10,
                              validation_split=0.3, shuffle=True, verbose=0)
        minLoss = min(modelHist.history['val_loss'])
        epochNum = modelHist.history['val_loss'].index(minLoss)
        print('{0} layers, {1} neurons best validation at epoch {2} loss = {3:.2f}'.format(
            nLayers, nNeurons, epochNum, minLoss))

Which outputs:

2 layers, 5 neurons best validation at epoch 0 loss = 1.18
2 layers, 6 neurons best validation at epoch 0 loss = 1.21
2 layers, 7 neurons best validation at epoch 8 loss = 1.49
3 layers, 5 neurons best validation at epoch 9 loss = 1.83
3 layers, 6 neurons best validation at epoch 9 loss = 1.91
3 layers, 7 neurons best validation at epoch 9 loss = 1.65

A deep learning structure would be recommended if you were dealing with raw data and wanted to find features that work towards your classification goal automatically. But based on the names of your columns and their number (only 10), it seems that you have your features already engineered.

For this reason you could just go with a standard multi-layer neural network and use supervised learning (back-propagation). Such a network would have a number of inputs matching the number of your columns (10), followed by a number of hidden layers, and then an output layer with the number of neurons matching the number of your labels. You could experiment with different numbers of hidden layers and neurons, different neuron types (sigmoid, tanh, rectified linear, etc.), and so on.
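
A minimal sketch of such a network, here with scikit-learn's MLPClassifier on made-up data (the hidden-layer sizes are only placeholders):

import numpy as np
from sklearn.neural_network import MLPClassifier

# Made-up data: 10 input features, 3 possible labels
rng = np.random.RandomState(0)
X = rng.random_sample((1000, 10))
y = rng.choice(['great', 'soso', 'quitegood'], size=1000)

# Two hidden layers; the output layer is sized to the number of labels automatically
clf = MLPClassifier(hidden_layer_sizes=(20, 10), activation='tanh',
                    max_iter=500, random_state=0)
clf.fit(X, y)              # trained with back-propagation
print(clf.predict(X[:5]))  # predicted labels for the first few rows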

Alternatively you could use the raw data (if it's available) and then go with DBNs (they're known to be robust and achieve good results across different problems) or autoencoders.

If you expect the output to be thought of as scores for a label (as I understood from your question), try a supervised multi-class logistic regression classifier (the highest score takes the label).
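
For instance, scikit-learn's logistic regression exposes one score per label, and you take the label with the highest score (a sketch on made-up data):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)
X = rng.random_sample((500, 10))
y = rng.choice(['great', 'soso', 'quitegood'], size=500)

clf = LogisticRegression(max_iter=1000).fit(X, y)

scores = clf.predict_proba(X[:3])               # one score per label for each row
best = clf.classes_[np.argmax(scores, axis=1)]  # the highest score takes the label
print(scores)
print(best)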

If you're bound to use deep learning:

A simple feed-forward ANN should do, with supervised learning through back-propagation. Use an input layer with N neurons, and one or two hidden layers can be added, not more than that. There is no need to go 'deep' and add more layers for this data; there is a risk of easily overfitting the data with more layers. If you do so, it can be tricky to figure out what the problem is, and the test accuracy will be affected greatly.
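
A minimal sketch of that kind of network in Keras (made-up data; the layer size and training settings are placeholders):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

n_features, n_classes = 10, 3

# Made-up data; y is already one-hot encoded here
X = np.random.random((1000, n_features))
y = np.eye(n_classes)[np.random.randint(0, n_classes, 1000)]

# Input layer with N neurons, a single hidden layer, softmax output
model = Sequential([
    Dense(16, activation='tanh', input_dim=n_features),
    Dense(n_classes, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit(X, y, epochs=20, batch_size=32, validation_split=0.2, verbose=0)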

Simply plotting or visualizing the data, e.g. with t-SNE, can be a good start if you need to figure out which features are important (or any correlation that may exist).
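
For example, a quick t-SNE projection with scikit-learn and matplotlib (a sketch on made-up data; the colours stand in for the labels):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.random_sample((500, 10))
labels = rng.choice([0, 1, 2], size=500)   # numeric stand-ins for 'great', 'soso', ...

# Project the 10-dimensional features down to 2D for plotting
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis', s=10)
plt.title('t-SNE projection of the expression features')
plt.show()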

You can then play with higher powers of those feature dimensions, or add increased weight to their scores.
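
One simple way to do that is to append squared (or higher-order) versions of the features, e.g. with scikit-learn's PolynomialFeatures (a sketch on made-up data):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.random((100, 10))

# Adds squared terms and pairwise products of the original 10 features
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X.shape, '->', X_poly.shape)   # (100, 10) -> (100, 65)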

For problems like this, deep learning probably isn't very well suited, but a simpler ANN architecture like this should work well, depending on the data.
