Appropriate Deep Learning Structure for multi-class classification
I have the following data:
feat_1 feat_2 ... feat_n label
gene_1 100.33 10.2 ... 90.23 great
gene_2 13.32 87.9 ... 77.18 soso
....
gene_m 213.32 63.2 ... 12.23 quitegood
The number of rows, M, is large (~30K), and the number of columns, N, is much smaller (~10). My question is: what is the appropriate deep learning structure to learn from and test on data like the above?
At the end of the day, the user will supply a vector of genes with expression values:
gene_1 989.00
gene_2 77.10
...
gene_N 100.10
The system will then assign a label to each gene, e.g. great or soso.
By structure I mean one of these:
To expand a little on @sung-kim's comment:
As with all modelling problems, though, I would suggest starting from the most basic model to look for signal. A good place to start, before you worry about deep learning, is perhaps logistic regression.
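A minimal sketch of such a baseline, assuming scikit-learn is available. The data here is synthetic (random features with a class-dependent shift) and only stands in for the real gene table; the label names are taken from the question, everything else is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real table: 3000 rows, 10 features, 3 labels.
rng = np.random.RandomState(0)
label_names = np.array(['great', 'soso', 'quitegood'])
y = rng.randint(0, 3, size=3000)
# Shift every feature by an amount that depends on the class, so there is signal to find.
X = rng.normal(size=(3000, 10)) + y[:, None] * 1.5

# Hold out 30% of the rows as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Standardise the features, then fit a multi-class logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

val_accuracy = baseline.score(X_val, y_val)
print('validation accuracy: {0:.3f}'.format(val_accuracy))

# Map the predicted class indices back to the label strings.
predicted = label_names[baseline.predict(X_val[:5])]
print(predicted)
```

If this baseline already scores well on the validation set, a neural network has to beat that number to justify the extra complexity.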
If you have got to the point where you want to try deep learning, for whatever reason, then for this type of data a basic feed-forward network is the best place to start. In terms of deep learning, 30k data points is not a large number, so it is always best to start out with a small network (1-3 hidden layers, 5-10 neurons) and then get bigger. Make sure you have a decent validation set when performing parameter optimisation, though. If you're a fan of the scikit-learn API, I suggest that Keras is a good place to start.
One further comment: you will want to use a OneHotEncoder on your class labels before you do any training.
EDIT
I see from the bounty and the comments that you want to see a bit more about how these networks work. Please see the example below of how to build a feed-forward model and do some simple parameter optimisation:
# Written for Python 3 and Keras 2 or later (`epochs` replaced the older `nb_epoch` argument)
import numpy as np
from sklearn import preprocessing
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

# Create some random data: 10 samples with 50 features each
np.random.seed(42)
X = np.random.random((10, 50))

# Similar labels (10 in total, one per sample)
labels = ['good', 'bad', 'soso', 'amazeballs', 'good']
labels += labels
labels = np.array(labels)
np.random.shuffle(labels)

# Change the labels to the required format: strings -> integers -> one-hot rows
numericalLabels = preprocessing.LabelEncoder().fit_transform(labels)
numericalLabels = numericalLabels.reshape(-1, 1)
# .toarray() converts the encoder's default sparse output to a dense array
y = preprocessing.OneHotEncoder().fit_transform(numericalLabels).toarray()

# Simple Keras model builder
def buildModel(nFeatures, nClasses, nLayers=3, nNeurons=10, dropout=0.2):
    model = Sequential()
    # First hidden layer, which also fixes the input size
    model.add(Dense(nNeurons, input_dim=nFeatures))
    model.add(Activation('sigmoid'))
    model.add(Dropout(dropout))
    # Remaining hidden layers
    for i in range(nLayers - 1):
        model.add(Dense(nNeurons))
        model.add(Activation('sigmoid'))
        model.add(Dropout(dropout))
    # Output layer: one neuron per class, softmax to get class probabilities
    model.add(Dense(nClasses))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='sgd')
    return model

# Do an exhaustive search over a given parameter space
for nLayers in range(2, 4):
    for nNeurons in range(5, 8):
        model = buildModel(X.shape[1], y.shape[1], nLayers, nNeurons)
        modelHist = model.fit(X, y, batch_size=32, epochs=10,
                              validation_split=0.3, shuffle=True, verbose=0)
        # Report the epoch with the lowest validation loss
        minLoss = min(modelHist.history['val_loss'])
        epochNum = modelHist.history['val_loss'].index(minLoss)
        print('{0} layers, {1} neurons best validation at'.format(nLayers, nNeurons),
              'epoch {0} loss = {1:.2f}'.format(epochNum, minLoss))
Which outputs (exact values will vary from run to run, since the network weights are randomly initialised):
2 layers, 5 neurons best validation at epoch 0 loss = 1.18
2 layers, 6 neurons best validation at epoch 0 loss = 1.21
2 layers, 7 neurons best validation at epoch 8 loss = 1.49
3 layers, 5 neurons best validation at epoch 9 loss = 1.83
3 layers, 6 neurons best validation at epoch 9 loss = 1.91
3 layers, 7 neurons best validation at epoch 9 loss = 1.65
A deep learning structure would be recommended if you were dealing with raw data and wanted to automatically find features that work towards your classification goal. But based on the names of your columns and their number (only 10), it seems that your features are already engineered.
For this reason you could just go with a standard multi-layer neural network and use supervised learning (back-propagation). Such a network would have a number of inputs matching the number of your columns (10), followed by a number of hidden layers, and then an output layer with the number of neurons matching the number of your labels. You could experiment with different numbers of hidden layers and neurons, different neuron types (sigmoid, tanh, rectified linear, etc.) and so on.
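As a rough sketch of that kind of experiment, the loop below tries three activation types and two hidden-layer layouts. It uses scikit-learn's MLPClassifier purely for convenience (the answer does not name a library), and the data is synthetic with 10 input columns and 3 labels as an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 10 input columns, 3 possible labels.
rng = np.random.RandomState(1)
y = rng.randint(0, 3, size=1500)
X = rng.normal(size=(1500, 10)) + y[:, None]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

results = {}
# 'logistic' is the sigmoid, 'relu' the rectified linear unit.
for activation in ('logistic', 'tanh', 'relu'):
    for hidden in ((8,), (8, 8)):
        net = MLPClassifier(hidden_layer_sizes=hidden, activation=activation,
                            max_iter=500, random_state=1)
        net.fit(X_train, y_train)
        results[(activation, hidden)] = net.score(X_val, y_val)

# List the configurations from best to worst validation accuracy.
for (activation, hidden), acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(activation, hidden, round(acc, 3))
```

The input and output layer sizes are inferred from the data here; only the hidden layers and the activation are varied.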
Alternatively, you could use the raw data (if it's available) and then go with DBNs (they're known to be robust and to achieve good results across different problems) or auto-encoders.
If you expect the output to be thought of as scores for each label (as I understood from your question), try a supervised multi-class logistic regression classifier (the highest score takes the label).
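To make "the highest score takes the label" concrete, here is a small NumPy sketch of the scoring step only. The weights W and b are random placeholders rather than trained values, and the label names are borrowed from the question.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

label_names = np.array(['great', 'soso', 'quitegood'])

# Placeholder parameters: 10 features -> 3 classes (a real model would learn these).
rng = np.random.RandomState(0)
W = rng.normal(size=(10, 3))
b = np.zeros(3)

# Four new rows to classify.
X_new = rng.normal(size=(4, 10))

# One score per label for each row; every row of scores sums to 1.
scores = softmax(X_new @ W + b)

# The label with the highest score wins.
predicted = label_names[scores.argmax(axis=1)]
print(scores.round(3))
print(predicted)
```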
If you're bound to use deep learning, a simple feed-forward ANN should do, with supervised learning through back-propagation. Use an input layer with N neurons and add one or two hidden layers, not more than that. There is no need to go 'deep' and add more layers for this data: with more layers there is a risk of overfitting the data easily, and if that happens it can be tricky to figure out what the problem is, and the test accuracy will be affected greatly.
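One way to watch for this is to compare training and validation accuracy as layers are added: a growing gap between the two suggests overfitting. The sketch below assumes scikit-learn's MLPClassifier and a small, noisy synthetic dataset; the layer sizes are arbitrary choices for illustration, and the size of the gap will depend on the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small, noisy synthetic dataset: weak class signal in 10 features.
rng = np.random.RandomState(2)
y = rng.randint(0, 3, size=600)
X = rng.normal(size=(600, 10)) + 0.5 * y[:, None]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=2)

gaps = {}
# From one small hidden layer up to four wide ones.
for hidden in ((10,), (10, 10), (50, 50, 50, 50)):
    net = MLPClassifier(hidden_layer_sizes=hidden, max_iter=2000, random_state=2)
    net.fit(X_train, y_train)
    train_acc = net.score(X_train, y_train)
    val_acc = net.score(X_val, y_val)
    # Training accuracy far above validation accuracy is a sign of overfitting.
    gaps[hidden] = train_acc - val_acc
    print(hidden, 'train={0:.3f} val={1:.3f}'.format(train_acc, val_acc))
```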
Simply plotting or visualizing the data, e.g. with t-SNE, can be a good start if you need to figure out which features are important (or any correlation that may exist). You can then play with higher powers of those feature dimensions, or give them increased weight in the score.
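Both steps can be sketched with scikit-learn, assuming synthetic data in place of the real table. TSNE produces a 2-D embedding that can be scatter-plotted and coloured by label, and PolynomialFeatures is one possible way to add higher powers of the features (the answer does not prescribe a particular tool).

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: 300 rows, 10 features, 3 labels.
rng = np.random.RandomState(3)
y = rng.randint(0, 3, size=300)
X = rng.normal(size=(300, 10)) + y[:, None]

X_scaled = StandardScaler().fit_transform(X)

# Project the 10-dimensional rows down to 2 dimensions for plotting.
embedding = TSNE(n_components=2, perplexity=30, random_state=3).fit_transform(X_scaled)
# embedding[:, 0] and embedding[:, 1] can now be scatter-plotted, coloured by y.

# Add squares and pairwise products of the original features.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_scaled)

print(embedding.shape, X_poly.shape)
```

With 10 input features, the degree-2 expansion yields 65 columns (10 original, 10 squares, 45 pairwise products), which can then be fed to the same classifier.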
For problems like this, deep learning probably isn't very well suited, but a simpler ANN architecture like this should work well, depending on the data.