
Using Deep Learning to Predict Subsequence from Sequence

I have data that looks like this:

[image: sample rows of the antigen/epitope table]

It can be viewed here and is loaded in the code below. In actuality I have ~7000 samples (rows), downloadable too.

The task: given an antigen, predict the corresponding epitope. The epitope is always an exact substring of the antigen, so this is equivalent to Sequence to Sequence Learning. Here is my code for a recurrent neural network running under Keras; it was modeled on the example.

My questions are:

  1. Can an RNN, LSTM or GRU be used to predict a subsequence as posed above?
  2. How can I improve the accuracy of my code?
  3. How can I modify my code so that it runs faster?

Here is my running code, which gives a very bad accuracy score.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
import json
import pandas as pd
from keras.models import Sequential
from keras.engine.training import slice_X
from keras.layers.core import Activation,  RepeatVector, Dense
from keras.layers import recurrent, TimeDistributed
import numpy as np
from six.moves import range

class CharacterTable(object):
    '''
    Given a set of characters:
    + Encode them to a one hot integer representation
    + Decode the one hot integer representation to their character output
    + Decode a vector of probabilities to their character output
    '''
    def __init__(self, chars, maxlen):
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))
        self.maxlen = maxlen

    def encode(self, C, maxlen=None):
        maxlen = maxlen if maxlen else self.maxlen
        X = np.zeros((maxlen, len(self.chars)))
        for i, c in enumerate(C):
            X[i, self.char_indices[c]] = 1
        return X

    def decode(self, X, calc_argmax=True):
        if calc_argmax:
            X = X.argmax(axis=-1)
        return ''.join(self.indices_char[x] for x in X)

class colors:
    ok = '\033[92m'
    fail = '\033[91m'
    close = '\033[0m'

INVERT = True
HIDDEN_SIZE = 128
BATCH_SIZE = 64
LAYERS = 3
# Try replacing GRU, or SimpleRNN
RNN = recurrent.LSTM


def main():
    """
    Epitope_core = answers
    Antigen      = questions
    """

    epi_antigen_df = pd.io.parsers.read_table("http://dpaste.com/2PZ9WH6.txt")
    antigens = epi_antigen_df["Antigen"].tolist()
    epitopes = epi_antigen_df["Epitope Core"].tolist()

    if INVERT:
        antigens = [ x[::-1] for x in antigens]

    allchars = "".join(antigens+epitopes)
    allchars = list(set(allchars))
    aa_chars =  "".join(allchars)
    sys.stderr.write(aa_chars + "\n")

    max_antigen_len = len(max(antigens, key=len))
    max_epitope_len = len(max(epitopes, key=len))

    X = np.zeros((len(antigens),max_antigen_len, len(aa_chars)),dtype=np.bool)
    y = np.zeros((len(epitopes),max_epitope_len, len(aa_chars)),dtype=np.bool)

    ctable = CharacterTable(aa_chars, max_antigen_len)

    sys.stderr.write("Begin vectorization\n")
    for i, antigen in enumerate(antigens):
        X[i] = ctable.encode(antigen, maxlen=max_antigen_len)
    for i, epitope in enumerate(epitopes):
        y[i] = ctable.encode(epitope, maxlen=max_epitope_len)


    # Shuffle (X, y) in unison so the train/validation split is random
    indices = np.arange(len(y))
    np.random.shuffle(indices)
    X = X[indices]
    y = y[indices]

    # Explicitly set apart 10% for validation data that we never train over
    split_at = len(X) - len(X) // 10  # integer division keeps the index an int
    (X_train, X_val) = (slice_X(X, 0, split_at), slice_X(X, split_at))
    (y_train, y_val) = (y[:split_at], y[split_at:])

    sys.stderr.write("Build model\n")
    model = Sequential()
    # "Encode" the input sequence using an RNN, producing an output of HIDDEN_SIZE
    # note: in a situation where your input sequences have a variable length,
    # use input_shape=(None, nb_feature).
    model.add(RNN(HIDDEN_SIZE, input_shape=(max_antigen_len, len(aa_chars))))
    # For the decoder's input, we repeat the encoded input for each time step
    model.add(RepeatVector(max_epitope_len))
    # The decoder RNN could be multiple layers stacked or a single layer
    for _ in range(LAYERS):
        model.add(RNN(HIDDEN_SIZE, return_sequences=True))

    # For each step of the output sequence, decide which character should be chosen
    model.add(TimeDistributed(Dense(len(aa_chars))))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])

    # Train the model each generation and show predictions against the validation dataset
    for iteration in range(1, 200):
        print()
        print('-' * 50)
        print('Iteration', iteration)
        model.fit(X_train, y_train, batch_size=BATCH_SIZE, nb_epoch=5,
                validation_data=(X_val, y_val))
        ###
        # Select 10 samples from the validation set at random so we can visualize errors
        for i in range(10):
            ind = np.random.randint(0, len(X_val))
            rowX, rowy = X_val[np.array([ind])], y_val[np.array([ind])]
            preds = model.predict_classes(rowX, verbose=0)
            q = ctable.decode(rowX[0])
            correct = ctable.decode(rowy[0])
            guess = ctable.decode(preds[0], calc_argmax=False)
            # print('Q', q[::-1] if INVERT else q)
            print('T', correct)
            print(colors.ok + '☑' + colors.close if correct == guess else colors.fail + '☒' + colors.close, guess)
            print('---')

if __name__ == '__main__':
    main()
  1. Can an RNN, LSTM or GRU be used to predict a subsequence as posed above?

Yes, you can use any of these. LSTMs and GRUs are types of RNNs; if by RNN you mean a fully-connected (vanilla) RNN, those have fallen out of favor because of the vanishing gradient problem (1, 2). Because of the relatively small number of examples in your dataset, a GRU might be preferable to an LSTM due to its simpler architecture.
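In the code above, this is a one-line change:

# Use a GRU instead of an LSTM; the rest of the model is unchanged.
RNN = recurrent.GRU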

  2. How can I improve the accuracy of my code?

You mentioned that training and validation error are both bad. In general, this could be due to one of several factors:

  • The learning rate is too low (not an issue, since you're using Adam, a per-parameter adaptive learning rate algorithm)
  • The model is too simple for the data (not the issue at all, since you have a very complex model and a small dataset)
  • You have vanishing gradients (probably the issue, since you have a 3-layer RNN). Try reducing the number of layers to 1 (in general, it's good to start by getting a simple model working and then increase the complexity), and also consider a hyperparameter search (e.g. a 128-dimensional hidden state may be too large; try 30?); see the sketch after this list.
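As a concrete starting point, here is a minimal single-layer variant of the question's model (a sketch reusing the question's old-style Keras imports; the 30-unit hidden state is an illustrative guess, not a tuned value, and max_antigen_len, max_epitope_len and aa_chars are assumed to be defined as in the question):

# A deliberately small baseline: one encoder layer, one decoder layer,
# and a much smaller hidden state.
SMALL_HIDDEN = 30
model = Sequential()
model.add(recurrent.GRU(SMALL_HIDDEN, input_shape=(max_antigen_len, len(aa_chars))))
model.add(RepeatVector(max_epitope_len))
model.add(recurrent.GRU(SMALL_HIDDEN, return_sequences=True))
model.add(TimeDistributed(Dense(len(aa_chars))))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Once this trains stably, layers and hidden units can be added back one change at a time.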

Another option, since your epitope is a substring of your input, is to predict the start and end indices of the epitope within the antigen sequence (potentially normalized by the length of the antigen sequence) instead of predicting the substring one character at a time. This would be a regression problem with two tasks. For instance, if the antigen is FSKIAGLTVT (10 letters long) and its epitope is KIAGL (positions 3 to 7, one-based), then the input would be FSKIAGLTVT and the outputs would be 0.3 (first task) and 0.7 (second task).
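A minimal sketch of how those regression targets could be built, matching the FSKIAGLTVT/KIAGL example above (the function name is illustrative):

def start_end_targets(antigen, epitope):
    # str.index is 0-based; the example in the text uses one-based positions.
    start = antigen.index(epitope) + 1
    end = start + len(epitope) - 1
    return start / float(len(antigen)), end / float(len(antigen))

print(start_end_targets("FSKIAGLTVT", "KIAGL"))  # -> (0.3, 0.7)

The model head then becomes a two-unit regression output instead of a per-character softmax, e.g. (reusing SMALL_HIDDEN and the imports from the sketch above):

model = Sequential()
model.add(recurrent.GRU(SMALL_HIDDEN, input_shape=(max_antigen_len, len(aa_chars))))
model.add(Dense(2))               # normalized start and end positions
model.add(Activation('sigmoid'))  # both targets lie in (0, 1]
model.compile(loss='mean_squared_error', optimizer='adam')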

Alternatively, if you can make all the antigens the same length (by removing the parts of your dataset with short antigens and/or chopping off the ends of long antigens, assuming you know a priori that the epitope is not near the ends), you can frame it as a classification problem with two tasks (start and end) and sequence-length classes, where you're trying to assign a probability to the epitope starting and ending at each of the positions.
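A sketch of that classification framing using the Keras functional API (assuming all antigens have been trimmed or filtered to a common length L; the layer size is illustrative, and the targets would be one-hot vectors of length L marking the true start and end positions):

from keras.models import Model
from keras.layers import Input, Dense
from keras.layers import recurrent

L = max_antigen_len  # after making all antigens the same length
inp = Input(shape=(L, len(aa_chars)))
h = recurrent.GRU(30)(inp)
# One softmax over the L positions for the start of the epitope,
# and another for the end.
start_out = Dense(L, activation='softmax', name='start')(h)
end_out = Dense(L, activation='softmax', name='end')(h)
model = Model(inp, [start_out, end_out])
model.compile(loss='categorical_crossentropy', optimizer='adam')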

  3. How can I modify my code so that it runs faster?

Reducing the number of layers will speed your code up significantly. Also, GRUs will be faster than LSTMs due to their simpler architecture. However, both types of recurrent networks will be slower than, e.g., convolutional networks.
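For instance, the recurrent encoder could be swapped for a 1-D convolutional one (a sketch only; the filter counts and widths are illustrative, and the RepeatVector/decoder stack stays as in the original model):

from keras.layers import Convolution1D, MaxPooling1D, Flatten

model = Sequential()
# Convolutions slide over the sequence in parallel, which is typically
# much faster than a step-by-step recurrent pass.
model.add(Convolution1D(64, 3, activation='relu', input_shape=(max_antigen_len, len(aa_chars))))
model.add(MaxPooling1D(2))
model.add(Convolution1D(64, 3, activation='relu'))
model.add(Flatten())
model.add(Dense(HIDDEN_SIZE, activation='relu'))
model.add(RepeatVector(max_epitope_len))
# ... decoder layers as before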

Feel free to send me an email (address in my profile) if you're interested in a collaboration.
