
CTC model does not learn

I am trying to program a Keras model for audio transcription using connectionist temporal classification. Using a mostly working framewise classification model and the OCR example, I came up with the model given below, which I want to train on mapping the short-time Fourier transform of German sentences to their phonetic transcription.

My training data actually do have timing information, so I can use it to train a framewise model without CTC. The framewise prediction model, without the CTC loss, works decently (training accuracy 80%, validation accuracy 50%). There is, however, much more potential training data available without timing information, so I really want to switch to CTC. To test this, I removed the timing from the data, increased the output size by one for the blank (NULL) class and added a CTC loss function.

This CTC model, however, does not seem to learn. Overall, the loss is not going down (it went down from 2000 to 180 in a dozen epochs of 80 sentences each, but then went back up to 430), and the maximum-likelihood output it produces creeps around [nh for each and all of the sentences, which generally have around six words and transcriptions like [foːɐmʔɛsndʰaɪnəhɛndəvaʃn] ([ and ] are part of the sequence, representing the pause at the start and end of the audio).

I find it somewhat difficult to find good explanations of CTC in Keras, so it may be that I did something stupid. Did I mess up the model, mixing up the order of arguments somewhere? Do I need to be much more careful about how I train the model, maybe starting with audio snippets of one, two or three sounds each before giving the model complete sentences? In short,

How do I get this CTC model to learn?
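The `inputs`, `labels`, `input_length` and `label_length` tensors used in the code below are defined elsewhere in my pipeline; a sketch of what they might look like (with a placeholder `num_features` for the number of STFT bins, not my exact values):

    from keras.layers import Input

    # Sketch only: num_features is a placeholder for the STFT feature dimension.
    inputs = Input(shape=(None, num_features), name='spectrogram')        # framewise features
    labels = Input(shape=(None,), dtype='int32', name='labels')           # phoneme indices
    input_length = Input(shape=(1,), dtype='int32', name='input_length')  # frames per utterance
    label_length = Input(shape=(1,), dtype='int32', name='label_length')  # labels per utterance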

import keras
from keras.layers import Bidirectional, LSTM, Dense, Lambda
from keras.models import Model
from keras.optimizers import SGD
from keras.activations import softmax

# Three stacked bidirectional LSTM layers over the spectrogram frames.
connector = inputs
for l in [100, 100, 150]:
    lstmf, lstmb = Bidirectional(
        LSTM(
            units=l,
            dropout=0.1,
            return_sequences=True,
        ), merge_mode=None)(connector)

    connector = keras.layers.Concatenate(axis=-1)([lstmf, lstmb])

# Per-frame softmax over the phoneme classes plus one extra unit for the CTC blank.
output = Dense(
    units=len(dataset.SEGMENTS)+1,
    activation=softmax)(connector)

# The CTC loss is computed inside the graph by a Lambda layer.
loss_out = Lambda(
    ctc_lambda_func, output_shape=(1,),
    name='ctc')([output, labels, input_length, label_length])

ctc_model = Model(
    inputs=[inputs, labels, input_length, label_length],
    outputs=[loss_out])
# The Lambda layer already outputs the loss, so compile with an identity "loss".
ctc_model.compile(loss={'ctc': lambda y_true, y_pred: y_pred},
                  optimizer=SGD(
                      lr=0.02,
                      decay=1e-6,
                      momentum=0.9,
                      nesterov=True,
                      clipnorm=5))

ctc_lambda_func and the code to generate sequences from the predictions are taken from the OCR example.
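For completeness, that wrapper is essentially a thin Lambda around `K.ctc_batch_cost`, and decoding uses `K.ctc_decode`; roughly like this (a sketch, not my exact code; `best_path` is just an illustrative helper name):

    from keras import backend as K

    def ctc_lambda_func(args):
        # Compute the CTC loss inside the graph; argument order matches the
        # Lambda call above: [output, labels, input_length, label_length].
        y_pred, labels, input_length, label_length = args
        return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

    def best_path(y_pred, input_lengths):
        # Greedy (best-path) decoding of the softmax outputs into label indices.
        decoded, _ = K.ctc_decode(y_pred, input_lengths, greedy=True)
        return K.get_value(decoded[0])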

It is entirely invisible from the code given here, but elsewhere the OP links to their GitHub repository. The error actually lies in the data preparation:

The data are log spectrograms. They are unnormalized and mostly strongly negative. The CTC loss picks up on the overall distribution of labels much faster than the LSTM layers can adapt their input biases and input weights, so all variation in the input is flattened out. The temporary drop in the loss then presumably comes from epochs in which the network has not yet settled on predicting that marginal label distribution everywhere.
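This is easy to check by printing the raw statistics of a few spectrogram files before any scaling (assuming the same `files` list of `pathlib.Path` objects used below):

    import numpy

    for file in files[:3]:
        sg = numpy.load(file.with_suffix(".npy").open("rb"))
        # An unnormalized log spectrogram typically has a strongly negative mean
        # and a value range that never crosses zero.
        print(file.name, "min:", sg.min(), "mean:", sg.mean(), "max:", sg.max())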

The solution to this is to scale the input spectrograms such that they contain both positive and negative values:

import numpy

for i, file in enumerate(files):
    sg = numpy.load(file.with_suffix(".npy").open("rb"))
    # Rescale each log spectrogram to the range [-1, 1].
    spectrograms[i][:len(sg)] = 2 * (sg - sg.min()) / (sg.max() - sg.min()) - 1
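A per-file min-max rescale like this is just the most direct fix; standardizing with a mean and variance computed over the whole training set should work as well, as long as the network sees inputs centred around zero.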
