
CTC model does not learn

I am trying to program a Keras model for audio transcription using connectionist temporal classification (CTC). Using a mostly working framewise classification model and the OCR example, I came up with the model given below, which I want to train on mapping the short-time Fourier transform of German sentences to their phonetic transcription.

My training data actually do have timing information, so I can use it to train a framewise model without CTC. The framewise prediction model, without the CTC loss, works decently (training accuracy 80%, validation accuracy 50%). There is, however, much more potential training data available without timing information, so I really want to switch to CTC. To test this, I removed the timing from the data, increased the output size by one for the NULL class, and added a CTC loss function.

This CTC model does not seem to learn. Overall, the loss is not going down (it went down from 2000 to 180 over a dozen epochs of 80 sentences each, but then it went back up to 430), and the maximum likelihood output it produces creeps around [nh] for all of the sentences, which generally have around six words and transcriptions like [foːɐmʔɛsndʰaɪnəhɛndəvaʃn]. The brackets [ and ] are part of the sequence, representing the pause at the start and end of the audio.

I find it somewhat difficult to find good explanations of CTC in Keras, so it may be that I did something stupid. Did I mess up the model, mixing up the order of arguments somewhere? Do I need to be much more careful about how I train the model, starting maybe with audio snippets of one, two or three sounds each before giving the model complete sentences? In short:

How do I get this CTC model to learn?

import keras
from keras.activations import softmax
from keras.layers import Bidirectional, Dense, Lambda, LSTM
from keras.models import Model
from keras.optimizers import SGD

# `inputs`, `labels`, `input_length` and `label_length` are Input layers,
# `dataset.SEGMENTS` is the phone inventory, and `ctc_lambda_func` is taken
# from the OCR example; all of these are defined elsewhere.

# Stack of bidirectional LSTMs. merge_mode=None returns the forward and
# backward sequences separately, so they are concatenated by hand.
connector = inputs
for l in [100, 100, 150]:
    lstmf, lstmb = Bidirectional(
        LSTM(
            units=l,
            dropout=0.1,
            return_sequences=True,
        ), merge_mode=None)(connector)

    connector = keras.layers.Concatenate(axis=-1)([lstmf, lstmb])

# Per-frame softmax over the phone classes plus one extra class for the
# CTC blank (NULL).
output = Dense(
    units=len(dataset.SEGMENTS)+1,
    activation=softmax)(connector)

# The CTC loss is computed inside the graph by a Lambda layer.
loss_out = Lambda(
    ctc_lambda_func, output_shape=(1,),
    name='ctc')([output, labels, input_length, label_length])

ctc_model = Model(
    inputs=[inputs, labels, input_length, label_length],
    outputs=[loss_out])
# The compiled "loss" just passes the Lambda layer's output through unchanged.
ctc_model.compile(loss={'ctc': lambda y_true, y_pred: y_pred},
                  optimizer=SGD(
                      lr=0.02,
                      decay=1e-6,
                      momentum=0.9,
                      nesterov=True,
                      clipnorm=5))

ctc_lambda_func and the code to generate sequences from the predictions are from the OCR example.
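For reference, the loss wrapper in the OCR example is essentially the sketch below. The [:, 2:, :] slice there discards the first couple of frames, which are garbage in the image model's convolutional front end; for spectrogram input fed directly into LSTMs it may not be wanted.

from keras import backend as K

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # Drop the first two frames, as in the OCR example; possibly
    # unnecessary for spectrogram input.
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

At prediction time, the softmax output of a model sharing the layers up to output can be turned back into label sequences by greedy best-path decoding (a sketch; prediction_model, x_batch and frame_lengths are placeholders, not from the original code):

prediction_model = Model(inputs=inputs, outputs=output)
softmax_out = prediction_model.predict(x_batch)
best_paths = K.get_value(
    K.ctc_decode(softmax_out, input_length=frame_lengths, greedy=True)[0][0])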

It is entirely invisible from the code given here, but elsewhere the OP gives links to their GitHub repository. The error actually lies in the data preparation:

The data are log spectrograms. They are unnormalized and mostly highly negative. The CTC function picks up on the general distribution of labels much faster than the LSTM layer can adapt its input bias and input weights, so all variation in the input is flattened out. The local minimum of the loss might then come from epochs in which the marginalized distribution of labels has not yet been adopted globally.
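A quick way to check whether the same problem affects your own data is to inspect the raw value range of a spectrogram before training (a minimal sketch; the file name is a placeholder):

import numpy

sg = numpy.load("utterance_0001.npy")  # placeholder file name
print(sg.min(), sg.max(), sg.mean())
# An unnormalized log spectrogram typically prints values that are all
# negative, i.e. the input never changes sign at all.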

The solution to this is to scale the input spectrograms such that they contain both positive and negative values:

import numpy

for i, file in enumerate(files):
    sg = numpy.load(file.with_suffix(".npy").open("rb"))
    # Min-max scale each spectrogram to the range [-1, 1].
    spectrograms[i][:len(sg)] = 2 * (sg - sg.min()) / (sg.max() - sg.min()) - 1
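The slice assignment presupposes that spectrograms was pre-allocated as a zero-padded array; a minimal sketch of that setup (max_frames and n_bins are placeholder names and values, not from the original code):

import numpy

max_frames = 1000   # length of the longest utterance, in STFT frames
n_bins = 129        # number of STFT frequency bins

spectrograms = numpy.zeros((len(files), max_frames, n_bins),
                           dtype=numpy.float32)

A side benefit of scaling each file to [-1, 1] is that the zero padding then lies inside the range of valid input values rather than far outside it.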
