Understanding CTC loss for speech recognition in Keras

Question

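I am trying to understand how CTC loss works for speech recognition and how it can be implemented in Keras.

What I think I understood (please correct me if I'm wrong!)

Broadly, the CTC loss is added on top of a classical network in order to decode a sequential piece of information element by element (letter by letter for text or speech) rather than decoding a whole block of elements directly (a word, for example).

Let's say we are feeding utterances of some sentences as MFCCs.

The goal of using CTC loss is to learn how to make each letter match the MFCC at each time step. Thus, the Dense + softmax output layer is composed of as many neurons as the number of elements needed to compose the sentences:

the alphabet (a, b, ..., z)
a blank token (-)
a space (_) and an end character (>)

The softmax layer then has 29 neurons (26 for the alphabet + some special characters).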
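To make the role of the blank token concrete, here is a minimal sketch of how a sequence of per-time-step class indices collapses into text: consecutive repeats are merged, then blanks are dropped. The symbol order and the names ALPHABET, BLANK_INDEX and greedy_collapse are assumptions made for illustration. The one convention relied on is that K.ctc_batch_cost treats the last class index as the blank.

```python
import string

# Assumed symbol order: 26 letters, space "_", end character ">", and the
# blank "-" in the last position, for 29 classes in total.
ALPHABET = list(string.ascii_lowercase) + ["_", ">", "-"]
BLANK_INDEX = len(ALPHABET) - 1


def greedy_collapse(frame_indices):
    """Merge consecutive repeated indices, then drop blanks."""
    chars = []
    previous = None
    for index in frame_indices:
        if index != previous and index != BLANK_INDEX:
            chars.append(ALPHABET[index])
        previous = index
    return "".join(chars)


# Ten time steps that spell "hello": the blank between the two "l" runs
# is what keeps the double letter from being merged into one.
frames = [ALPHABET.index(c) for c in "hhe-ll-loo"]
decoded = greedy_collapse(frames)
```

Here decoded is the string "hello", even though the network emitted ten frames for five letters.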

To implement it, I found that I can do something like this:

# CTC implementation from the Keras example found at
# https://github.com/keras-team/keras/blob/master/examples/image_ocr.py
from keras import backend as K
from keras.layers import Input, Dense, LSTM, Bidirectional, TimeDistributed, Lambda
from keras.models import Model

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # the 2 is critical here since the first couple outputs of the RNN
    # tend to be garbage:
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)



input_data = Input(shape=(1000, 20))
# let's say each MFCC is (1000 time steps x 20 features)

x = Bidirectional(LSTM(..., return_sequences=True))(input_data)

x = Bidirectional(LSTM(..., return_sequences=True))(x)

y_pred = TimeDistributed(Dense(units=ALPHABET_LENGTH, activation='softmax'))(x)

loss_out = Lambda(function=ctc_lambda_func, name='ctc', output_shape=(1,))(
                  [y_pred, y_true, input_length, label_length])

model = Model(inputs=[input_data, y_true, input_length, label_length],
              outputs=loss_out)

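ALPHABET_LENGTH = 29 (alphabet length + special characters)

And:

y_true: tensor (samples, max_string_length) containing the truth labels.
y_pred: tensor (samples, time_steps, num_categories) containing the prediction, or output of the softmax.
input_length: tensor (samples, 1) containing the sequence length for each batch item in y_pred.
label_length: tensor (samples, 1) containing the sequence length for each batch item in y_true.

(source)

Now, I'm facing some problems:

What I don't understand
- Is this implementation the right way to code and use CTC loss?
- I don't understand what y_true, input_length and label_length concretely are. Any examples?
- In what form should I give the labels to the network? Again, any examples?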

Answer 1

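What are these?

y_true: your ground truth data, that is, the data you are going to compare with the model's outputs in training. (y_pred, on the other hand, is the model's calculated output.)
input_length: the length (in steps, or chars in this case) of each sample (sentence) in the y_pred tensor (as stated in the linked documentation).
label_length: the length (in steps, or chars in this case) of each sample (sentence) in the y_true (or labels) tensor.

It seems this loss expects your model's outputs (y_pred) to have different lengths, and your ground truth data (y_true) as well. This is probably to avoid calculating the loss for garbage characters after the end of the sentences (since you need a fixed-size tensor to work with lots of sentences at once).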
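To see why the two lengths matter, it can help to look at what the loss computes for a single sample. The following is a plain-Python sketch of the standard CTC forward recursion, not the Keras implementation; the function name and argument layout are assumptions for illustration. It sums the probability of every frame-level alignment that collapses to the target, using only the first input_length frames and the first label_length labels.

```python
import math


def ctc_neg_log_likelihood(probs, labels, input_length, label_length, blank):
    """CTC loss for one sample via the forward recursion.

    probs: list of per-frame probability lists (softmax outputs).
    labels: list of label indices, possibly padded past label_length.
    """
    frames = probs[:input_length]      # padded frames are ignored
    target = labels[:label_length]     # padded labels are ignored

    # Extended target: blank, l1, blank, l2, ..., blank
    extended = [blank]
    for label in target:
        extended += [label, blank]

    size = len(extended)
    alpha = [0.0] * size
    alpha[0] = frames[0][extended[0]]
    if size > 1:
        alpha[1] = frames[0][extended[1]]

    for frame in frames[1:]:
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * frame[extended[s]]
        alpha = new_alpha

    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0 else math.inf


# Toy setup with two classes: index 0 is "a", index 1 is the blank.
toy_probs = [[0.6, 0.4], [0.3, 0.7], [0.9, 0.1]]
padded_label = [0, 0]  # the second entry is padding when label_length is 1

loss_two_frames = ctc_neg_log_likelihood(toy_probs, padded_label, 2, 1, blank=1)
loss_three_frames = ctc_neg_log_likelihood(toy_probs, padded_label, 3, 1, blank=1)
```

The two losses differ only because input_length changes how many frames are counted, which is the effect the length tensors have on padded batches.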

Form of the labels:

Since the function's documentation asks for shape (samples, length), the format is the char index of each char in each sentence.
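As a concrete illustration of that format, here is a small sketch that turns sentences into padded rows of char indices plus a (samples, 1) column of true lengths. The symbol table, the PAD_VALUE filler and the function name encode_batch are hypothetical; the nested lists would typically be wrapped in arrays before being fed to the model.

```python
import string

# Assumed symbol table: letters, space "_" and end character ">". The blank
# is implied as the last class and never appears in the labels themselves.
CHAR_TO_INDEX = {c: i for i, c in enumerate(string.ascii_lowercase + "_>")}
PAD_VALUE = 0  # hypothetical filler for positions past label_length


def encode_batch(sentences):
    """Return (y_true, label_length) as nested lists."""
    encoded = [[CHAR_TO_INDEX[c] for c in s] for s in sentences]
    max_len = max(len(row) for row in encoded)
    y_true = [row + [PAD_VALUE] * (max_len - len(row)) for row in encoded]
    label_length = [[len(row)] for row in encoded]
    return y_true, label_length


y_true, label_length = encode_batch(["hi_you>", "ok>"])
```

The second sentence is padded to the length of the first, and label_length records that only its first three entries are real labels.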

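How to use this?

There are some possibilities.

1 - If you don't care about lengths:

If all lengths are the same, you can easily use it as a regular loss: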

def ctc_loss(y_true, y_pred):

    return K.ctc_batch_cost(y_true, y_pred, input_length, label_length)
    #where input_length and label_length are constants you created previously
    #the easiest way here is to have a fixed batch size in training 
    #the lengths should have the same batch size (see shapes in the link for ctc_cost)    

model.compile(loss=ctc_loss, ...)   

#here is how you pass the labels for training
model.fit(input_data_X_train, ground_truth_data_Y_train, ....)
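When the lengths are constants, they still have to be consistent with each other: CTC needs at least one frame per label, plus one extra frame for every pair of identical adjacent labels (a blank has to separate them). The sketch below builds the constant (batch_size, 1) columns as nested lists and computes that minimum. The helper names, the batch size of 4 and the label length of 50 are hypothetical; 1000 is the number of time steps used in the question.

```python
def min_frames_needed(label_indices):
    """Fewest time steps that can emit the label under CTC rules."""
    repeats = sum(
        1 for a, b in zip(label_indices, label_indices[1:]) if a == b
    )
    # Each adjacent repeated label needs one blank between the two copies.
    return len(label_indices) + repeats


def constant_lengths(batch_size, time_steps, label_steps):
    """Build the (batch_size, 1) length columns as nested lists."""
    input_length = [[time_steps] for _ in range(batch_size)]
    label_length = [[label_steps] for _ in range(batch_size)]
    return input_length, label_length


input_length, label_length = constant_lengths(
    batch_size=4, time_steps=1000, label_steps=50
)
frames_for_hello = min_frames_needed([7, 4, 11, 11, 14])  # indices of "hello"
```

If input_length is smaller than this minimum for some sample, no valid alignment exists for it.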

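2 - If you care about the lengths.

This is a little more complicated: your model needs to tell you, somehow, the length of each output sentence.
There are again several creative ways of doing this:

Have an "end_of_sentence" char and detect where in the sentence it is.
Have a branch of your model calculate this number and round it to an integer.
(Hardcore) If you are using a stateful manual training loop, get the index of the iteration at which you decided to finish a sentence.

I like the first idea, and will exemplify it here.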

def ctc_find_eos(y_true, y_pred):
    # eos_index and max_length are constants you created previously

    # convert y_pred from one-hot to label indices
    y_pred_ind = K.argmax(y_pred, axis=-1)

    # to make sure y_pred has one end_of_sentence (to avoid errors)
    y_pred_end = K.concatenate([
                                  y_pred_ind[:,:-1],
                                  eos_index * K.ones_like(y_pred_ind[:,-1:])
                               ], axis = 1)

    # to make sure the first occurrence of the char is more important than subsequent ones
    # (step=-1 is needed so the weights count down from max_length to 1)
    occurrence_weights = K.arange(start=max_length, stop=0, step=-1, dtype=K.floatx())

    # is eos? (K.cast is used because the inputs here are tensors;
    # K.cast_to_floatx is documented for NumPy arrays)
    is_eos_true = K.cast(K.equal(y_true, eos_index), K.floatx())
    is_eos_pred = K.cast(K.equal(y_pred_end, eos_index), K.floatx())

    # lengths
    true_lengths = 1 + K.argmax(occurrence_weights * is_eos_true, axis=1)
    pred_lengths = 1 + K.argmax(occurrence_weights * is_eos_pred, axis=1)

    # reshape to (samples, 1)
    true_lengths = K.reshape(true_lengths, (-1,1))
    pred_lengths = K.reshape(pred_lengths, (-1,1))

    return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)

model.compile(loss=ctc_find_eos, ....)
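The weighting trick in the loss above can be checked on plain lists. This sketch mirrors it for a single row: descending weights are multiplied by an "is end of sentence" mask, and the argmax then lands on the first match. The function name first_eos_length and the index 27 for the end character are assumptions for illustration.

```python
def first_eos_length(indices, eos_index):
    """Length up to and including the first end-of-sentence index."""
    max_length = len(indices)
    # Descending weights: max_length, ..., 1
    weights = range(max_length, 0, -1)
    scores = [w if idx == eos_index else 0 for w, idx in zip(weights, indices)]
    # The earliest match carries the largest weight, so argmax selects it.
    best = max(range(max_length), key=scores.__getitem__)
    return best + 1


# Two end-of-sentence marks (27): only the first one counts.
length = first_eos_length([7, 8, 27, 3, 27, 0], eos_index=27)
```

When no end-of-sentence index is present, every score is zero and the function returns 1, which is why the loss forces an end-of-sentence mark at the last step of the predictions.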

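If you use the other option, use a model branch to calculate the lengths, concatenate these lengths to the first or last step of the output, and make sure you do the same with the true lengths in your ground truth data. Then, in the loss function, just take the section for the lengths: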

def ctc_concatenated_length(y_true, y_pred):

    #assuming you concatenated the length in the first step
    true_lengths = y_true[:,:1] #may need to cast to int
    y_true = y_true[:, 1:]

    #since y_pred uses one-hot, you will need to concatenate to full size of the last axis, 
    #thus the 0 here
    pred_lengths = K.cast(y_pred[:, :1, 0], "int32")
    y_pred = y_pred[:, 1:]

    return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)
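On the data side, the matching preparation step is to prepend each true length to its label row before training. A minimal sketch on plain lists, with hypothetical helper names pack_targets and unpack_targets, where the second function mirrors the slicing done inside the loss:

```python
def pack_targets(label_rows, label_lengths):
    """Prepend each true length to its (already padded) label row."""
    return [[length] + row for row, length in zip(label_rows, label_lengths)]


def unpack_targets(packed_rows):
    """Inverse of pack_targets, mirroring the slicing in the loss function."""
    lengths = [[row[0]] for row in packed_rows]  # like y_true[:, :1]
    labels = [row[1:] for row in packed_rows]    # like y_true[:, 1:]
    return labels, lengths


packed = pack_targets([[14, 10, 27, 0], [7, 8, 26, 27]], [3, 4])
labels, lengths = unpack_targets(packed)
```

Each packed row is one element longer than the label row, so the ground truth tensor passed to fit has shape (samples, max_string_length + 1) in this layout.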

Solution 1
0 accepted 2019-08-09 00:17:36
