
How to build an embedding layer in Keras

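I am trying to build a text classification model in TensorFlow, following one of the tutorials in Francois Chollet's book. I am trying to create an embedding layer first, but it keeps breaking at this stage.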

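My logic is as follows:

  • Start with a list of text strings as X and a list of integers as y.
  • Tokenize, vectorize, and pad the text data to the longest sequence length.
  • Convert each integer label into a one-hot encoded array.
  • Feed the embedding layer with these inputs:
    • input_dim = the total number of unique tokens/words (1499 in my case)
    • output_dim = the dimension of the embedding vectors (starting with 32)
    • input_length = the length of the longest sequence, the same dimension the sequences are padded to (295 in my case)
  • Pass the embedding output to a dense layer with 32 hidden units and relu.
  • Pass that to a dense layer with 3 units and softmax to predict the 3 classes.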

Can someone explain what I am doing wrong here? I thought I understood how to instantiate an embedding layer, but is my understanding incorrect?

Here is my code:

    # read in raw data
    df = pd.read_csv('text_dataset.csv')
    samples = df.data.tolist()  # list of strings of text
    labels = df.sentiment.to_list()  # list of integers

    # tokenize and vectorize text data to prepare for embedding
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(samples)
    sequences = tokenizer.texts_to_sequences(samples)
    word_index = tokenizer.word_index
    print(f'Found {len(word_index)} unique tokens.')

    # setting variables
    vocab_size = len(word_index)  # 1499
    # Input_dim: This is the size of the vocabulary in the text data.
    input_dim = vocab_size  # 1499
    # This is the size of the vector space in which words will be embedded.
    output_dim = 32  # recommended by tf
    # This is the length of input sequences
    max_sequence_length = len(max(sequences, key=len))  # 295
    # train/test index splice variable
    training_samples = round(len(samples)*.8)

    # data = pad_sequences(sequences, maxlen=max_sequence_length) # shape (499, 295)
    # keras automatically pads to maxlen if left without input
    data = pad_sequences(sequences)

    # preprocess labels into one hot encoded array of 3 classes ([1., 0., 0.])
    labels = to_categorical(labels, num_classes=3, dtype='float32')  # shape (499, 3)

    # Create test/train data (80% train, 20% test)
    x_train = data[:training_samples]
    y_train = labels[:training_samples]
    x_test = data[training_samples:]
    y_test = labels[training_samples:]

    model = Sequential()
    model.add(Embedding(input_dim, output_dim, input_length=max_sequence_length))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.summary()

    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    model.fit(x_train,
              y_train,
              epochs=10,
              batch_size=32,
              validation_data=(x_test, y_test))

When I run this, I get the following error:

    Found 1499 unique tokens.
    Model: "sequential_23"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    embedding_21 (Embedding)     (None, 295, 32)           47968
    _________________________________________________________________
    dense_6 (Dense)              (None, 295, 32)           1056
    _________________________________________________________________
    dense_7 (Dense)              (None, 295, 3)            99
    =================================================================
    Total params: 49,123
    Trainable params: 49,123
    Non-trainable params: 0
    _________________________________________________________________
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-144-f29ef892e38d> in <module>()
         51           epochs=10,
         52           batch_size=32,
    ---> 53           validation_data=(x_test, y_test))

    2 frames
    /usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
        129                         ': expected ' + names[i] + ' to have ' +
        130                         str(len(shape)) + ' dimensions, but got array '
    --> 131                         'with shape ' + str(data_shape))
        132                 if not check_batch_axis:
        133                     data_shape = data_shape[1:]

    ValueError: Error when checking target: expected dense_7 to have 3 dimensions, but got array with shape (399, 3)

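To troubleshoot, I have been commenting out layers to see what is going on. I found that the problem persists all the way down to the first layer, which makes me think my understanding of the Embedding layer is poor. See below: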

    model = Sequential()
    model.add(Embedding(input_dim, output_dim, input_length=max_sequence_length))
    # model.add(Dense(32, activation='relu'))
    # model.add(Dense(3, activation='softmax'))
    model.summary()

The result is:

    Found 1499 unique tokens.
    Model: "sequential_24"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    embedding_22 (Embedding)     (None, 295, 32)           47968
    =================================================================
    Total params: 47,968
    Trainable params: 47,968
    Non-trainable params: 0
    _________________________________________________________________
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-150-63d1b96db467> in <module>()
         51           epochs=10,
         52           batch_size=32,
    ---> 53           validation_data=(x_test, y_test))

    2 frames
    /usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
        129                         ': expected ' + names[i] + ' to have ' +
        130                         str(len(shape)) + ' dimensions, but got array '
    --> 131                         'with shape ' + str(data_shape))
        132                 if not check_batch_axis:
        133                     data_shape = data_shape[1:]

    ValueError: Error when checking target: expected embedding_22 to have 3 dimensions, but got array with shape (399, 3)

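For this classifier, the dense layers in Keras need a flat input with only 2 dimensions, [BATCH_SIZE, N]. The output of the embedding layer for a sentence has 3 dimensions: [BS, SEN_LENGTH, EMBEDDING_SIZE]. A Dense layer only transforms the last axis, so the model's output stays 3-dimensional, (None, 295, 3), and cannot match the 2-dimensional targets of shape (399, 3). That mismatch is what the error reports.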
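The shape reasoning above can be checked without Keras. The following is a minimal NumPy sketch, not Keras itself: it assumes an embedding lookup behaves like indexing a table and a dense layer behaves like a matrix product on the last axis. The sizes come from the question (sequence length 295, embedding size 32, 3 classes); the batch size of 4, the random weights, and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, vocab, emb_dim, n_classes = 4, 295, 1500, 32, 3

# Padded integer sequences, one row per sentence: shape (batch, seq_len).
token_ids = rng.integers(0, vocab, size=(batch, seq_len))

# An embedding lookup replaces every token id with a vector,
# which adds a third axis: (batch, seq_len, emb_dim).
embedding_table = rng.normal(size=(vocab, emb_dim))
embedded = embedding_table[token_ids]

# A dense layer applied to a 3-D input only mixes the last axis,
# so the result is one prediction per token: (batch, seq_len, n_classes).
w_token = rng.normal(size=(emb_dim, n_classes))
per_token = embedded @ w_token

# Flattening first gives one row per sentence, so the dense layer
# yields one prediction per sentence: (batch, n_classes).
flat = embedded.reshape(batch, -1)
w_sentence = rng.normal(size=(seq_len * emb_dim, n_classes))
per_sentence = flat @ w_sentence

print(embedded.shape, per_token.shape, flat.shape, per_sentence.shape)
```

Only the last shape matches one-hot labels with one row per sample, which is why the targets of shape (399, 3) were rejected.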

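There are 2 options to fix this:

  1. Flatten the output of the embedding layer before the first dense layer: model.add(Flatten())
  2. Use a convolutional layer (recommended): model.add(Conv1D(filters=32, kernel_size=8, activation='relu')). Note that the Conv1D output is still 3-dimensional, so a pooling layer such as GlobalMaxPooling1D() or a Flatten() is still needed before the final dense layer.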
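Both options can be sketched as follows. This is a minimal sketch under assumptions, not the asker's exact script: it uses the `tensorflow.keras` API, declares the input length with `keras.Input` instead of `input_length`, and the helper names `build_flatten_model` and `build_conv_model` are hypothetical. It also sets `input_dim` to the vocabulary size plus one, on the assumption that the Keras `Tokenizer` numbers words from 1 and `pad_sequences` pads with 0, so the largest index is 1499 rather than 1498.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 1499   # unique tokens reported in the question
MAX_LEN = 295       # padded sequence length from the question
EMBED_DIM = 32
NUM_CLASSES = 3


def build_flatten_model():
    """Option 1: flatten the 3-D embedding output before the dense layers."""
    return keras.Sequential([
        keras.Input(shape=(MAX_LEN,)),
        # +1 is an assumption: index 0 is padding, words start at index 1.
        layers.Embedding(VOCAB_SIZE + 1, EMBED_DIM),
        layers.Flatten(),                      # (None, 295 * 32)
        layers.Dense(32, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])


def build_conv_model():
    """Option 2: convolve over the sequence, then pool down to 2-D."""
    return keras.Sequential([
        keras.Input(shape=(MAX_LEN,)),
        layers.Embedding(VOCAB_SIZE + 1, EMBED_DIM),
        layers.Conv1D(filters=32, kernel_size=8, activation="relu"),
        layers.GlobalMaxPooling1D(),           # (None, 32)
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])


flatten_model = build_flatten_model()
conv_model = build_conv_model()
print(flatten_model.output_shape)
print(conv_model.output_shape)
```

Either model ends in a 2-D output with one row of 3 class probabilities per sample, so it can be compiled with `categorical_crossentropy` and fitted on the one-hot labels from the question.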

