Keras RNN: input shape reported as incorrect even though the shape appears to be correct

Question

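I am trying to train an RNN to classify text. On my computer I have one large text file per class (2 classes in total) containing all the phrases used to train the network, for example

Phrase 1
Phrase 2
Phrase 3

I then convert it to a dataset using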

tf.data.TextLineDataset(
directory)

This leaves the items without labels, so I used the function

directory.map(lambda ex: labeler(ex,2))

which adds a label to every item, leaving a dataset that looks like this:

<MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>

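I then split it into a validation set and a training set using .skip and .take, and combine the two classes into one validation dataset and one training dataset using category1 = category1.concatenate(category2)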
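The split-and-combine step described above can be sketched as follows. This is a minimal illustration, not the asker's exact code: the in-memory datasets stand in for the two text files, and the helper name split and the 10-item size are assumptions made for the example.

```python
import tensorflow as tf

# Hypothetical stand-ins for the two text files: 10 lines per class.
cat1 = tf.data.Dataset.from_tensor_slices([f"a{i}" for i in range(10)])
cat2 = tf.data.Dataset.from_tensor_slices([f"b{i}" for i in range(10)])

def split(ds, n_items, train_fraction=0.8):
    """Return (train, val): the first train_fraction of ds, then the rest."""
    n_train = int(n_items * train_fraction)
    return ds.take(n_train), ds.skip(n_train)

train1, val1 = split(cat1, 10)
train2, val2 = split(cat2, 10)

# concatenate appends the second dataset after the first.
train_set = train1.concatenate(train2)
val_set = val1.concatenate(val2)

n_train = sum(1 for _ in train_set)
n_val = sum(1 for _ in val_set)
print(n_train, n_val)
```

Because concatenate keeps all items of the first class ahead of the second, shuffling the combined training set before fitting is usually advisable.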

I then created a vectorization function that looks like this:

def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)

  return vectorize_layer(text), label

and ran the training and validation sets through this function to vectorize all the phrases. That leaves a dataset that looks like this:

<MapDataset element_spec=(TensorSpec(shape=(None, 250), dtype=tf.int64, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>

An example item is

(<tf.Tensor: shape=(1, 250), dtype=int64, numpy=array([[   1,   28,   12, 1199, 3445,   61,   31,  166,  163,   13,   28,
           2,   97,   13,    6,  206,  625,  972,  344,    7, 2790,   11,
           1, 1379, 3615,   24,    1,    2,   27,   21,    3,  435,    4,
          16,    1,   15,   22,    1,    3,  127,    2,   13,   36,    8,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0]], dtype=int64)>, <tf.Tensor: shape=(), dtype=int64, numpy=2>),

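As you can see, the item has shape (1, 250), and there is a second tensor indicating which class it belongs to, in this case #2. I then feed it through my model, and that is where things break. The model looks like this: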

model = keras.Sequential()
model.add(keras.layers.LSTM(128, input_shape=(1,250,), activation="relu", return_sequences=True))
model.add(keras.layers.LSTM(128, activation="relu", return_sequences=True))
model.add(keras.layers.LSTM(32, activation="relu", return_sequences=False))
model.add(keras.layers.Dense(1,activation="relu"))

model.compile(optimizer=tf.keras.optimizers.Adam(0.01), loss='binary_crossentropy', metrics=['accuracy','Precision','Recall'])
model.fit(train_set,batch_size=32,epochs=1)

But when I run the code, I get the error

 ValueError: Input 0 of layer "sequential_54" is incompatible with the layer: expected shape=(None, 1, 250), found shape=(None, 250)

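To fix this, I tried adding a Reshape layer, but that did not work. I also tried np.expand_dims, but that did not solve the problem either. Does anyone have a solution? Also, some attributes, such as train_set.shape, raise errors like ConcatenateDataset object has no attribute shape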
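A note on inspecting shapes: a tf.data dataset has no .shape attribute, but its element_spec property describes one element. The sketch below uses a made-up dataset with the same element structure as in the question (the zero-filled tensors and the labels are assumptions). It also hints at a likely reading of the error: when a dataset is passed to fit, Keras treats the leading axis of each element as the batch axis, so an unbatched element of shape (1, 250) looks like a batch of one sample of shape (250,).

```python
import tensorflow as tf

# Hypothetical dataset with the same element structure as in the question.
features = tf.zeros((4, 1, 250), dtype=tf.int64)
labels = tf.constant([1, 1, 2, 2], dtype=tf.int64)
ds = tf.data.Dataset.from_tensor_slices((features, labels))

# element_spec replaces the missing .shape attribute.
text_spec, label_spec = ds.element_spec
unbatched_shape = text_spec.shape.as_list()
print(unbatched_shape)

# Batching adds the leading batch axis that Keras expects.
batched_shape = ds.batch(2).element_spec[0].shape.as_list()
print(batched_shape)

# Looking at one concrete element is another quick check.
for text, label in ds.take(1):
    first_item_shape = text.shape.as_list()
```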

Edit: data preprocessing
Splitting and extraction


def labeler(example, index):  #function to label items 
  return example, tf.cast(index, tf.int64)

train_set_1 = tf.data.TextLineDataset( #get data 
    "comments1.txt",
    compression_type=None,
    buffer_size=None,
    num_parallel_reads=None,
    name=None
)

#split data into val and training 

val_set_1 = train_set_1.skip(int(1200*8/10)) #1200 is the number of items so 80:20 split
train_set_1 = train_set_1.take(int(1200*8/10))
#label both sets 
labeled_train_1 = train_set_1.map(lambda ex: labeler(ex, 1)) #1 is the label
labeled_val_1 = val_set_1.map(lambda ex: labeler(ex, 1))
print(labeled_train_1)


print(train_set_1)

#repeat for set 2 
train_set_2 = tf.data.TextLineDataset(
    "comments2.txt",
    compression_type=None,
    buffer_size=None,
    num_parallel_reads=None,
    name=None
)
val_set_2 = train_set_2.skip(int(1200*8/10))
train_set_2= train_set_2.take(int(1200*8/10))
labeled_train_2 = train_set_2.map(lambda ex: labeler(ex, 2))
labeled_val_2 = val_set_2.map(lambda ex: labeler(ex, 2))

Vectorization

#len(counter) is total number of words, max_length is 250
vectorize_layer = tf.keras.layers.TextVectorization( max_tokens=len(counter), output_mode='int', output_sequence_length=max_length)
vectorize_layer.adapt(train_set_1)
vectorize_layer.adapt(train_set_2)  


def vectorize_text(text, label): #this is where i can change the dimensions and vectorize the whole sequence 
  text = tf.expand_dims(text, 0)
  text = tf.expand_dims(text, -1)

  return vectorize_layer(text), label

#actually vectorizing and combining all the text
train_1= labeled_train_1.map(vectorize_text)
train_2 = labeled_train_2.map(vectorize_text)
train_set = train_1.concatenate(train_2)
val_1 = labeled_val_1.map(vectorize_text)
val_2 = labeled_val_2.map(vectorize_text)
val_set = val_1.concatenate(val_2)
print(val_2)
print(list(val_2))

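After that, it is simply fed into the model.
Edit 2:
I found a notebook that does something similar to my project, and it appears to use an Embedding layer in the network, so I think an Embedding layer might help. I have tried it with different expand_dims configurations, still without a working solution, but the layer may be useful.

This is what I am currently experimenting with: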

model.add(keras.layers.Embedding(len(counter),250,input_length=1))
model.add(keras.layers.LSTM(128, activation="relu", return_sequences=True))
model.add(keras.layers.LSTM(128, activation="relu", return_sequences=True))
model.add(keras.layers.LSTM(32, activation="relu", return_sequences=False))
model.add(keras.layers.Dense(1,activation="relu"))

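But I have also tried len(counter),1,input_length=250.
Edit 3: I have managed to change the dimensions to 250,1 instead of 1,250, but I now get an error saying that the input shape in the fit loop is None,1,1. It seems the problem may be that the input contains both the tokenized words, a tensor of size 250,1, and the answer (dataset 1 or dataset 2), which is another tensor. That would make a tensor containing 2 tensors, which might give a size of None,1,1.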
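For the Embedding idea from Edit 2, here is a hedged sketch of how the shapes would line up. An Embedding layer takes integer token ids of shape (batch, 250) and returns (batch, 250, embedding_dim), which is already the 3D input an LSTM expects, so no extra axis is needed. The vocabulary size, the embedding width of 16, the small LSTM, and the sigmoid output are assumptions chosen to keep the example small. They are not taken from the question.

```python
import tensorflow as tf

vocab_size = 1000  # hypothetical; the question uses len(counter)

embedding = tf.keras.layers.Embedding(vocab_size, 16)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(250,), dtype="int64"),  # 250 token ids per phrase
    embedding,                                    # -> (batch, 250, 16)
    tf.keras.layers.LSTM(8),                      # -> (batch, 8)
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# A made-up batch of 4 vectorized phrases.
tokens = tf.zeros((4, 250), dtype=tf.int64)
emb_shape = embedding(tokens).shape.as_list()
out_shape = model(tokens).shape.as_list()
print(emb_shape, out_shape)
```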

Answer 1

You just need to expand your dims, although you may have missed a step, which I will get to later... The fix should be:

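# tf.expand_dims operates on tensors, so apply it to each dataset element with map
model.fit(train_set.map(lambda x, y: (tf.expand_dims(x, 1), y)),batch_size=32,epochs=1)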

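Now, about the dims:
An RNN expects a single element of shape (None, X), where X is a positive integer.
The first dimension, None, stands for the length of your phrase/sequence. Because the length may vary, None is used to avoid having to fix it by hand.
The second dimension, X, stands for the "features" of each element in the sequence. In a weather forecast, for example, these features would be wind speed, humidity, and so on.

That said, your sequences should be encoded as (250, 1), because you have 250 words/elements in the sequence and each word has 1 feature (the integer corresponding to it).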
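To make the (250, 1) encoding concrete, here is a small NumPy sketch. The zero-filled arrays are placeholders for vectorized phrases. Only the shapes matter here.

```python
import numpy as np

# Hypothetical batch of 4 vectorized phrases, 250 token ids each.
batch = np.zeros((4, 250), dtype=np.int64)

# Add a trailing feature axis: (batch, timesteps) -> (batch, timesteps, features).
batch_3d = np.expand_dims(batch, -1)
print(batch_3d.shape)

# A single item stored as (1, 250) holds the same 250 values;
# reshaping it gives 250 timesteps with 1 feature each.
item = np.zeros((1, 250), dtype=np.int64)
item_seq = item.reshape(250, 1)
print(item_seq.shape)
```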

Given this, it seems to me that you should use the following:

model = keras.Sequential()
model.add(keras.layers.LSTM(128, input_shape=(250,1), activation="relu", return_sequences=True))
model.add(keras.layers.LSTM(128, activation="relu", return_sequences=True))
model.add(keras.layers.LSTM(32, activation="relu", return_sequences=False))
model.add(keras.layers.Dense(1,activation="relu"))

model.compile(optimizer=tf.keras.optimizers.Adam(0.01), loss='binary_crossentropy', metrics=['accuracy','Precision','Recall'])
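# tf.expand_dims operates on tensors, so apply it to each dataset element with map
model.fit(train_set.map(lambda x, y: (tf.expand_dims(x, -1), y)),batch_size=32,epochs=1)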

You can see this in the documentation on this page:

Call arguments
inputs: A 3D tensor with shape [batch, timesteps, feature].
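As a rough end-to-end check of the [batch, timesteps, feature] layout, the sketch below reshapes each dataset element, batches the dataset, and runs one batch through a small LSTM model. The zero-filled data, the helper name to_sequence, the tiny layer sizes, the 0/1 labels, and the sigmoid output are all assumptions made for illustration. They are not part of the answer above.

```python
import tensorflow as tf

# Hypothetical stand-in for the vectorized dataset: elements of shape (1, 250).
texts = tf.zeros((8, 1, 250), dtype=tf.int64)
labels = tf.constant([0, 0, 0, 0, 1, 1, 1, 1], dtype=tf.int64)
ds = tf.data.Dataset.from_tensor_slices((texts, labels))

def to_sequence(text, label):
    # (1, 250) -> (250, 1): 250 timesteps with 1 feature each, as floats.
    seq = tf.cast(tf.reshape(text, (250, 1)), tf.float32)
    return seq, label

# batch() supplies the leading batch axis, giving (batch, 250, 1).
batched = ds.map(to_sequence).batch(4)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(250, 1)),
    tf.keras.layers.LSTM(8),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

for x, y in batched.take(1):
    x_shape = x.shape.as_list()
    out_shape = model(x).shape.as_list()
print(x_shape, out_shape)
```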

Answer 2

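I found the solution. The problem was a bug in the way the vectorization layer works, which caused it to sometimes return an empty array instead of a padded one. Because of this, I had to convert the dataset to arrays using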

import tensorflow_datasets as tfds  # provides as_numpy

def dataset_to_numpy(ds):

    #Convert tensorflow dataset to numpy arrays

    texts = []
    labels = []

    # Iterate over a dataset
    for i, (text, label) in enumerate(tfds.as_numpy(ds)):
        texts.append(text)
        labels.append(label)

    for i, txt in enumerate(texts):
        if i < 3:
            print(txt.shape, labels[i])

    return texts, labels

I ran the validation and training sets through this function and then used this code

# collect the indices of items whose shape is not (250, 1)
to_del = []
for i in range(len(train_set[0])):
  if train_set[0][i].shape != (250, 1):
    print(i)
    to_del.append(i)

This gave me the items to delete. I then deleted those items from the arrays and ran this code


train_set[0] = list(train_set[0])
train_set[1] = list(train_set[1])
train_set[0] = np.array([np.array(val) for val in train_set[0]])
train_set[1] = np.array([np.array(val) for val in train_set[1]])
train_set[1] = np.expand_dims(train_set[1],-1)

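This turned the lists into NumPy arrays and also expanded the dimensions of the targets. Finally, I split train_set into x_train and y_train and fed them into the network, which then started training.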
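The filter-then-convert steps above can be condensed into one helper. This is only a sketch of the same idea. The function name filter_and_stack and the sample data are hypothetical, and it builds new arrays instead of deleting items in place.

```python
import numpy as np

def filter_and_stack(texts, labels, expected_shape=(250, 1)):
    """Keep items whose shape matches expected_shape and stack them into arrays."""
    keep = [i for i, t in enumerate(texts) if np.shape(t) == expected_shape]
    x = np.stack([np.asarray(texts[i]) for i in keep])
    # Expand the targets to shape (n, 1), as in the code above.
    y = np.expand_dims(np.asarray([labels[i] for i in keep]), -1)
    return x, y

# Hypothetical data: the middle item is empty, as in the problem described above.
texts = [np.zeros((250, 1)), np.zeros((0, 1)), np.ones((250, 1))]
labels = [1, 1, 2]

x_train, y_train = filter_and_stack(texts, labels)
print(x_train.shape, y_train.shape)
```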

Answer 1
3 2022-07-27 17:24:04

Answer 2
0 Accepted 2022-07-29 12:03:09