
ValueError: bad shape with pretrained embedding matrix in Keras model

   user_id  tags
0      234  drama , police , year , perfect , space , mech...
1      382  short normal , city , movie short , thriller ,...
2      741  world , tv short seasonal , school , life , pe...

I previously computed the 15 most relevant words for each user in the dataframe shown above, and I built a pretrained embedding matrix from the GloVe dataset:

import numpy as np
from tqdm import tqdm

# build a word -> vector lookup from the GloVe file
GLOVE = 'Mypath/Anime_project/glove.6B.300d.txt'
embeddings_index = {}
with open(GLOVE, encoding='utf8') as f:
    for line in tqdm(f):
        values = line.rstrip().rsplit(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
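As a quick sanity check (an added sketch, not in the original post), the loaded index can be verified before use; the word 'drama' is just an arbitrary probe taken from the tags above:

# hypothetical sanity check: confirm vocabulary size and vector width
print("loaded %d word vectors" % len(embeddings_index))
sample_vec = embeddings_index.get('drama')
if sample_vec is not None:
    assert sample_vec.shape == (300,)  # glove.6B.300d vectors are 300-dim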

Then I use Keras's Tokenizer:

from keras.preprocessing.text import Tokenizer

tags_doc['doc_len'] = tags_doc["tags"].apply(lambda words: len(words.split(",")))
max_seq_len = np.round(tags_doc['doc_len'].mean() + tags_doc['doc_len'].std()).astype(int)
docs = tags_doc["tags"].tolist()
processed_docs = " ".join(docs).split(" , ")
print("tokenizing input data...")
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True, char_level=False)
tokenizer.fit_on_texts(processed_docs)  # leaky
word_sequence = tokenizer.texts_to_sequences(processed_docs)
word_index = tokenizer.word_index
print("dictionary size: ", len(word_index))

from keras.preprocessing import sequence

# pad sequences
word_padded = sequence.pad_sequences(word_sequence, maxlen=max_seq_len)
# split the data into a training set and a validation set
indices = np.arange(word_padded.shape[0])
np.random.shuffle(indices)
data = word_padded[indices]
VALIDATION_SPLIT = 0.2
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
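One detail worth noting (a hedged addition, since the labels are not shown in the post): any target array must be shuffled and split with the same indices, otherwise x and y fall out of alignment. A minimal sketch, assuming a hypothetical NumPy labels array with one row per row of word_padded:

# hypothetical labels array aligned row-for-row with word_padded
labels = labels[indices]  # reuse the SAME shuffled indices as the inputs
y_train = labels[:-nb_validation_samples]
y_val = labels[-nb_validation_samples:]
assert len(x_train) == len(y_train) and len(x_val) == len(y_val)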

The shape of x_train is (904995, 15) and the shape of x_val is (226248, 15).

embed_dim = 300
# rows stay all-zeros for words not found in the GloVe index
embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
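A quick coverage check (an added sketch, not in the original post) shows how many vocabulary words actually received a GloVe vector:

# count vocabulary words present in the GloVe index
covered = sum(1 for word in word_index if word in embeddings_index)
print("GloVe covers %d of %d vocabulary words" % (covered, len(word_index)))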

Then I plug that matrix into the Keras functional API:

from keras.layers import Dropout, Embedding, Input

embedding_layer = Embedding(len(word_index) + 1,
                            embed_dim,
                            weights=[embedding_matrix],
                            input_length=max_seq_len,
                            trainable=False)
sequence_input = Input(shape=(max_seq_len,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
embedded_sequences = Dropout(0.2)(embedded_sequences)
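The rest of the model is not shown in the post. Purely as a hedged illustration (the pooling and Dense head below are assumptions, not the author's architecture), a minimal single-input model around this layer would look like this; the point is that fit() requires every input array and y to share the same first dimension:

from keras.layers import Dense, GlobalAveragePooling1D
from keras.models import Model

# hypothetical minimal head, just to show where fit() plugs in
x = GlobalAveragePooling1D()(embedded_sequences)
output = Dense(1, activation='sigmoid')(x)
model = Model(sequence_input, output)
model.compile(optimizer='adam', loss='binary_crossentropy')
# fit() checks that all input arrays and y have the same number of samples
# model.fit(x_train, y_train, validation_data=(x_val, y_val))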

Then when I fit my model I get this error:

ValueError: All input arrays (x) should have the same number of samples. Got array shapes: [(64642, 1), (64642, 1), (904995, 15)]

I understand that my problem comes from the shape of my sequence inputs (x_train, x_val), but I don't know how to solve it.

It seems the lengths of x_train and y_train are not equal. Check their lengths:

    len(x_train)
    len(y_train)
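Going one step further (an assumption based on the shapes in the error message, not something stated in the answer): the text input ends up with 904,995 rows because " ".join(docs).split(" , ") flattens every tag into its own sequence, while the other model inputs have 64,642 rows each. Tokenizing one document per user keeps the row counts aligned; a sketch, reusing the tokenizer and docs list from the question:

# keep one document per user instead of one sequence per tag,
# so the padded matrix has one row per user
word_sequence = tokenizer.texts_to_sequences(docs)
word_padded = sequence.pad_sequences(word_sequence, maxlen=max_seq_len)
print(word_padded.shape)  # first dimension now matches the number of users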
