   user_id                                               tags
0      234  drama , police , year , perfect , space , mech...
1      382  short normal , city , movie short , thriller ,...
2      741  world , tv short seasonal , school , life , pe...
I previously computed the 15 most relevant tag words for each user in my dataframe above, and I built a pretrained embedding matrix from the GloVe dataset:
import numpy as np
from tqdm import tqdm

GLOVE = 'Mypath/Anime_project/glove.6B.300d.txt'

embeddings_index = {}
with open(GLOVE, encoding='utf8') as f:
    for line in tqdm(f):
        values = line.rstrip().split(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
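The GloVe file is plain text, one word per line followed by its vector components. A minimal sketch of the same parsing loop, using a two-line in-memory stand-in instead of the real 400k-row file (the fake lines and their 3-dimensional vectors are illustrative only):

```python
import io
import numpy as np

# Two fake GloVe-style lines; the real glove.6B.300d.txt has 300 floats per word.
fake_glove = io.StringIO("the 0.1 0.2 0.3\nmovie 0.4 0.5 0.6\n")

embeddings_index = {}
for line in fake_glove:
    values = line.rstrip().split(' ')
    # first token is the word, the rest are the vector components
    embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

print(embeddings_index["movie"].shape)  # (3,)
```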
Then I used Keras's Tokenizer:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

tags_doc['doc_len'] = tags_doc["tags"].apply(lambda words: len(words.split(",")))
max_seq_len = np.round(tags_doc['doc_len'].mean() + tags_doc['doc_len'].std()).astype(int)

docs = tags_doc["tags"].tolist()
processed_docs = " ".join(docs).split(" , ")

print("tokenizing input data...")
# MAX_NB_WORDS is set earlier (cap on the tokenizer's vocabulary size)
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True, char_level=False)
tokenizer.fit_on_texts(processed_docs)  # leaky
word_sequence = tokenizer.texts_to_sequences(processed_docs)
word_index = tokenizer.word_index
print("dictionary size: ", len(word_index))

# pad sequences
word_padded = sequence.pad_sequences(word_sequence, maxlen=max_seq_len)
# split the data into a training set and a validation set
indices = np.arange(word_padded.shape[0])
np.random.shuffle(indices)
data = word_padded[indices]
VALIDATION_SPLIT=0.2
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
The shape of x_train is (904995, 15) and x_val is (226248, 15).
embed_dim = 300

embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embedding index will be all-zeros
        embedding_matrix[i] = embedding_vector
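It can be worth checking how many tokenizer words actually receive a GloVe vector, since out-of-vocabulary rows silently stay all-zeros. A sketch of that coverage check with toy stand-ins for embeddings_index and word_index (the words and the 4-dimensional vectors are hypothetical):

```python
import numpy as np

embed_dim = 4
# Toy embedding index: only two of the three tokenizer words are covered.
embeddings_index = {
    "drama": np.ones(embed_dim, dtype="float32"),
    "police": np.full(embed_dim, 2.0, dtype="float32"),
}
word_index = {"drama": 1, "police": 2, "mecha": 3}

embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
hits = 0
for word, i in word_index.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec
        hits += 1

# Row 3 ("mecha") stays all-zeros because it has no GloVe vector.
print(f"coverage: {hits}/{len(word_index)}")  # coverage: 2/3
```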
Then I add that matrix to an Embedding layer in the Keras functional API:
from keras.layers import Input, Embedding, Dropout

embedding_layer = Embedding(len(word_index) + 1,
                            embed_dim,
                            weights=[embedding_matrix],
                            input_length=max_seq_len,
                            trainable=False)

sequence_input = Input(shape=(max_seq_len,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
embedded_sequences = Dropout(0.2)(embedded_sequences)
Then when I fit my model I get this error:
ValueError: All input arrays (x) should have the same number of samples. Got array shapes: [(64642, 1), (64642, 1), (904995, 15)]
I understand that the problem comes from the shape of my sequence inputs (x_train, x_val), but I don't know how to solve it.
It looks like x_train and y_train do not have the same number of samples: the error reports 904995 rows in x but only 64642 in the target arrays. Check their lengths:

len(x_train)
len(y_train)
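In the code above, only word_padded is shuffled and split; whatever targets are passed to fit() never go through the same permutation and split, which is why their sample counts disagree. A minimal sketch of keeping x and y aligned, with toy arrays in place of the real data (shapes and values here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy x/y with matching first dimensions.
n_samples, max_seq_len = 10, 15
x = rng.integers(0, 100, size=(n_samples, max_seq_len))
y = np.arange(n_samples)

# Shuffle x and y with the SAME index permutation...
indices = np.arange(n_samples)
rng.shuffle(indices)
x, y = x[indices], y[indices]

# ...and split both with the same cut-off.
VALIDATION_SPLIT = 0.2
nb_val = int(VALIDATION_SPLIT * n_samples)
x_train, x_val = x[:-nb_val], x[-nb_val:]
y_train, y_val = y[:-nb_val], y[-nb_val:]

assert len(x_train) == len(y_train)
assert len(x_val) == len(y_val)
```

Applying the same pattern to word_padded and the real targets before calling fit() keeps every (x, y) pair together.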