Word2vec with Conv1D for text classification confusion

I am doing text classification and plan to use word2vec word embeddings, feeding them into Conv1D layers. I have a dataframe which contains the texts and the corresponding labels (sentiments). I used the gensim module's word2vec algorithm to generate the word-embedding model. The code I used:

import pandas as pd
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

df = pd.read_csv('emotion_merged_dataset.csv')
texts = df['text']
labels = df['sentiment']

# tokenize each text into a list of words, as expected by gensim's Word2Vec
df_tokenized = df.apply(lambda row: word_tokenize(row['text']), axis=1)
model = Word2Vec(df_tokenized, min_count=1)

I plan to use a CNN with this word-embedding model. But how should I use the word-embedding model with my CNN? What should my input be?

I plan to use something like this (obviously not with the same hyper-parameters):

from tensorflow.keras import Sequential, layers

model = Sequential()
model.add(layers.Embedding(max_features, 128, input_length=max_len))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1))

Can somebody help me out and point me in the right direction? Thanks in advance.

Sorry for the late response, I hope it is still useful for you. Depending on your application you may need to download a specific word-embedding file; for example, here you have the GloVe files:

import numpy as np

EMBEDDING_FILE = 'glove.6B.50d.txt'

embed_size = 50      # how big is each word vector
max_features = 20000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 100         # max number of words in a text to use

# parse the GloVe file into a dict: word -> embedding vector
def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE, encoding='utf8'))

# initialise words that are missing from GloVe with the mean/std of the pre-trained vectors
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()

word_index = tokenizer.word_index  # tokenizer is a Keras Tokenizer already fit on the texts
nb_words = min(max_features, len(word_index))
# +1 because Keras word indices start at 1 (index 0 is reserved for padding)
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words + 1, embed_size))
for word, i in word_index.items():
    if i > nb_words: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

from tensorflow.keras.layers import Input, Embedding

# this is how you load the pre-trained weights into the embedding layer
inp = Input(shape=(maxlen,))
x = Embedding(nb_words + 1, embed_size, weights=[embedding_matrix])(inp)
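Note that tokenizer above is assumed to be a Keras Tokenizer that has already been fit on your texts; the padded integer sequences it produces are what you actually feed into the network, which answers your "what should be my input" question. A minimal sketch of that preprocessing, assuming the texts column from your question and the maxlen / max_features values defined above:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# build the vocabulary, keeping only the max_features most frequent words
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(texts)

# turn each text into a sequence of word indices, padded/truncated to maxlen
sequences = tokenizer.texts_to_sequences(texts)
X = pad_sequences(sequences, maxlen=maxlen)  # shape (num_texts, maxlen): this is the model input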

The embedding-matrix code above is taken from Jeremy Howard. I think this is all you need; if you want to load a different embedding file, the process is pretty similar, usually you only have to change the file you load.
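Since you already trained your own word2vec model with gensim, you do not even need a GloVe file: you can fill the embedding matrix from your model's vectors instead. A rough sketch, assuming a gensim 4.x Word2Vec model named model (trained as in your question) plus the fitted tokenizer and max_features from above:

import numpy as np

embed_size = model.wv.vector_size             # dimensionality of your trained vectors
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))

# rows 1..nb_words hold your word2vec vectors; row 0 stays zero for padding,
# and words the word2vec model has never seen also stay zero
embedding_matrix = np.zeros((nb_words + 1, embed_size))
for word, i in word_index.items():
    if i > nb_words:
        continue
    if word in model.wv:                      # membership test on the trained vectors
        embedding_matrix[i] = model.wv[word]

x = Embedding(nb_words + 1, embed_size, weights=[embedding_matrix], trainable=False)(inp)

Setting trainable=False keeps the pre-trained vectors frozen, which often generalises better on small datasets; you can also leave the layer trainable and let the vectors be fine-tuned during training.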
