
Why do I have a different number of terms in Word2Vec and TF-IDF? How can I fix it?

I need to multiply the term weights in the TF-IDF matrix by the word embeddings in the Word2Vec matrix, but I can't, because each matrix has a different number of terms. I am using the same corpus to build both matrices, so I don't understand why they end up with different numbers of terms.

My problem is that my TF-IDF matrix has shape (56096, 15500) (number of terms, number of documents) and my Word2Vec matrix has shape (300, 56184) (embedding dimensionality, number of terms).
I need the same number of terms in both matrices.

I use this code to get the Word2Vec embedding matrix:

import nltk
import numpy as np
from gensim.models import word2vec

def w2vec_gensim(norm_corpus):
    wpt = nltk.WordPunctTokenizer()
    tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]
    # Set values for various parameters
    feature_size = 300       # Word vector dimensionality
    window_context = 10      # Context window size
    min_word_count = 1       # Minimum word count
    sample = 1e-3            # Downsample setting for frequent words
    # gensim < 4.0 API; in gensim >= 4.0 these are vector_size=, epochs=,
    # and the vocabulary is w2v_model.wv.key_to_index
    w2v_model = word2vec.Word2Vec(tokenized_corpus, size=feature_size,
                                  window=window_context, min_count=min_word_count,
                                  sample=sample, iter=100)
    words = list(w2v_model.wv.vocab)
    vectors = []
    for w in words:
        vectors.append(w2v_model.wv[w].tolist())  # w2v_model[w] is deprecated
    embedding_matrix = np.array(vectors)
    embedding_matrix = embedding_matrix.T         # shape (300, number of terms)
    print(embedding_matrix.shape)

    return embedding_matrix
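For example, on a tiny hypothetical corpus, just to show the call:

norm_corpus = ["the cat sat on the mat", "the dog chased the cat"]
embedding_matrix = w2vec_gensim(norm_corpus)   # prints (300, 7): 7 unique tokens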

And this code to get the TF-IDF matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(min_df=0., max_df=1., norm='l2', use_idf=True, smooth_idf=True)


def matriz_tf_idf(datos, tv):
    tv_matrix = tv.fit_transform(datos)
    tv_matrix = tv_matrix.toarray()
    tv_matrix = tv_matrix.T   # transpose to shape (number of terms, number of documents)
    return tv_matrix
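Used like this, continuing the toy corpus above (get_feature_names_out requires scikit-learn >= 1.0; older versions call it get_feature_names):

tfidf_matrix = matriz_tf_idf(norm_corpus, tv)
terms = tv.get_feature_names_out()    # row i of tfidf_matrix corresponds to terms[i]
print(tfidf_matrix.shape)             # (number of terms, number of documents)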

And I need the same number of terms in each matrix. For example, if I have 56096 terms in TF-IDF, I need the same number in the embedding matrix, i.e. a TF-IDF matrix with shape (56096, 15500) and a Word2Vec embedding matrix with shape (300, 56096). How can I get the same number of terms in both matrices? I can't just delete the extra terms without more information, because the multiplication has to make sense: my goal is to get the embeddings of the documents.
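Concretely, the multiplication I am after is this (a sketch, assuming the terms in both matrices line up in the same order):

# embedding_matrix: (300, 56096), tfidf_matrix: (56096, 15500)
doc_embeddings = embedding_matrix @ tfidf_matrix   # -> (300, 15500), one column per document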

Thank you very much in advance.

The problem is that TfidfVectorizer's built-in tokenizer splits the text differently from nltk's WordPunctTokenizer, so TF-IDF cuts out around 90 terms. Passing the same tokenizer to both is necessary. This is the solution:

wpt = nltk.WordPunctTokenizer()
tv = TfidfVectorizer(min_df=0., max_df=1., norm='l2', use_idf=True, smooth_idf=True,
                     tokenizer=wpt.tokenize)   # same tokenizer as used for Word2Vec
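One caveat on top of this fix: even with matching term counts, the two matrices can still list the terms in different orders (TfidfVectorizer sorts its vocabulary alphabetically, while gensim orders it by frequency), so the rows have to be aligned before multiplying. A minimal sketch, assuming norm_corpus is already lowercased (TfidfVectorizer lowercases by default) and assuming the trained w2v_model is available outside w2vec_gensim (e.g. returned alongside the matrix), reusing the fitted tv and tfidf_matrix from above:

terms = tv.get_feature_names_out()   # the TF-IDF term order (alphabetical)
embedding_matrix = np.array([w2v_model.wv[t] for t in terms]).T   # (300, number of terms)
doc_embeddings = embedding_matrix @ tfidf_matrix                  # (300, number of documents)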
