
Why do I have a different number of terms in Word2Vec and TF-IDF? How can I fix it?

I need to multiply the term weights in the TF-IDF matrix by the word embeddings in the Word2Vec matrix, but I can't, because each matrix has a different number of terms. I am using the same corpus to build both matrices, so I don't understand why they end up with different numbers of terms.

My problem is that my TF-IDF matrix has shape (56096, 15500) (number of terms, number of documents) and my Word2Vec matrix has shape (300, 56184) (embedding dimensionality, number of terms).
I need the same number of terms in both matrices.

I use this code to get the Word2Vec embedding matrix:

import nltk
import numpy as np
from gensim.models import word2vec

def w2vec_gensim(norm_corpus):
    wpt = nltk.WordPunctTokenizer()
    tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]
    # Set values for various parameters
    feature_size = 300       # Word vector dimensionality
    window_context = 10      # Context window size
    min_word_count = 1       # Minimum word count
    sample = 1e-3            # Downsample setting for frequent words
    # gensim < 4.0 API; in gensim >= 4.0 these are vector_size=, epochs=,
    # and the vocabulary is w2v_model.wv.key_to_index
    w2v_model = word2vec.Word2Vec(tokenized_corpus, size=feature_size,
                                  window=window_context, min_count=min_word_count,
                                  sample=sample, iter=100)
    words = list(w2v_model.wv.vocab)
    vectors = []
    for w in words:
        vectors.append(w2v_model.wv[w].tolist())  # w2v_model[w] is deprecated
    embedding_matrix = np.array(vectors)
    embedding_matrix = embedding_matrix.T         # shape (300, number of terms)
    print(embedding_matrix.shape)

    return embedding_matrix
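For example, on a tiny hypothetical corpus, just to show the call:

norm_corpus = ["the cat sat on the mat", "the dog chased the cat"]
embedding_matrix = w2vec_gensim(norm_corpus)   # prints (300, 7): 7 unique tokens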

And this code to get the TF-IDF matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(min_df=0., max_df=1., norm='l2', use_idf=True, smooth_idf=True)


def matriz_tf_idf(datos, tv):
    tv_matrix = tv.fit_transform(datos)
    tv_matrix = tv_matrix.toarray()
    tv_matrix = tv_matrix.T   # transpose to shape (number of terms, number of documents)
    return tv_matrix
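Used like this, continuing the toy corpus above (get_feature_names_out requires scikit-learn >= 1.0; older versions call it get_feature_names):

tfidf_matrix = matriz_tf_idf(norm_corpus, tv)
terms = tv.get_feature_names_out()    # row i of tfidf_matrix corresponds to terms[i]
print(tfidf_matrix.shape)             # (number of terms, number of documents)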

And I need the same number of terms in each matrix. For example, if I have 56096 terms in TF-IDF, I need the same number in the embedding matrix, i.e. a TF-IDF matrix with shape (56096, 15500) and a Word2Vec embedding matrix with shape (300, 56096). How can I get the same number of terms in both matrices? I can't just delete the extra terms without more information, because the multiplication has to make sense: my goal is to get the embeddings of the documents.
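Concretely, the multiplication I am after is this (a sketch, assuming the terms in both matrices line up in the same order):

# embedding_matrix: (300, 56096), tfidf_matrix: (56096, 15500)
doc_embeddings = embedding_matrix @ tfidf_matrix   # -> (300, 15500), one column per document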

Thank you very much in advance.

The problem is that TfidfVectorizer's built-in tokenizer splits the text differently from nltk's WordPunctTokenizer, so TF-IDF cuts out around 90 terms. Passing the same tokenizer to both is necessary. This is the solution:

wpt = nltk.WordPunctTokenizer()
tv = TfidfVectorizer(min_df=0., max_df=1., norm='l2', use_idf=True, smooth_idf=True,
                     tokenizer=wpt.tokenize)   # same tokenizer as used for Word2Vec
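One caveat on top of this fix: even with matching term counts, the two matrices can still list the terms in different orders (TfidfVectorizer sorts its vocabulary alphabetically, while gensim orders it by frequency), so the rows have to be aligned before multiplying. A minimal sketch, assuming norm_corpus is already lowercased (TfidfVectorizer lowercases by default) and assuming the trained w2v_model is available outside w2vec_gensim (e.g. returned alongside the matrix), reusing the fitted tv and tfidf_matrix from above:

terms = tv.get_feature_names_out()   # the TF-IDF term order (alphabetical)
embedding_matrix = np.array([w2v_model.wv[t] for t in terms]).T   # (300, number of terms)
doc_embeddings = embedding_matrix @ tfidf_matrix                  # (300, number of documents)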
