
Python TF-IDF product

I'm trying to compute the TF-IDF from my TF_norm matrix and IDF vector. I know that they don't have the same dimensions, so I'm lost on how to multiply the two together. Do I need to reduce the TF_norm matrix somehow, or convert the IDF vector? Completely lost from here.

#c) Normalized term frequency
count=0 
total=lexicon_dim
matrix_TF_norm=[[0 for c in range(lexicon_dim)] for r in range(4)]
for c in lexicon:
    matrix_TF_norm[0][count]=c
    matrix_TF_norm[1][count]=hamlet_tok_norm_stop_stem.count(c)/total
    matrix_TF_norm[2][count]=macbeth_tok_norm_stop_stem.count(c)/total
    matrix_TF_norm[3][count]=pinocchio_tok_norm_stop_stem.count(c)/total
    count=count+1
print(matrix_TF_norm)
#d) TF-IDF
vector_idf=[] #initialize IDF vector
for i in range(lexicon_dim): #run through loop for each token in lexicon
    df=0
    if matrix_binary[1][i]==1: #[1] = doc1
        df=df+1
    if matrix_binary[2][i]==1:
        df=df+1
    if matrix_binary[3][i]==1:
        df=df+1
    #add them together
    idf=math.log(3/df)
    vector_idf.append(idf)
print(vector_idf)

import numpy as np
vector_idf=np.diag(vector_idf)
tf_idf=np.cross(vector_idf,matrix_TF_norm)

Your code is a bit hard to follow, but I can break down the dimensions and the arithmetic operations.

  • It all begins with a fixed vocabulary, let's say of size N, which was extracted from some collection of texts.
  • This means that you have N IDF weights. These can be stored either as a vector of size 1 x N or as the diagonal of an N x N matrix that is all zeros otherwise. Both work, depending on the arithmetic you use later.
  • Now let's say you have some collection of texts of size K (it doesn't have to be the original collection used to extract the vocabulary). Each text is tokenized, according to the vocabulary, into a vector of size N of term frequency counts, so the entire collection of K texts becomes a matrix of size K x N.
  • So we have tf_matrix of size K x N, and either idf_matrix of size N x N or idf_vector of size 1 x N. To get the tf_idf_matrix you need either a matrix multiplication of tf_matrix by idf_matrix (K x N times N x N gives K x N), or an element-wise multiplication of tf_matrix by idf_vector, with the vector applied to every row. Both achieve the goal of multiplying the i-th tf in every row by the i-th idf weight.
  • You can do some normalization in between some of these steps, but that will never change any of these dimensions, only the numeric values in the corresponding positions.
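
The two options above can be sketched in NumPy as follows. This is only a minimal illustration on made-up toy data (3 documents, 4 terms), not your actual matrices; the names tf_matrix, idf_vector and idf_matrix are the ones used in the list above, and the IDF formula log(K / df) mirrors the one in your loop.

```python
import numpy as np

# Toy data (assumed for illustration): K = 3 documents as rows,
# N = 4 vocabulary terms as columns, values are normalized term frequencies.
tf_matrix = np.array([
    [0.50, 0.25, 0.00, 0.25],
    [0.00, 0.50, 0.50, 0.00],
    [0.25, 0.25, 0.25, 0.25],
])
num_docs = tf_matrix.shape[0]

# Document frequency: number of documents in which each term occurs.
df = np.count_nonzero(tf_matrix, axis=0)

# One IDF weight per term, shape (N,).
idf_vector = np.log(num_docs / df)

# Option 1: element-wise product; NumPy broadcasts the (N,) vector
# across each of the K rows.
tf_idf_broadcast = tf_matrix * idf_vector

# Option 2: matrix product with the N x N diagonal matrix of IDF weights.
idf_matrix = np.diag(idf_vector)
tf_idf_matmul = tf_matrix @ idf_matrix

# Both routes give the same K x N result.
assert np.allclose(tf_idf_broadcast, tf_idf_matmul)
print(tf_idf_broadcast.shape)
```

Note that np.cross computes a vector cross product, which is a different operation from either of these. Also, if I read your layout correctly, row 0 of matrix_TF_norm holds the tokens themselves, so only the numeric rows would go into the product, for example np.array(matrix_TF_norm[1:], dtype=float), multiplied by the IDF list before it is passed through np.diag.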

Hope this helps!
