I'm trying to create the TF-IDF from my TF_norm matrix and IDF vector. I know that they don't have the same dimensions, so I'm lost as to how I can multiply the two together. Do I need to add or reduce something in the TF_norm matrix, or convert the IDF vector? I'm completely lost from here.
#c) Normalized term frequency
count=0
total=lexicon_dim
matrix_TF_norm=[[0 for c in range(lexicon_dim)] for r in range(4)]
for c in lexicon:
    matrix_TF_norm[0][count]=c
    matrix_TF_norm[1][count]=hamlet_tok_norm_stop_stem.count(c)/total
    matrix_TF_norm[2][count]=macbeth_tok_norm_stop_stem.count(c)/total
    matrix_TF_norm[3][count]=pinocchio_tok_norm_stop_stem.count(c)/total
    count=count+1
print(matrix_TF_norm)
#d) TF-IDF
vector_idf=[] #initialize IDF vector
for i in range(lexicon_dim): #run through loop for each token in lexicon
    df=0
    if matrix_binary[1][i]==1: #[1] = doc1
        df=df+1
    if matrix_binary[2][i]==1:
        df=df+1
    if matrix_binary[3][i]==1:
        df=df+1
    #add them together
    idf=math.log(3/df)
    vector_idf.append(idf)
print(vector_idf)
import numpy as np
vector_idf=np.diag(vector_idf)
tf_idf=np.cross(vector_idf,matrix_TF_norm)
It's kind of hard to follow your code, but I can break down the dimensions and arithmetic operations.
You have a vocabulary of size $N$, which was extracted from some collection of texts.
You have $N$ IDF weights. These can be stored either as a vector of size $1 \times N$ or as the diagonal of an $N \times N$ matrix that is all zeros otherwise; both work, depending on the eventual arithmetic.
You have a collection of $K$ texts (it doesn't have to be the original collection used to extract the vocabulary). Each text is tokenized, according to the vocabulary, into a vector of size $N$ of term-frequency counts, so the entire collection of $K$ texts becomes a matrix of size $K \times N$.
So now you have tf_matrix of size $K \times N$, and either idf_matrix of size $N \times N$ or idf_vector of size $1 \times N$. To get the tf_idf_matrix you need either a matrix multiplication, tf_matrix * idf_matrix, or an element-wise multiplication of the matrix by the vector, tf_matrix * idf_vector. Both achieve the goal of multiplying every $i$-th tf by the $i$-th idf weight. Hope this helps!
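To make the two options concrete, here is a minimal NumPy sketch. The toy terms and numbers are made up for illustration (they are not from the question's data), and it assumes matrix_TF_norm keeps the layout used in the question, with the term strings in row 0 and one row per document after that. That string row has to be dropped before any arithmetic, so that what remains is a $K \times N$ array of numbers.

```python
import numpy as np

# Hypothetical toy data: K = 3 documents, N = 4 vocabulary terms.
# Row 0 holds the terms themselves, mirroring matrix_TF_norm in the question.
matrix_TF_norm = [
    ["ghost", "king", "puppet", "sword"],
    [0.50, 0.25, 0.00, 0.25],
    [0.00, 0.50, 0.00, 0.50],
    [0.00, 0.00, 1.00, 0.00],
]
# IDF = log(3 / df), with document frequencies 1, 2, 1, 2 for the terms above.
vector_idf = [np.log(3 / 1), np.log(3 / 2), np.log(3 / 1), np.log(3 / 2)]

# Drop the row of term strings so only numbers remain: shape (K, N) = (3, 4).
tf_matrix = np.array(matrix_TF_norm[1:], dtype=float)
idf_vector = np.array(vector_idf, dtype=float)  # shape (N,) = (4,)

# Option 1: element-wise product; broadcasting applies idf_vector to every row.
tf_idf_broadcast = tf_matrix * idf_vector

# Option 2: matrix product with the N x N diagonal IDF matrix.
idf_matrix = np.diag(idf_vector)
tf_idf_matmul = tf_matrix @ idf_matrix

print(tf_idf_broadcast.shape)
print(np.allclose(tf_idf_broadcast, tf_idf_matmul))
```

Note that in NumPy `*` is always element-wise and the matrix product is written `@` (or `np.matmul`). `np.cross` computes the vector cross product, which is not the operation needed here.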