Converting TfidfVectorizer's fit_transform variable to an array (.toarray()) makes everything zero?
I'm experimenting with tf-idf on a sample dataset, and everything works fine until I convert the result of fit_transform to an array. I am trying to view my "features" after applying tf-idf, and the values make sense when I print them. However, when I print them as an array, all values become zero for some reason.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
filename = 'test.csv'
df = pd.read_csv(filename)
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.Comments.astype('U'))
label_map = {'Y': 1, 'N': 0}
labels = df.Non_Comp_IND.map(label_map)
print(features)
The result of the print statement on the last line is this:
(0, 6433) 0.1354882591295125
(0, 18430) 0.057506963357173674
(0, 16902) 0.0887002305355381
(0, 17540) 0.46335455366392575
(0, 19175) 0.2159334960329325
(0, 16590) 0.15130364285967984
(0, 9104) 0.15285500637985408
(0, 16595) 0.1890315464705662
: :
(24455, 14202) 0.17695626302265938
(24455, 6699) 0.2309569171857742
(24455, 10308) 0.2279428326498053
(24455, 16678) 0.2343740044032419
(24455, 12122) 0.23831874209561996
(24455, 18919) 0.23831874209561996
The numbers above make sense, but when I change the line to print in array format,
print(features.toarray())
this is what I get:
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
I did check individual values, for example
features.toarray()[3][10]
and they are all 0.0.
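It turned out I was testing this incorrectly. After learning more about the tf-idf matrix, I found that it does contain values: each word has its own column, so a row only has a value in the columns for the unique words that appear in that document. The entries are not all zero.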
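One way to confirm this is to count the nonzero entries rather than read the printed array. NumPy abbreviates large arrays and shows only the first and last few rows and columns, which in a sparse tf-idf matrix are very likely to be zeros. The snippet below is a minimal sketch: the three-sentence corpus is made up and stands in for df.Comments, and the vectorizer uses default settings rather than the parameters from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for df.Comments
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds sing in the morning",
]

tfidf = TfidfVectorizer()
features = tfidf.fit_transform(docs)  # SciPy sparse matrix
dense = features.toarray()            # dense NumPy array, mostly zeros

# Compare the number of stored entries with the total number of cells
total_cells = dense.shape[0] * dense.shape[1]
print(features.nnz, "stored entries out of", total_cells, "cells")

# The dense array holds the same number of nonzero values
nonzero_dense = int((dense != 0).sum())
print(nonzero_dense, "nonzero values in the dense array")

# Index the dense array at a (row, column) pair that the sparse matrix reports
rows, cols = features.nonzero()
first_row, first_col = int(rows[0]), int(cols[0])
print(dense[first_row, first_col])
```

Applied to the question's data, indexing features.toarray() with one of the printed pairs, for example [0][6433], should return the value shown next to it, whereas an arbitrary position such as [3][10] is zero unless document 3 happens to contain term 10.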