Converting the TfidfVectorizer fit_transform result to an array (.toarray()) makes everything zero?

Question

I'm experimenting with tf-idf on a sample dataset, and everything works fine until I convert my fit_transform result to an array. I am trying to view my "features" after applying tf-idf, and the values make sense when I print them. However, when I print the result as an array, all values become zero for some reason.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

filename = 'test.csv'
df = pd.read_csv(filename)

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.Comments.astype('U'))
label_map = {'Y': 1, 'N': 0}
labels = df.Non_Comp_IND.map(label_map)
print(features)

The output of the print statement on the last line is:

  (0, 6433)     0.1354882591295125
  (0, 18430)    0.057506963357173674
  (0, 16902)    0.0887002305355381
  (0, 17540)    0.46335455366392575
  (0, 19175)    0.2159334960329325
  (0, 16590)    0.15130364285967984
  (0, 9104)     0.15285500637985408
  (0, 16595)    0.1890315464705662
  :     :
  (24455, 14202)    0.17695626302265938
  (24455, 6699)     0.2309569171857742
  (24455, 10308)    0.2279428326498053
  (24455, 16678)    0.2343740044032419
  (24455, 12122)    0.23831874209561996
  (24455, 18919)    0.23831874209561996

The numbers above make sense, but when I change that line to print in array format, print(features.toarray()), this is what I get:

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

I did check each value, for example features.toarray()[3][10], and they are all 0.0.

Answer 1

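I found that I was testing it incorrectly. After learning more about tf-idf matrices, I realized there actually are values in the array. Every word has its own column, so a given document's row only has a value in the columns for the unique words that appear in that document. The entries are not all zero.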
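To check this without inspecting cells by hand, one option is to count the non-zero entries and look up the dense array at the coordinates the sparse matrix reports. The sketch below uses a small made-up corpus (`docs` is a hypothetical stand-in for `df.Comments`) and default `TfidfVectorizer` settings, so it illustrates the idea rather than reproducing the asker's data. Note also that NumPy abbreviates large arrays when printing, showing only the first and last few rows and columns, so a mostly-zero matrix can easily print as nothing but zeros.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for df.Comments
docs = [
    "the pump failed inspection",
    "inspection passed without issue",
    "pump replaced after failed test",
]

tfidf = TfidfVectorizer()
features = tfidf.fit_transform(docs)  # SciPy sparse matrix
dense = features.toarray()            # dense NumPy array, mostly zeros

# The number of non-zero entries is the same in both representations
nnz_sparse = features.nnz
nnz_dense = np.count_nonzero(dense)
print(dense.shape, nnz_sparse, nnz_dense)

# Index the dense array at the (row, column) pairs stored in the sparse matrix
coo = features.tocoo()
values_match = np.allclose(dense[coo.row, coo.col], coo.data)
print(values_match)

# Show the non-zero columns of the first document only
first_row_cols = np.flatnonzero(dense[0])
print(first_row_cols, dense[0, first_row_cols])
```

On the original data, the same approach would mean checking a coordinate taken from the sparse printout, such as features.toarray()[0][6433], rather than an arbitrary cell like [3][10].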

Answer 1
0 Accepted 2019-06-05 16:52:06
