
Converting TfidfVectorizer's fit_transform variable to an array (.toarray()) makes everything zero?

I'm experimenting with tf-idf on a sample dataset, and everything works fine up until I convert my fit_transform result to an array. I am trying to view my "features" after applying tf-idf, and the values make sense when I print the result directly. However, when I print it as an array, all the values become zero for some reason.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

filename = 'test.csv'
df = pd.read_csv(filename)

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.Comments.astype('U'))
label_map = {'Y': 1, 'N': 0}
labels = df.Non_Comp_IND.map(label_map)
print(features)

The output of the final print statement is:

  (0, 6433)     0.1354882591295125
  (0, 18430)    0.057506963357173674
  (0, 16902)    0.0887002305355381
  (0, 17540)    0.46335455366392575
  (0, 19175)    0.2159334960329325
  (0, 16590)    0.15130364285967984
  (0, 9104)     0.15285500637985408
  (0, 16595)    0.1890315464705662
  : :
  (24455, 14202)    0.17695626302265938
  (24455, 6699)     0.2309569171857742
  (24455, 10308)    0.2279428326498053
  (24455, 16678)    0.2343740044032419
  (24455, 12122)    0.23831874209561996
  (24455, 18919)    0.23831874209561996

The numbers above make sense, but when I change the line to print the result in array form with print(features.toarray()), this is what I get:

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

I checked values individually, for example features.toarray()[3][10], and they were all 0.0.

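I figured out that I was testing this incorrectly. After learning more about the tf-idf matrix, I found that it does contain values. Each word has its own column, so a row only has a value in the columns for the words that occur in that document. The entries are not all zero.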
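One way to confirm this is to compare the number of stored entries in the sparse matrix with the number of non-zero entries in the dense array, and to index the dense array at the column positions the sparse printout reports instead of at arbitrary positions such as [3][10]. The following is a minimal sketch rather than the asker's actual pipeline: the `docs` list is a hypothetical toy corpus standing in for df.Comments, and the vectorizer uses default settings instead of the parameters from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (assumption: stands in for df.Comments).
docs = [
    "the pump failed during the night shift",
    "valve inspection passed without issues",
    "pump pressure reading was out of range",
    "night shift reported a leaking valve",
]

tfidf = TfidfVectorizer()
features = tfidf.fit_transform(docs)  # SciPy sparse matrix
dense = features.toarray()            # same values, zeros filled in

# The conversion keeps every weight: both counts should match.
stored = features.nnz
nonzero_dense = int(np.count_nonzero(dense))

# Columns that hold a weight for the first document, and those weights
# read back from the dense array.
row0_cols = features[0].nonzero()[1]
row0_values = dense[0, row0_cols]

print(features.shape, stored, nonzero_dense)
print(row0_cols, row0_values)
```

With a vocabulary of many thousands of columns, as the column indices in the question suggest, each row is almost entirely zeros, and NumPy's default print output abbreviates large arrays to their first and last few entries. Both effects make the dense array look empty even though the non-zero weights are still present.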

