How does vectorizer fit_transform work in sklearn?

I'm trying to understand the following code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['This is the first document.', 'This is the second second document.', 'And the third one.', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)

When I print X to see what is returned, I get this result:

(0, 1)  1
(0, 2)  1
(0, 6)  1
(0, 3)  1
(0, 8)  1
(1, 5)  2
(1, 1)  1
(1, 6)  1
(1, 3)  1
(1, 8)  1
(2, 4)  1
(2, 7)  1
(2, 0)  1
(2, 6)  1
(3, 1)  1
(3, 2)  1
(3, 6)  1
(3, 3)  1
(3, 8)  1

However, I don't understand the meaning of this result.

You can interpret this as "(sentence_index, feature_index) count".

There are 4 sentences, so sentence_index starts at 0 and ends at 3.

feature_index is the word index, which you can get from vectorizer.vocabulary_

-> vocabulary_ is a dictionary {word: feature_index, ...}

So for the example (0, 1) 1:

-> 0 : the row (the sentence index)

-> 1 : the feature index; the word is the key in vectorizer.vocabulary_ whose value is 1 (for this corpus, "document")

-> 1 : the count/tfidf value (as you have used a count vectorizer, it gives you the count)
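To make the lookup concrete, here is a minimal sketch that inverts vocabulary_ and prints each stored entry with its word instead of its feature index. The names index_to_word and decoded are just illustrative, and it assumes a reasonably recent scikit-learn where fit_transform returns a SciPy sparse matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> column index; invert it to look words up by index.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# COO format exposes row, column and value arrays that can be zipped together.
coo = X.tocoo()
decoded = [
    (int(row), index_to_word[int(col)], int(count))
    for row, col, count in zip(coo.row, coo.col, coo.data)
]

for row, word, count in sorted(decoded):
    print(f"sentence {row}: {word!r} appears {count} time(s)")
```

Feature indices follow the alphabetical order of the lowercased words, which is why index 1 is "document" and index 5 is "second" for this corpus.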

If you use a tfidf vectorizer instead of a count vectorizer, it will give you tfidf values. I hope I made it clear.
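As a rough illustration of the tfidf case, the sketch below swaps in TfidfVectorizer with its default settings on the same corpus. The variable names are my own, and the exact weight depends on the defaults of your scikit-learn version, so no specific value is shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

# Same (sentence_index, feature_index) layout, but the stored values
# are float tf-idf weights instead of integer counts.
second_col = tfidf.vocabulary_['second']
weight = X_tfidf[1, second_col]

print(X_tfidf.shape)  # 4 sentences by 9 distinct words
print(weight)         # a float weight rather than the raw count 2
```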

As @Himanshu writes, this is a "(sentence_index, feature_index) count".

Here, the count part is the "number of times a word appears in a document".

For example,

(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 2  <- only for this entry, the count "2" tells you that the word "second" appears twice in this document/sentence
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1

Let's change the corpus in your code. Basically, I added the word "second" two more times to the second sentence of the corpus list, so it now appears four times.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['This is the first document.', 'This is the second second second second document.', 'And the third one.', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)

(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 4  <- for the modified corpus, the count "4" tells you that the word "second" appears four times in this document/sentence
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1
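If you want to check a single entry without reading the whole printout, something like the following should work. count_of is a hypothetical helper written for this example, not part of sklearn; it refits a fresh CountVectorizer on each call, which is fine for a toy corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

original = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
modified = [
    'This is the first document.',
    'This is the second second second second document.',
    'And the third one.',
    'Is this the first document?',
]


def count_of(word, sentence_index, corpus):
    """Return how often `word` occurs in corpus[sentence_index] (hypothetical helper)."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    # vocabulary_ gives the column (feature index) that belongs to the word.
    return int(X[sentence_index, vectorizer.vocabulary_[word]])


print(count_of('second', 1, original))  # 2
print(count_of('second', 1, modified))  # 4
```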

It transforms text to numbers. With other functions you can then count how many times each word occurs in the given data set. I'm new to programming, so there may be other uses for it as well.
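One way to get those per-word totals, as a sketch: sum the count matrix over its rows and pair each column total with its word. word_totals is an illustrative name, and np.asarray(...).ravel() is there only to flatten whatever matrix type X.sum returns.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Summing over axis 0 adds up each column, i.e. each word across all sentences.
totals = np.asarray(X.sum(axis=0)).ravel()

word_totals = {
    word: int(totals[index]) for word, index in vectorizer.vocabulary_.items()
}

for word, total in sorted(word_totals.items()):
    print(word, total)
```

For this corpus, "the" comes out at 4 because it occurs once in every sentence.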


