How does the vectorizer's fit_transform work in sklearn?

Question

I am trying to understand the following code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
print(X)

When I print X to see what is returned, I get the following result:

(0, 1)  1
(0, 2)  1
(0, 6)  1
(0, 3)  1
(0, 8)  1
(1, 5)  2
(1, 1)  1
(1, 6)  1
(1, 3)  1
(1, 8)  1
(2, 4)  1
(2, 7)  1
(2, 0)  1
(2, 6)  1
(3, 1)  1
(3, 2)  1
(3, 6)  1
(3, 3)  1
(3, 8)  1

However, I don't understand what this result means.

Answer 1

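You can read each entry as "(sentence_index, feature_index) count".

There are 4 sentences, so the sentence index starts at 0 and ends at 3.

The feature index is the index of a word, which you can get from vectorizer.vocabulary_

-> vocabulary_ is a dictionary {word: feature_index, ...}

So for the example (0, 1) 1

-> 0 : row [the sentence index]

-> 1 : feature index, i.e. the word whose value in vectorizer.vocabulary_ is 1 (here 'document'). The dictionary is keyed by word, so vectorizer.vocabulary_[1] itself would not work; you have to look the index up among the values.

-> 1 : count/tfidf (as you have used a count vectorizer, it will give you the count)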
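To make the reading above concrete, here is a minimal sketch that decodes every stored entry back into a word. It uses the corpus from the question; the names `index_to_word` and `decoded` are only illustrative, and inverting `vocabulary_` is one of several ways to go from a feature index to its word.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> feature index; invert it to look words up by index.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# Convert to coordinate format so the (row, column, value) triples can be iterated.
coo = X.tocoo()
decoded = [
    (int(row), index_to_word[int(col)], int(value))
    for row, col, value in zip(coo.row, coo.col, coo.data)
]

for row, word, count in decoded:
    print(f"sentence {row}: {word!r} appears {count} time(s)")
```

The entry (0, 1) 1 from the question shows up here as sentence 0, word 'document', count 1. Calling X.toarray() instead gives the same information as a dense 4 x 9 table, which is easier to read for a corpus this small.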

If you use a TfidfVectorizer instead of a CountVectorizer, it will give you tf-idf values in place of counts. I hope that makes it clear.
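As a rough illustration of that difference, the sketch below runs the same corpus through TfidfVectorizer with its default settings. The variable names are illustrative, and the exact weights depend on those defaults (smoothed idf, L2-normalised rows), so no specific values are quoted here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

# Same layout as before: one row per sentence, one column per vocabulary word,
# but the stored values are floating-point weights instead of integer counts.
dense = X_tfidf.toarray()
print(dense.round(2))

# With the default norm='l2', every row is scaled to unit length.
row_norms = (dense ** 2).sum(axis=1) ** 0.5
print(row_norms)
```

The (sentence_index, feature_index) part of each entry is read exactly as before; only the value changes.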

Answer 2

As @Himanshu wrote, each entry is a "(sentence_index, feature_index) count".

Here, the count part is the number of times the word appears in the document.

For example,

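(0, 1)  1
(0, 2)  1
(0, 6)  1
(0, 3)  1
(0, 8)  1
(1, 5)  2   <- for this example only, the count "2" means that the word "second" appears twice in this document/sentence
(1, 1)  1
(1, 6)  1
(1, 3)  1
(1, 8)  1
(2, 4)  1
(2, 7)  1
(2, 0)  1
(2, 6)  1
(3, 1)  1
(3, 2)  1
(3, 6)  1
(3, 3)  1
(3, 8)  1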

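Let's change the corpus in the code. Basically, I added the word "second" two more times to the second sentence of the corpus list, so it now appears four times.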

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
print(X)

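(0, 1)  1
(0, 2)  1
(0, 6)  1
(0, 3)  1
(0, 8)  1
(1, 5)  4   <- for the modified corpus, the count "4" means that the word "second" appears four times in this document/sentence
(1, 1)  1
(1, 6)  1
(1, 3)  1
(1, 8)  1
(2, 4)  1
(2, 7)  1
(2, 0)  1
(2, 6)  1
(3, 1)  1
(3, 2)  1
(3, 6)  1
(3, 3)  1
(3, 8)  1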
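The comparison above can also be checked directly by asking for a single cell of the matrix. This is only a sketch; `count_in_document` is a hypothetical helper written for this example, not part of sklearn.

```python
from sklearn.feature_extraction.text import CountVectorizer


def count_in_document(corpus, word, doc_index):
    """Return how often `word` occurs in corpus[doc_index] according to CountVectorizer."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    column = vectorizer.vocabulary_[word]  # feature index of the word
    return int(X.toarray()[doc_index, column])


original = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
modified = list(original)
modified[1] = 'This is the second second second second document.'

original_count = count_in_document(original, 'second', 1)
modified_count = count_in_document(modified, 'second', 1)
print(original_count)  # 2
print(modified_count)  # 4
```

Only the value stored at (1, 5) changes between the two runs; the vocabulary, and therefore every feature index, stays the same because no new word was introduced.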

Answer 3

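It converts text into numbers. So, with other functions, you can count how many times each word occurs in a given dataset. I am new to programming, so there may be other areas where it can be used.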
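One possible way to get those per-word totals, sketched on the corpus from the question: sum each column of the count matrix and pair the result with the words in `vocabulary_`. The names `column_totals` and `totals` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Each column holds one word's counts across all sentences, so summing
# over the rows gives the total number of occurrences in the whole corpus.
column_totals = X.toarray().sum(axis=0)
totals = {
    word: int(column_totals[index])
    for word, index in vectorizer.vocabulary_.items()
}

# Print the words from most to least frequent.
for word, total in sorted(totals.items(), key=lambda item: -item[1]):
    print(word, total)
```

Converting to a dense array is fine for four short sentences; for a large corpus it would be better to sum the sparse matrix directly rather than materialise it.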

3 answers

Answer 1
2 2019-01-16 07:12:01

Answer 2
1 2019-01-24 16:07:01

Answer 3
0 2018-09-13 14:11:40
