How does the vectorizer's fit_transform work in sklearn?

Question

I am trying to understand the following code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['This is the first document.', 'This is the second second document.', 'And the third one.', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)
print(X)

When I print X to see what is returned, I get the following result:

(0, 1)  1
(0, 2)  1
(0, 6)  1
(0, 3)  1
(0, 8)  1
(1, 5)  2
(1, 1)  1
(1, 6)  1
(1, 3)  1
(1, 8)  1
(2, 4)  1
(2, 7)  1
(2, 0)  1
(2, 6)  1
(3, 1)  1
(3, 2)  1
(3, 6)  1
(3, 3)  1
(3, 8)  1

However, I don't understand what this result means.

Answer 1

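You can read each line as "(sentence_index, feature_index) count".

There are 4 sentences, so the sentence index starts at 0 and ends at 3.

The feature index is the word index, which you can get from vectorizer.vocabulary_

-> vocabulary_ is a dictionary {word: feature_index, ...}

So for the example (0, 1) 1

-> 0 : row [the sentence index]

-> 1 : the feature index; vocabulary_ maps word -> index, so look for the word whose value in vectorizer.vocabulary_ is 1 (here, "document")

-> 1 : count/tfidf (as you have used a count vectorizer, it will give you the count)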
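The reading described above can be sketched in code. This is a minimal illustration rather than anything from the original answer: the names `index_to_word` and `decoded` are made up for the example, and the inverse dictionary is built by hand from `vocabulary_` so that it does not depend on a particular scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> feature index; invert it to go index -> word.
# (index_to_word is a hypothetical helper name used only in this sketch.)
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# Walk the non-zero entries of the sparse matrix as (row, column, count).
coo = X.tocoo()
decoded = [
    (int(row), index_to_word[int(col)], int(count))
    for row, col, count in zip(coo.row, coo.col, coo.data)
]

for row, word, count in decoded:
    print(f"sentence {row}: {word!r} appears {count} time(s)")
```

Each printed line corresponds to one "(sentence_index, feature_index) count" entry, with the feature index replaced by the word it stands for.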

If you use a tfidf vectorizer instead of a count vectorizer, it will give you tfidf values rather than counts. I hope that makes it clear.
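As a rough illustration of that last point, the sketch below runs both vectorizers on the same corpus with their default settings. The variable names (`counts`, `tfidf`, `same_positions`, `row_norms`) are chosen for this example only. It assumes the default TfidfVectorizer options, under which each row is L2-normalised.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

counts = CountVectorizer().fit_transform(corpus)
tfidf = TfidfVectorizer().fit_transform(corpus)

# Same (sentence_index, feature_index) positions, but float weights
# instead of integer counts.
print(tfidf)

# The non-zero positions match between the two matrices.
same_positions = bool(((counts.toarray() > 0) == (tfidf.toarray() > 0)).all())

# With the default settings each row of the tfidf matrix has unit L2 norm.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
```

The matrix layout is unchanged; only the stored value at each position differs.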

Answer 2

As @Himanshu wrote, this is a "(sentence_index, feature_index) count".

Here, the count part is "the number of times the word appears in the document".

For example,

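(0, 1)  1
(0, 2)  1
(0, 6)  1
(0, 3)  1
(0, 8)  1
(1, 5)  2   <- for this example only, the count "2" means that the word "second" appears twice in this document/sentence
(1, 1)  1
(1, 6)  1
(1, 3)  1
(1, 8)  1
(2, 4)  1
(2, 7)  1
(2, 0)  1
(2, 6)  1
(3, 1)  1
(3, 2)  1
(3, 6)  1
(3, 3)  1
(3, 8)  1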

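Let's change the corpus in the code. Basically, I added the word "second" two more times to the second sentence of the corpus list, so it now appears four times.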

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['This is the first document.', 'This is the second second second second document.', 'And the third one.', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)
print(X)

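(0, 1)  1
(0, 2)  1
(0, 6)  1
(0, 3)  1
(0, 8)  1
(1, 5)  4   <- for the modified corpus, the count "4" means that the word "second" appears four times in this document/sentence
(1, 1)  1
(1, 6)  1
(1, 3)  1
(1, 8)  1
(2, 4)  1
(2, 7)  1
(2, 0)  1
(2, 6)  1
(3, 1)  1
(3, 2)  1
(3, 6)  1
(3, 3)  1
(3, 8)  1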
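The same information may be easier to read as a dense table, with one row per sentence and one column per word. The sketch below is one possible way to do that for the modified corpus; `words` and `dense` are names introduced here, and `toarray()` is only sensible for a corpus this small.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Order the words by feature index so they line up with the matrix columns.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Convert the sparse matrix to a regular 2-D array (fine for tiny examples).
dense = X.toarray()

print(words)
for row in dense:
    print(row.tolist())
```

In the second row, the column for "second" holds 4, which is the (1, 5) 4 entry from the sparse printout.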

Answer 3

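It converts text to numbers. So, with other functions, you will be able to count how many times each word occurs in a given dataset. I am new to programming, so maybe there are other areas where it can be used as well.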
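One way to get those per-word totals, sketched here under the assumption that summing the columns of the count matrix is what is meant, is shown below. The names `totals` and `word_totals` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Sum each column (one column per word) over all sentences.
# np.asarray(...).ravel() flattens the 1 x n_features result to a 1-D array.
totals = np.asarray(X.sum(axis=0)).ravel()

# Pair every word with its total count across the whole corpus.
word_totals = {
    word: int(totals[index])
    for word, index in vectorizer.vocabulary_.items()
}

for word, total in sorted(word_totals.items()):
    print(word, total)
```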
