Gensim question: creating a corpus from a dictionary

Question

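I am new to Gensim. I am learning it by following the examples here: https://www.machinelearningplus.com/nlp/gensim-tutorial/

I am unsure about the last line, which creates the corpus from the dictionary. When creating the dictionary, we already used simple_preprocess to process "documents" line by line. When the corpus is then created from the dictionary, simple_preprocess is applied to "documents" line by line a second time. Is that redundant?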

from gensim import corpora
from gensim.utils import simple_preprocess

documents = ["This is the first line",
             "This is the second sentence",
             "This third document"]

# Create the Dictionary and Corpus
mydict = corpora.Dictionary([simple_preprocess(line) for line in documents])
# Why do we need to call simple_preprocess on the documents again here, when
# the previous call already created the dictionary using simple_preprocess on documents?
corpus = [mydict.doc2bow(simple_preprocess(line)) for line in documents]

Thanks,

Alex

Answer 1

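A Dictionary object maps each unique word in the corpus to an integer id, while doc2bow() uses that dictionary to convert one tokenized document into a bag-of-words (BoW) vector: a list of (token id, count) pairs. The dictionary stores only the word-to-id mapping, not the tokenized documents, so doc2bow() still needs each document's tokens. The second simple_preprocess call is therefore not redundant, although you can tokenize once and reuse the result.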

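In my opinion, it is better to use CountVectorizer from scikit-learn to build the bag-of-words representation, because it comes with some useful parameters that are not present in Gensim's implementation, such as min_df and max_df (see here).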
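As a rough illustration of that suggestion, the sketch below applies CountVectorizer to the same three documents. The threshold values min_df=2 and max_df=0.9 are arbitrary choices made for this example, not recommendations from the answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["This is the first line",
             "This is the second sentence",
             "This third document"]

# min_df=2: drop terms that occur in fewer than 2 documents.
# max_df=0.9: drop terms that occur in more than 90% of the documents.
# Both values are assumptions chosen only to show the effect on this tiny corpus.
vectorizer = CountVectorizer(min_df=2, max_df=0.9)
bow = vectorizer.fit_transform(documents)

# vocabulary_ maps each kept term to its column index in the count matrix.
print(sorted(vectorizer.vocabulary_))
print(bow.toarray())
```

With these thresholds, "this" is dropped because it occurs in every document, and words that occur in only one document are dropped as well. Note that Gensim's Dictionary also provides a filter_extremes method with no_below and no_above arguments, which gives comparable frequency-based filtering if you prefer to stay within Gensim.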
