
Python Scikit-learn: Empty Vocabulary in TF-IDF

Original post: 2013-05-22 01:53:42. Tags: python, scipy, scikit-learn, tf-idf

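I am using the code from the top-voted answer to the question "Similarity between two text documents" to compute the TF-IDF between documents. However, when I run the code without specifying a custom value for min_df (it is 1 in that code), and the two documents are completely different (so that they have no words in common), I get the following error instead of a TF-IDF value of 0: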

ValueError: empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low).

Can someone tell me how to get rid of this error?

1 Answer

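By default (in sklearn <= 0.13), min_df is set to min_df=2, which means a word must occur in at least 2 different documents of the corpus to be included in the vectorizer's vocabulary. That is a reasonable choice for a large corpus, but it is too restrictive for anything to be included from a toy dataset of only a few sentences. With two documents that share no words, no term passes the threshold, the vocabulary ends up empty, and you get the rather explicit error message above. The fix is to pass min_df=1 explicitly. In the development branch of scikit-learn, the default was changed from min_df=2 to min_df=1 to reduce confusion for new users who try the library with default parameter values on toy datasets.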
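A minimal sketch of the behaviour described above, assuming a reasonably recent scikit-learn install. The two sample documents are made up for illustration, and the exact wording of the error message differs between scikit-learn versions, so the code only checks that a ValueError is raised.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two illustrative documents with no words in common (made-up sample data).
docs = ["the quick brown fox", "lazy dogs sleep all day"]

# min_df=2 (the old default): every term must occur in at least 2 documents.
# No term qualifies here, so fitting fails with a ValueError.
try:
    TfidfVectorizer(min_df=2).fit_transform(docs)
    error = None
except ValueError as exc:
    error = str(exc)

# min_df=1: every term that occurs in at least one document is kept.
tfidf = TfidfVectorizer(min_df=1).fit_transform(docs)

# Rows are L2-normalised by default, so the product of the matrix with its
# transpose gives pairwise cosine similarities.
similarity = (tfidf @ tfidf.T).toarray()

print(error)
print(similarity)
```

With min_df=1 the off-diagonal similarity for these two documents is 0, which is the result the question expected for documents that share no words.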

