How to use TfidfVectorizer in two steps, increasing the number of analyzed texts?

Question

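I am working on a text classification problem in Python 3 using sklearn.

I am doing the following steps:

Clean all the texts used to train the classifier
Extract features from the training texts and vectorize them with TfidfVectorizer
Train the classifier (RandomForestClassifier)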

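This works well. Now, when I get a new text that I want to classify, what is the best way to handle it? I understand that the Tfidf method also looks at how often features occur across the rest of the dataset, which is why I currently apply TfidfVectorizer to the old dataset plus the new text. But is there a way to do this incrementally, so that once the training set has been processed it is not touched again? Would that make sense?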

Thanks in advance for your help! Luca

Answer 1

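The vectorizer is fit to the corpus of documents you pass it. Typically, if you are working with a large corpus of documents, you first fit the vectorizer to the entire corpus. This lets the vectorizer correctly assess the frequency of terms across the documents and apply the min_df, max_df and max_features parameters appropriately. Once the vectorizer has been fit, you can simply transform a document to extract its tfidf vector. (That document does not have to be in the training corpus.)

For example: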

from nltk import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# word_tokenize requires NLTK's tokenizer data to be downloaded first.
class Tokenizer(object):
    """Callable that tokenizes a document and stems every token."""
    def __init__(self):
        self.stemmer = PorterStemmer()
    def __call__(self, doc):
        return [self.stemmer.stem(w) for w in word_tokenize(doc)]

# Pass a Tokenizer instance, not the class itself.
tfidf = TfidfVectorizer(stop_words='english', max_features=500, lowercase=True, tokenizer=Tokenizer())
# raw_docs can be any collection of documents, e.g. a list, a generator, etc.
raw_docs = ['The quick red fox jumped over the lazy brown dog', 'Carlos made a quick jumping catch in the game last night', 'How much wood would a woodchuck chuck if a woodchuck could chuck wood']

# Fit on the first sentence only.
tfidf.fit(raw_docs[:1])
tfidf.vocabulary_
# {'quick': 5, 'red': 6, 'fox': 2, 'jump': 3, 'lazi': 4, 'brown': 0, 'dog': 1}
# Notice that only terms from the first sentence are in the vocabulary.
tfidf.transform(raw_docs[1:2]).todense()
# matrix([[0.        , 0.        , 0.        , 0.70710678, 0.        ,
#          0.70710678, 0.        ]])
# Vectorizing the second sentence only gives scores for 'jump' and 'quick'.

# Now fit on the whole corpus.
tfidf.fit(raw_docs)
tfidf.vocabulary_
# {'quick': 10, 'red': 11, 'fox': 5, 'jump': 7, 'lazi': 8, 'brown': 0, 'dog': 4,
#  'carlo': 1, 'catch': 2, 'game': 6, 'night': 9, 'wood': 12, 'woodchuck': 13,
#  'chuck': 3}
# Notice that the vocabulary now holds terms from every sentence.
tfidf.transform(raw_docs[1:2]).todense()
# matrix([[0.        , 0.44036207, 0.44036207, 0.        , 0.        ,
#          0.        , 0.44036207, 0.3349067 , 0.        , 0.44036207,
#          0.3349067 , 0.        , 0.        , 0.        ]])
# We now have twice the features (14 vs. 7), and the vector captures every term in the second sentence.
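
To tie this back to the steps in the question, here is a minimal sketch of the whole workflow: fit the vectorizer once on the training texts, train the classifier, and later call only transform() on new text. The training texts, the labels ("animals", "sports") and the classifier settings are made up for illustration; they are assumptions, not part of the original problem.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy training data; the labels are invented for this sketch.
train_texts = [
    "The quick red fox jumped over the lazy brown dog",
    "Carlos made a quick jumping catch in the game last night",
    "The brown dog chased the red fox",
    "The team won the game with a late catch",
]
train_labels = ["animals", "sports", "animals", "sports"]

# Step 1: fit the vectorizer once, on the training texts only.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
vocab_size = len(vectorizer.vocabulary_)

# Step 2: train the classifier on the training vectors.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, train_labels)

# Later: a new text arrives. Only transform() is called, so the
# vocabulary and IDF weights learned in step 1 are left untouched.
new_texts = ["A lazy dog slept while the fox ran past"]
X_new = vectorizer.transform(new_texts)
prediction = clf.predict(X_new)

print(X_new.shape)  # one row, same number of columns as X_train
print(prediction)
```

Words in the new text that the vectorizer never saw during fitting are simply ignored by transform(), so the feature space the classifier was trained on stays the same.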

Answer 1
0 Accepted 2019-06-06 15:56:52
