
How do I incorporate SelectKBest in an SKlearn pipeline

I am trying to build a text classifier with sklearn. The idea is to:

  1. Vectorize the training corpus with TfidfVectorizer
  2. Use SelectKBest to keep the top 20,000 features it produces (or all of the features if fewer than 20k come out)
  3. Feed those features into a LogisticRegression classifier

I have set this up successfully as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(df_train["input"])
selector = SelectKBest(f_classif, k=min(20000, x_train.shape[1]))
selector.fit(x_train, df_train["label"].values)
x_train = selector.transform(x_train)
classifier = LogisticRegression()
classifier.fit(x_train, df_train["label"])
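
This setup works, but applying it to new text means pushing the data through each fitted step by hand. A minimal sketch of that, assuming a hypothetical df_test with the same columns as df_train:

x_test = vectorizer.transform(df_test["input"])  # reuse the fitted tf-idf vocabulary
x_test = selector.transform(x_test)              # keep only the selected features
predictions = classifier.predict(x_test)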

I now want to package all of this into a Pipeline and share that pipeline so that others can use it on their own text data. However, I can't figure out how to make SelectKBest reproduce the behaviour above, i.e. accept min(20000, n_features from the vectorizer output) as k. If I simply leave it at k=20000, as below, the pipeline does not work (it raises an error) when fitted on a new corpus that yields fewer than 20k vectorized features.

pipe = Pipeline([
            ("vect",TfidfVectorizer()),
            ("selector",SelectKBest(f_classif, k=20000)),
            ("clf",LogisticRegression())])

As @vivek kumar pointed out, you need to override SelectKBest's _check_params method and add your own logic to it, as shown below:

import warnings

from sklearn.feature_selection import SelectKBest

class MySelectKBest(SelectKBest):
    def _check_params(self, X, y):
        # if k exceeds the number of available features, fall back to using all of them
        if self.k >= X.shape[1]:
            warnings.warn("Less than %d number of features found, so setting k as %d" % (self.k, X.shape[1]),
                          UserWarning)
            self.k = X.shape[1]
        # keep the stock validation for the remaining cases
        if not (self.k == "all" or 0 <= self.k):
            raise ValueError("k should be >=0, <= n_features = %d; got %r. "
                             "Use k='all' to return all features."
                             % (X.shape[1], self.k))

I have also added a warning for the case where the number of features found is lower than the threshold that was set. Now let's look at a working example of the same:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import warnings

categories = ['alt.atheism', 'comp.graphics',
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'comp.windows.x', 'misc.forsale', 'rec.autos']
newsgroups = fetch_20newsgroups(categories=categories)
y_true = newsgroups.target

# the newsgroups above yield roughly 47K features after TF-IDF vectorization

# Case 1: When K < No. of features - the regular case
pipe = Pipeline([
            ("vect",TfidfVectorizer()),
            ("selector",MySelectKBest(f_classif, k=30000)),
            ("clf",LogisticRegression())])

pipe.fit(newsgroups.data, y_true)
pipe.score(newsgroups.data, y_true)
#0.968

# Case 2: When K > No. of features - the one with an issue

pipe = Pipeline([
            ("vect",TfidfVectorizer()),
            ("selector",MySelectKBest(f_classif, k=50000)),
            ("clf",LogisticRegression())])

pipe.fit(newsgroups.data, y_true)
# UserWarning: Less than 50000 number of features found, so setting k as 47407

pipe.score(newsgroups.data, y_true)
#0.9792
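
Since the idea is to share the fitted pipeline so others can apply it to their own text, one option is to persist it as a single object, for example with joblib (the file name and sample text below are just placeholders):

import joblib

joblib.dump(pipe, "text_classifier_pipeline.joblib")   # save the whole fitted pipeline

loaded_pipe = joblib.load("text_classifier_pipeline.joblib")
loaded_pipe.predict(["a post about cheap graphics cards for sale"])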

Hope this helps!
