在 scikit-learn 管道中使用 Word2Vec

Question

我正在嘗試在這個數據樣本上運行 w2v

Statement              Label
Says the Annies List political group supports third-trimester abortions on demand.       FALSE
When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration.         TRUE
"Hillary Clinton agrees with John McCain ""by voting to give George Bush the benefit of the doubt on Iran."""     TRUE
Health care reform legislation is likely to mandate free sex change surgeries.    FALSE
The economic turnaround started at the end of my term.     TRUE
The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades.    TRUE
Jim Dunnam has not lived in the district he represents for years now.    FALSE

使用此 GitHub 文件夾 (FeatureSelection.py) 中提供的代碼：

https://github.com/nishitpatel01/Fake_News_Detection

我想在我的 Naive Bayes model 中包含 word2vec 功能。 首先我考慮了 X 和 y 並使用了 train_test_split：

X = df['Statement']
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

dataset = pd.concat([X_train, y_train], axis=1)

這是我目前使用的代碼：

#Using Word2Vec 
with open("glove.6B.50d.txt", "rb") as lines:
    w2v = {line.split()[0]: np.array(map(float, line.split()[1:]))
           for line in lines}

training_sentences = DataPrep.train_news['Statement']

model = gensim.models.Word2Vec(training_sentences, size=100) # x be tokenized text
w2v = dict(zip(model.wv.index2word, model.wv.syn0))


class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(word2vec.itervalues().next())

    def fit(self, X, y): # what are X and y?
        return self

    def transform(self, X): # should it be training_sentences?
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])


"""
class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(word2vec.itervalues().next())
    def fit(self, X, y):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of 
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
        return self
    def transform(self, X):
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])
"""

在classifier.py中，我正在運行

nb_pipeline = Pipeline([
        ('NBCV',FeaturesSelection.w2v),
        ('nb_clf',MultinomialNB())])

但是，這不起作用，我收到此錯誤：

TypeError                                 Traceback (most recent call last)
<ipython-input-14-07045943a69c> in <module>
      2 nb_pipeline = Pipeline([
      3         ('NBCV',FeaturesSelection.w2v),
----> 4         ('nb_clf',MultinomialNB())])

/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75 

/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in __init__(self, steps, memory, verbose)
    112         self.memory = memory
    113         self.verbose = verbose
--> 114         self._validate_steps()
    115 
    116     def get_params(self, deep=True):

/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _validate_steps(self)
    160                                 "transformers and implement fit and transform "
    161                                 "or be the string 'passthrough' "
--> 162                                 "'%s' (type %s) doesn't" % (t, type(t)))
    163 
    164         # We allow last estimator to be None as an identity transformation

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '{' ': array([-0.17019527,  0.32363772, -0.0770281 , -0.0278154 , -0.05182227, ....

我正在使用該文件夾中的所有程序，因此如果您使用它們，代碼可以重現。

如果您能向我解釋如何修復它以及代碼中需要進行哪些其他更改，那就太好了。 我的目標是將模型（朴素貝葉斯、隨機森林……）與 BoW、TF-IDF 和 Word2Vec 進行比較。

更新：

在下面的答案（來自伊斯梅爾）之后，我更新了代碼如下：

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size

和

#building Linear SVM classfier
svm_pipeline = Pipeline([
        ('svmCV',FeaturesSelection_W2V.MeanEmbeddingVectorizer(FeaturesSelection_W2V.w2v)),
        ('svm_clf',svm.LinearSVC())
        ])

svm_pipeline.fit(DataPrep.train_news['Statement'], DataPrep.train_news['Label'])
predicted_svm = svm_pipeline.predict(DataPrep.test_news['Statement'])
np.mean(predicted_svm == DataPrep.test_news['Label'])

但是，我仍然遇到錯誤。

Answer 1

步驟 1. MultinomialNB FeaturesSelection.w2v是一個dict ，它沒有fit或fit_transform函數。 MultinomialNB也需要非負值，所以它不起作用。 所以我決定添加一個預處理階段來規范化負值。

from sklearn.preprocessing import MinMaxScaler

nb_pipeline = Pipeline([
        ('NBCV',MeanEmbeddingVectorizer(FeatureSelection.w2v)),
        ('nb_norm', MinMaxScaler()),
        ('nb_clf',MultinomialNB())
    ])

... 代替

nb_pipeline = Pipeline([
        ('NBCV',FeatureSelection.w2v),
        ('nb_clf',MultinomialNB())
    ])

第 2 步。我在word2vec.itervalues().next()上遇到錯誤。 因此，我決定使用與Word2Vec大小相同的預定義來更改尺寸形狀。

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size

... 代替

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(word2vec.itervalues().next())

在 scikit-learn 管道中使用 Word2Vec

問題描述

1 個解決方案

解決方案1
1 已采納 2021-01-01 22:46:28

在 scikit-learn 管道中使用 Word2Vec

問題描述

1 個解決方案

解決方案1 1 已采納 2021-01-01 22:46:28

解決方案1
1 已采納 2021-01-01 22:46:28