
Sklearn + Gensim: How to use Gensim's Word2Vec embedding for Sklearn text classification

I am building a multilabel text classification program and I am trying to use OneVsRestClassifier+XGBClassifier to classify the text. Initially I used Sklearn's Tf-Idf Vectorization to vectorize the texts, which worked without error. Now I am using Gensim's Word2Vec to vectorize the texts. When I feed the vectorized data into the OneVsRestClassifier+XGBClassifier, however, I get the following error on the line where I split the test and training data:

TypeError: Singleton array array(, dtype=object) cannot be considered a valid collection.

I have tried converting the vectorized data into a feature array (np.array), but that hasn't seemed to work. Below is my code:

x = np.array(Word2Vec(textList, size=120, window=6, min_count=5, workers=7, iter=15))

vectorizer2 = MultiLabelBinarizer()
vectorizer2.fit(tagList)
y = vectorizer2.transform(tagList)

# Split test data and convert test data to arrays
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.20)

The variables textList and tagList are lists of strings (the textual descriptions I am trying to classify and their corresponding tags).

x here becomes a numpy array conversion of the gensim.models.word2vec.Word2Vec object itself -- it is not actually the word2vec representations of textList that are returned.
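The failure can be reproduced without gensim at all: calling np.array on any non-iterable Python object produces a zero-dimensional object array, which is exactly the "singleton array" that train_test_split rejects. A minimal sketch (the Opaque class here is just a stand-in for the trained Word2Vec model object):

```python
import numpy as np

class Opaque:
    """Stand-in for a non-iterable object such as a trained Word2Vec model."""
    pass

# np.array wraps the object itself instead of converting its contents,
# producing a 0-dimensional array of dtype=object:
arr = np.array(Opaque())
print(arr.shape)  # ()
print(arr.dtype)  # object

# Passing a 0-d array like this to sklearn's train_test_split raises
# "TypeError: Singleton array ... cannot be considered a valid collection."
```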

Presumably, what you want to return is the corresponding vector for each word in a document (for a single vector representing each document, it would be better to use Doc2Vec).

For a set of documents in which the most verbose document contains n words, each document would then be represented by an n * 120 matrix (with shorter documents padded up to n rows).

Unoptimized code for illustrative purposes:

import numpy as np
from gensim.models import Word2Vec

model = Word2Vec(textList, size=120, window=6,
                 min_count=5, workers=7, iter=15)

documents = []
for document in textList:
    word_vectors = []
    for word in document.split(' '):  # or your logic for separating tokens
        if word in model.wv:          # words below min_count have no vector
            word_vectors.append(model.wv[word])
    if word_vectors:                  # skip documents with no in-vocabulary words
        # one (num_words, 120) -- that is, `Word2Vec:size` -- matrix per document
        documents.append(np.vstack(word_vectors))

# stacking the per-document matrices yields a (total_words, 120) array
document_matrix = np.concatenate(documents)
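Since the classifier ultimately needs one fixed-length row per document, a common alternative to padding is to average the word vectors of each document. A self-contained sketch (the wv dict below is a stand-in for model.wv, and documents with no in-vocabulary words fall back to a zero vector):

```python
import numpy as np

SIZE = 120  # matches Word2Vec's `size` parameter above

# Stand-in for model.wv: maps each in-vocabulary word to a 120-dim vector.
wv = {"cat": np.ones(SIZE), "dog": np.full(SIZE, 2.0)}

def document_vector(document, wv, size=SIZE):
    """Average the vectors of in-vocabulary words; zeros if none match."""
    vectors = [wv[word] for word in document.split(' ') if word in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(size)

textList = ["cat dog", "cat", "zebra"]
x = np.vstack([document_vector(doc, wv) for doc in textList])
print(x.shape)  # (3, 120) -- one row per document
```

An x of this shape can be passed straight to train_test_split and then to OneVsRestClassifier, because every document now occupies exactly one row.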
