
Sklearn classifier can't be trained with Gensim Word2Vec data

I am building a program that assigns multiple labels/tags to textual descriptions. I am using Scikit-Learn's OneVsRestClassifier+XGBClassifier to classify the vectorized textual descriptions. I am using Gensim's Word2Vec to vectorize the texts. However, when I try to fit the classifier to the vectorized data, I get the following error:

IndexError: tuple index out of range

Below is my code (the error happens on the last line where I try to fit the classifier):

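import multiprocessing

import numpy as np
from gensim.models import Word2Vec
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from xgboost import XGBClassifier

w2vModel = Word2Vec(sentences, size=150, window=10, min_count=2, workers=multiprocessing.cpu_count())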
modelCorpus = list(w2vModel.wv.vocab)

descriptions = []
for sentence in sentences:
    wordList = []
    for word in sentence: 
        if (word in modelCorpus):
            wordList.append(w2vModel.wv[word])
    descriptions.append(np.concatenate(wordList))

x = np.array(descriptions)

# Vectorize ticket labels/tags using MultiLabelBinarizer
tagList = relevantDF.Tags # Retrieve list of tags
vectorizer2 = MultiLabelBinarizer()
vectorizer2.fit(tagList)
y = vectorizer2.transform(tagList)

# Split into train/test sets and convert the training labels to a dense array
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.20)
yTrain = csr_matrix(yTrain).toarray()

# Fit OneVsRestClassifier w/ XGBClassifier
clf = OneVsRestClassifier(XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.003))
clf.fit(xTrain, yTrain)

The shape of x is: (8347,)

The shape of y is: (8347, 24)

The shape of xTrain is: (6677,)

The shape of yTrain is: (6677, 24)

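The likely cause is that the samples do not have a fixed number of features. Because the word vectors are concatenated, a longer sentence produces a longer vector than a shorter one, so x ends up as a 1-D array of variable-length arrays (shape (8347,)) instead of a 2-D array of shape (n_samples, n_features), which is what the classifier expects. Instead of concatenating the word vectors, you could average them: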

descriptions.append(np.mean( np.array(wordList), axis=0 ))
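To make the difference concrete, here is a minimal sketch that compares concatenation with averaging. It does not use Gensim: the `word_vectors` dict, the 4-dimensional vectors, and the toy sentences are made-up stand-ins for `w2vModel.wv` and the real data. The zero-vector fallback for sentences with no known words is also an assumption, not something from the question.

```python
import numpy as np

# Hypothetical stand-in for w2vModel.wv: a dict mapping words to 4-dimensional vectors.
rng = np.random.default_rng(0)
dim = 4
word_vectors = {w: rng.normal(size=dim)
                for w in ["printer", "is", "broken", "login", "fails"]}

# Toy tokenized descriptions; the last one contains no in-vocabulary word.
sentences = [["printer", "is", "broken"], ["login", "fails"], ["unknownword"]]

def concat_vector(sentence):
    """Concatenate the word vectors: the length depends on the sentence length."""
    vecs = [word_vectors[w] for w in sentence if w in word_vectors]
    return np.concatenate(vecs) if vecs else np.zeros(0)

def mean_vector(sentence):
    """Average the word vectors: always returns a vector of length `dim`."""
    vecs = [word_vectors[w] for w in sentence if w in word_vectors]
    if not vecs:
        # Assumed fallback: a zero vector when no word is in the vocabulary.
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Concatenation gives a different number of features per sample.
concat_lengths = [concat_vector(s).shape[0] for s in sentences]

# Averaging gives one fixed-length row per sample, so the rows stack into a 2-D array.
x_mean = np.vstack([mean_vector(s) for s in sentences])

print(concat_lengths)  # one length per sentence, all different
print(x_mean.shape)    # (number of sentences, dim)
```

With averaged vectors, x has the shape (n_samples, n_features) and can be passed to train_test_split and clf.fit as in the question.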

Alternatively, have a look at Doc2Vec, which produces a single fixed-length vector per document (here, per description).


 