I am building a program that assigns multiple labels/tags to textual descriptions. I am using Scikit-Learn's OneVsRestClassifier wrapped around XGBoost's XGBClassifier to classify the vectorized descriptions, and Gensim's Word2Vec to vectorize the texts. However, when I try to fit the classifier to the vectorized data, I get the following error:
IndexError: tuple index out of range
Below is my code (the error happens on the last line where I try to fit the classifier):
import multiprocessing
import numpy as np
from gensim.models import Word2Vec
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from xgboost import XGBClassifier

w2vModel = Word2Vec(sentences, size=150, window=10, min_count=2, workers=multiprocessing.cpu_count())
modelCorpus = list(w2vModel.wv.vocab)

# Build one feature vector per description by concatenating its word vectors
descriptions = []
for sentence in sentences:
    wordList = []
    for word in sentence:
        if word in modelCorpus:
            wordList.append(w2vModel.wv[word])
    descriptions.append(np.concatenate(wordList))
x = np.array(descriptions)

# Vectorize ticket labels/tags using MultiLabelBinarizer
tagList = relevantDF.Tags  # Retrieve list of tags
vectorizer2 = MultiLabelBinarizer()
vectorizer2.fit(tagList)
y = vectorizer2.transform(tagList)

# Split test data and convert test data to arrays
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.20)
yTrain = csr_matrix(yTrain).toarray()

# Fit OneVsRestClassifier w/ XGBClassifier
clf = OneVsRestClassifier(XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.003))
clf.fit(xTrain, yTrain)
The shape of x is: (8347,)
The shape of y is: (8347, 24)
The shape of xTrain is: (6677,)
The shape of yTrain is: (6677, 24)
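That 1-D shape of x, i.e. (8347,) rather than (8347, n_features), is the symptom worth noticing. A minimal sketch of how it arises, using hypothetical all-ones stand-ins for the concatenated 150-dim word vectors (the sentence lengths here are made up):

```python
import numpy as np

# Two "descriptions" with different word counts: 3 words vs 5 words,
# each word a 150-dim vector, so the rows end up with 450 vs 750 features.
row_a = np.concatenate([np.ones(150)] * 3)
row_b = np.concatenate([np.ones(150)] * 5)

# Ragged rows cannot form a 2-D matrix; NumPy falls back to a
# 1-D array of objects, which downstream estimators cannot index as (n, m).
x = np.array([row_a, row_b], dtype=object)
print(x.shape)  # (2,) -- not (2, n_features)
```

An estimator expecting a 2-D feature matrix then fails when it tries to read a second dimension from the shape tuple, which is consistent with the IndexError above.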
My guess is that you do not have a fixed number of features per sample: a longer sentence adds more word vectors to descriptions than a shorter one, so the rows of x have different lengths. Instead of concatenating the word vectors, you could average them:
descriptions.append(np.mean(np.array(wordList), axis=0))
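A minimal sketch of why mean-pooling fixes the shape, using random vectors as stand-ins for the real Word2Vec output (the sentence lengths and the 150-dim size are assumptions taken from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins for Word2Vec vectors: sentences of 3 and 7 words, 150 dims each
sent_a = [rng.standard_normal(150) for _ in range(3)]
sent_b = [rng.standard_normal(150) for _ in range(7)]

descriptions = []
for wordList in (sent_a, sent_b):
    # Averaging collapses any number of word vectors into one 150-dim vector
    descriptions.append(np.mean(np.array(wordList), axis=0))

x = np.array(descriptions)
print(x.shape)  # (2, 150) -- fixed width, regardless of sentence length
```

Every row now has the same width, so x becomes a proper 2-D matrix that OneVsRestClassifier can fit on.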
Or have a look at Doc2Vec, which produces one vector per sentence.