Multi-class text classification - training a classifier with a TF/IDF vectorizer

I am fairly new to NLP, and we were given a multi-class text classification task in class. The data set contains the first pages of documents, and each document needs to be classified into one or more of 24 topics. Each document's text is one row in a table.

I tried to implement a TF/IDF vectorizer, and now the classifier returns the error "ValueError: setting an array element with a sequence."

Even though the message is trying to tell me exactly what is wrong, I can't wrap my head around it. I'm not sure what to do, or whether my TF/IDF step is correct. This got me thinking: the vectorizer will produce a different number of columns in the matrix for each entry, is that correct? How can a classifier work with that?

Here is my code:

import pandas as pd
import sklearn.model_selection as ms
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

X_train = pd.read_csv('train_values.csv', nrows=3, delimiter=',', engine='c')

#tokenize 
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
X_train['doc_text'] = X_train['doc_text'].apply(lambda x: tokenizer.tokenize(x.lower()))

#remove words
stopwords = set(stopwords.words('english'))
def remove_stopwords(text):
    words = [w for w in text if w not in stopwords]
    return words

#convert to list - otherwise vectorizer returns "'list' object has no attribute 'lower'"
X_train_list = X_train['doc_text'].tolist()

# compute TF/IDF
from sklearn.feature_extraction.text import TfidfVectorizer

X_train_n = []

for i in X_train_list:  
    vectorizer=TfidfVectorizer(use_idf=True)
    fitted_vectorizer=vectorizer.fit(i)
    vectorizer_vectors=fitted_vectorizer.transform(i)
    X_train_n.append(vectorizer_vectors)

y_train = pd.read_csv('train_labels.csv', nrows=3, delimiter=',', engine='c')
y_train_n = y_train.drop('row_id', axis=1)

y_train_n = y_train_n.to_numpy(dtype=bool).astype(int) # I tried this as a test

#build classifier
from sklearn.svm import LinearSVC
from sklearn import linear_model
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train_n, y_train_n)

  • How can I use a classifier with this vectorizer? Or is my implementation of the vectorizer wrong?

Any help is greatly appreciated.

This link explains why you would be getting that specific error.
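As a minimal illustration of where that message comes from (a sketch with made-up numbers, not the asker's data): NumPy raises it when it is asked to build a numeric array from rows of unequal length. The helper name `try_build_array` is hypothetical and only exists for this example.

```python
import numpy as np

def try_build_array(rows):
    """Return the ValueError message if rows cannot form a rectangular
    float array, or None if the conversion succeeds."""
    try:
        np.array(rows, dtype=float)
    except ValueError as exc:
        return str(exc)
    return None

# Two "documents" with a different number of feature columns each,
# which is what fitting a separate vectorizer per document produces.
ragged = [[0.1, 0.2, 0.3], [0.4, 0.5]]

# The same data padded to a common width converts without error.
rectangular = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.0]]

ragged_error = try_build_array(ragged)
rect_error = try_build_array(rectangular)

print(ragged_error)
print(rect_error)
```

The exact wording after the first sentence of the message may vary between NumPy versions.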

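So check the shape of your NumPy arrays and make sure they are the shape they are supposed to be: the classifier expects X to be a single 2-D matrix with one row per document and the same columns for every row, not a list of separately vectorized matrices.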
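One way to get a matrix of that shape, sketched here under assumptions: fit a single `TfidfVectorizer` on all documents at once, passing raw strings rather than token lists, so every document is projected onto the same vocabulary. The four toy documents and the two-topic label matrix below are invented for illustration, and scikit-learn's built-in `stop_words="english"` list stands in for the NLTK stop word list used in the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical): one raw string per document.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the garden",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

# Multi-label indicator matrix: one row per document, one column per topic.
# Columns here are (animals, finance); the real task would have 24 columns.
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

# Fit ONE vectorizer on the whole corpus. It lowercases and tokenizes
# internally, so no manual tokenization is needed beforehand.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, labels)

# New text must go through the SAME fitted vectorizer (transform, not fit),
# so it ends up with the same columns as the training matrix.
new_docs = ["cats and dogs"]
X_new = vectorizer.transform(new_docs)
pred = clf.predict(X_new)

print(X.shape, X_new.shape, pred.shape)
```

Because the vocabulary is learned once from the whole training set, every row has the same number of columns, and words in new documents that were never seen during fitting are simply ignored.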
