
How to classify a single text using classifier algorithms

I have a set of documents that has been clustered, so each document now has a label. I want to build a classifier from this, train and test it so it works correctly, and have it place a new document/text into the proper cluster. I used CountVectorizer to transform the documents into features; I know CountVectorizer builds its vocabulary from the unique words across all the documents I provide (more than 1000). Now I build a classifier (KNN or Naive Bayes), and I have a new text file or document that I need to transform into features. But if I fit CountVectorizer on a single document, it will only see a few words, so the resulting feature space will differ from that of the training and test documents, which will certainly give a wrong result. How do I use the same CountVectorizer object for every document I pass in? Is there any way? Kindly guide me; any suggestions or a way to do this?

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB

    def classifierNaiveBayes(self):
        # Training phase: fit the vectorizer ONCE, on the training documents only
        count_vectorizer = CountVectorizer(binary=True)
        train_documents = count_vectorizer.fit_transform(self.training_documents)
        classifier = BernoulliNB().fit(train_documents, self.training_labels)

        # Test phase: reuse the same fitted vectorizer via transform()
        count_wrong_predictions = 0
        for i in range(len(self.test_documents)):
            predicted_result = classifier.predict(count_vectorizer.transform([self.test_documents[i]]))[0]
            expected_result = self.test_labels[i]
            print("The predicted value is", predicted_result)
            print("The expected value is", expected_result)
            if predicted_result != expected_result:
                count_wrong_predictions += 1

        print("The percentage of prediction accuracy is",
              100 - (count_wrong_predictions / len(self.test_documents)) * 100)

I am using the same CountVectorizer for the test data as well, and hence the above code works.

Using CountVectorizer.transform is the right way to classify test documents. Vocabulary that appears only in your test set is simply ignored when you transform it with the vectorizer fitted on the training data. (Re-fitting the vectorizer on the test set would not make sense, since the model was trained on a different vocabulary.)

You can read more on how to fit sparse features here
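To make this concrete, here is a minimal sketch of fitting the vectorizer once and reusing the same object for any new document. The training sentences are made up for illustration, and joblib is just one common way to persist the fitted vectorizer alongside the model:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB
    import joblib

    # Hypothetical training data, for illustration only
    train_docs = ["the laptop is good", "the radio sound is bad"]
    train_labels = ["good", "bad"]

    # Fit the vectorizer ONCE, on the training documents only
    vectorizer = CountVectorizer(binary=True)
    X_train = vectorizer.fit_transform(train_docs)
    clf = BernoulliNB().fit(X_train, train_labels)

    # For a new document, call transform (NOT fit_transform): unseen words
    # are ignored and the feature space stays identical to training
    new_doc = ["the new laptop is good"]
    print(clf.predict(vectorizer.transform(new_doc)))

    # Persist both objects so the exact same vocabulary is reused later
    joblib.dump(vectorizer, "vectorizer.joblib")
    joblib.dump(clf, "classifier.joblib")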

Here I use a dataset with two columns, text and review, to classify the text. The text column contains sentences/phrases; the review column can contain good, bad, or neutral.

TF-IDF feature vectors are used to create the features.

Naive Bayes, logistic regression, random forest, XGBoost, a feed-forward neural network, and an LSTM are used to build the classifiers.

Here I show the basic steps of developing various algorithms for classifying sentences. To improve accuracy, more parameter tweaking needs to be done (a GridSearchCV sketch follows the accuracy results below).

The code has been developed in Python, in a Jupyter notebook.

import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
from nltk.corpus import stopwords

from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, ensemble
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split

import tensorflow
import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import *
from keras.preprocessing.text import Tokenizer

#Read data from a csv file. The file contains two columns: the first is the text column containing sentences, the second is the class or target.
#The problem is to build classifiers that learn the sentences and their corresponding classes,
#then use the model to predict the class of new test sentence(s).
doc = pd.read_csv("C:\\data.csv")
print("The head of the file looks as below:")
doc.head()

The head of the file looks as below:

                    text                                review
0   the laptop is good but it hangs                     bad
1   this tv is very fast in changing channels           good
2   the radio sound quality is same as the tv sound     neutral
3   i dont know the quality of this new radio           neutral
4   the laptop runs faster with 8 gb ram                good

#split the dataset into training and validation sets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(doc['text'], doc['review'], train_size=.6, stratify=doc['review'])

#the target column can contain bad, good, or neutral
#label-encode the target variable: fit on the training labels only, then reuse the same encoder for validation
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)

# create tf-idf feature vector. Word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(doc['text'])
xtrain_tfidf_word =  tfidf_vect.transform(train_x)
xvalid_tfidf_word =  tfidf_vect.transform(valid_x)

#train various ML models
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the classifier on the training features
    classifier.fit(feature_vector_train, label)
    # predict the labels of the validation dataset
    predictions = classifier.predict(feature_vector_valid)
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    # compare predictions against the global valid_y
    return metrics.accuracy_score(valid_y, predictions)


# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_word, train_y, xvalid_tfidf_word)
print ("NB, WordLevel TF-IDF: ", accuracy)

# Linear Classifier on Word Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_word, train_y, xvalid_tfidf_word)
print ("LR, WordLevel TF-IDF: ", accuracy)

# RF on Word Level TF IDF Vectors
accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf_word, train_y, xvalid_tfidf_word)
print ("RF, WordLevel TF-IDF: ", accuracy)

# Extereme Gradient Boosting on Word Level TF IDF Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf_word.tocsc(), train_y, xvalid_tfidf_word.tocsc())
print ("Xgb, WordLevel TF-IDF: ", accuracy)


#encode the target strings to integers with scikit-learn's LabelEncoder, then convert the integer vector to a one-hot encoding with Keras' to_categorical
#(train_y and valid_y are already integer-encoded above, so the main output here is the one-hot targets)
from sklearn.preprocessing import LabelEncoder
from keras import utils as np_utils
from keras.utils import to_categorical

#encode class values as integers, reusing a single encoder for both splits
encoder_train = LabelEncoder()
encoder_train.fit(train_y)
encoded_train_target = encoder_train.transform(train_y)
dummy_target_train = np_utils.to_categorical(encoded_train_target)

encoded_valid_target = encoder_train.transform(valid_y)
dummy_target_valid = np_utils.to_categorical(encoded_valid_target)

#bag-of-words features for the whole corpus, kept for inspection (not used by the models below)
cvec = CountVectorizer(stop_words='english')
cvec.fit(doc['text'])
dummyall_x = pd.DataFrame(cvec.transform(doc['text']).todense(), columns=cvec.get_feature_names())

#Accuracy of NB, LR, RF, and Xgb models:
#NB, WordLevel TF-IDF:  0.675
#LR, WordLevel TF-IDF:  0.675
#RF, WordLevel TF-IDF:  0.575
#Xgb, WordLevel TF-IDF:  0.65
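As noted at the start, more parameter tweaking would likely improve these numbers. A minimal GridSearchCV sketch for the logistic regression model (the grid values are illustrative only, not tuned recommendations):

from sklearn.model_selection import GridSearchCV

# Example grid only; values are placeholders to show the mechanics
param_grid = {"C": [0.1, 1.0, 10.0]}
grid = GridSearchCV(linear_model.LogisticRegression(max_iter=1000),
                    param_grid, cv=3, scoring="accuracy")
grid.fit(xtrain_tfidf_word, train_y)
print("Best params:", grid.best_params_)
print("Validation accuracy:", grid.score(xvalid_tfidf_word, valid_y))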

#Basic neural network

vectorizer = CountVectorizer(binary=True, stop_words=stopwords.words('english'), lowercase=True, min_df=1, max_df=0.9, max_features=5000)
X_train_onehot = vectorizer.fit_transform(train_x)

def baseline_model():
    model = Sequential()
    model.add(Dense(units = 10, activation = 'relu', input_dim = len(vectorizer.get_feature_names())))
    model.add(Dense(units=3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = baseline_model()
# train on the one-hot training features and validate on the transformed validation set
model.fit(X_train_onehot, dummy_target_train, epochs=2, batch_size=128, shuffle=True, verbose=1,
          validation_data=(vectorizer.transform(valid_x), dummy_target_valid))
scores = model.evaluate(vectorizer.transform(valid_x), dummy_target_valid, verbose=1)
print("Accuracy:", scores[1])

#Accuracy of neural network model:   
#Accuracy: 0.25


#LSTM and CNN.
#Sequence data have a 1-D spatial structure; a CNN can pick out position-invariant features for the target.
#The spatial features learned by the CNN are then read as a sequence by the LSTM.


#maximum number of words to keep in the vocabulary
MAX_NB_WORDS = 50000
#maximum number of tokens per sentence
MAX_SEQUENCE_LENGTH = 250
EMBEDDING_DIM= 100
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(doc['text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

X = tokenizer.texts_to_sequences(doc['text'].values)
X = keras.preprocessing.sequence.pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

Y = pd.get_dummies(doc['review']).values
print('Shape of label tensor:', Y.shape)

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.10)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
#CNN layers: pick out local n-gram features
model.add(SpatialDropout1D(0.2))
model.add(Conv1D(filters=100, kernel_size=10, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))

#LSTM layer reads the CNN feature maps as a sequence
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 2
batch_size = 64

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[tensorflow.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

plt.title('Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show();

plt.title('Accuracy')
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='test')
plt.legend()
plt.show();

#Accuracy of LSTM
#Accuracy: 0.400
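To tie this back to the original question: the same fitted tokenizer must be reused for any new sentence, exactly as with CountVectorizer. A minimal sketch (new_sentence is made up, and the label list assumes the alphabetical column order that pd.get_dummies produces: bad, good, neutral):

# Sketch: classify a new sentence with the trained LSTM,
# reusing the SAME fitted tokenizer from above
new_sentence = ["the new laptop is really fast"]    # hypothetical input
seq = tokenizer.texts_to_sequences(new_sentence)
padded = keras.preprocessing.sequence.pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = model.predict(padded)
labels = ['bad', 'good', 'neutral']                 # pd.get_dummies sorts alphabetically
print("Predicted class:", labels[pred.argmax(axis=-1)[0]])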
