
Text classification with an LSTM network and Keras: 0.0% accuracy

I have a CSV file with two columns:

category, description

There are 1030 categories in the file and only about 12,600 lines.

I need to train a text classification model on this data. I am using Keras with an LSTM model.

I found an article describing how to do binary classification and slightly modified it to handle multiple categories.

My code:

import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from numpy import array
from keras.preprocessing.text import one_hot
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing import sequence 
import keras

df = pd.read_csv('/tmp/input_data.csv')

#one hot encode your documents

# integer encode the documents
vocab_size = 2000
encoded_docs = [one_hot(d, vocab_size) for d in df['description']]

def load_data_from_arrays(strings, labels, train_test_split=0.9):
    data_size = len(strings)
    test_size = int(data_size - round(data_size * train_test_split))
    print("Test size: {}".format(test_size))

    print("\nTraining set:")
    x_train = strings[test_size:]
    print("\t - x_train: {}".format(len(x_train)))
    y_train = labels[test_size:]
    print("\t - y_train: {}".format(len(y_train)))

    print("\nTesting set:")
    x_test = strings[:test_size]
    print("\t - x_test: {}".format(len(x_test)))
    y_test = labels[:test_size]
    print("\t - y_test: {}".format(len(y_test)))

    return x_train, y_train, x_test, y_test


encoder = LabelEncoder()
categories = encoder.fit_transform(df['category'])
num_classes = np.max(categories) + 1
print('Categories count: {}'.format(num_classes))
#Categories count: 1030

X_train, y_train, x_test, y_test = load_data_from_arrays(encoded_docs, categories, train_test_split=0.8)

# Truncate and pad the review sequences 

max_review_length = 500 
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length) 
x_test = sequence.pad_sequences(x_test, maxlen=max_review_length) 

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

# Build the model 
embedding_vector_length = 32 
top_words = 10000

model = Sequential() 
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length)) 
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2)) 
model.add(Dense(num_classes, activation='softmax')) 
model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy']) 
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_8 (Embedding)      (None, 500, 32)           320000    
_________________________________________________________________
lstm_8 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_8 (Dense)              (None, 1030)              104030    
=================================================================
Total params: 477,230
Trainable params: 477,230
Non-trainable params: 0
_________________________________________________________________
None

#Train the model
model.fit(X_train, y_train, validation_data=(x_test, y_test), epochs=5, batch_size=64) 

Train on 10118 samples, validate on 2530 samples
Epoch 1/5
10118/10118 [==============================] - 60s 6ms/step - loss: 6.5086 - acc: 0.0019 - val_loss: 10.0911 - val_acc: 0.0000e+00
Epoch 2/5
10118/10118 [==============================] - 63s 6ms/step - loss: 6.3281 - acc: 0.0028 - val_loss: 10.8270 - val_acc: 0.0000e+00
Epoch 3/5
10118/10118 [==============================] - 63s 6ms/step - loss: 6.3120 - acc: 0.0024 - val_loss: 11.0078 - val_acc: 0.0000e+00
Epoch 4/5
10118/10118 [==============================] - 64s 6ms/step - loss: 6.2891 - acc: 0.0030 - val_loss: 11.8264 - val_acc: 0.0000e+00
Epoch 5/5
10118/10118 [==============================] - 69s 7ms/step - loss: 6.2559 - acc: 0.0032 - val_loss: 12.1625 - val_acc: 0.0000e+00

#Evaluate the model
scores = model.evaluate(x_test, y_test, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 0.00%

What mistake did I make when preparing the data? Why is the accuracy always 0?

I guess your vocab_size is way too low. If you are dealing with ordinary text, try 10,000 - 100,000 as a starting point.
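
As a quick check, you can count how many distinct words your data actually contains before picking a value. A minimal sketch, assuming the question's df (text_to_word_sequence is the same tokenization one_hot applies internally):

from keras.preprocessing.text import text_to_word_sequence

words = set()
for text in df['description']:
    words.update(text_to_word_sequence(text))
print('distinct words:', len(words))  # compare this to your vocab_size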

What one_hot does is use the hashing trick. That means all of your words are hashed into a space of 2000 indices. It does not simply mean that your dictionary is 2000 words long; it means every word is projected into this space, which causes a lot of collisions, where different words get the same index and are treated as equal by the LSTM.

Furthermore, you should take a look at the transformed text, just to get an understanding of what happens here. To do so, build a reverse lookup and transform all the indices back.
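
A minimal sketch of such a reverse lookup, assuming the question's df, vocab_size and encoded_docs (with the hashing trick the mapping is many-to-one, so a single index can map back to several words):

from collections import defaultdict
from keras.preprocessing.text import one_hot, text_to_word_sequence

index_to_words = defaultdict(set)
for text in df['description']:
    words = text_to_word_sequence(text)            # the same tokenization one_hot applies
    for word, idx in zip(words, one_hot(text, vocab_size)):
        index_to_words[idx].add(word)

# decode the first document; indices holding several words are hash collisions
print([sorted(index_to_words[i]) for i in encoded_docs[0]])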

As a further improvement, it is worthwhile to preprocess the text with common techniques like stemming and normalization, to use a proper vocabulary, or to discard the bag-of-words approach and use word embeddings.
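
For example, a minimal normalization sketch with NLTK's PorterStemmer (illustrative only; adjust the character class and the stemmer to your language):

import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def normalize(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)          # keep letters only; adjust for your language
    return ' '.join(stemmer.stem(w) for w in text.split())

df['description'] = df['description'].apply(normalize)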

from keras.preprocessing.text import one_hot, Tokenizer, hashing_trick

text1 = 'I love you'
text2 = 'you love I'

print('one_hot: ')
print(one_hot(text1, n=20))
print(one_hot(text2, n=20))
print('--------------------------------------')

print('Tokenizer: ')
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text1, text2])
print(tokenizer.word_index)
print(tokenizer.index_word)
print('--------------------------------------')

print('hashing_trick: ')
print(hashing_trick(text1, n=20))
print(hashing_trick(text2, n=20))
print('--------------------------------------')

out:
one_hot: 
[14, 7, 14]
[14, 7, 14]
--------------------------------------
Tokenizer: 
{'i': 1, 'love': 2, 'you': 3}
{1: 'i', 2: 'love', 3: 'you'}
--------------------------------------
hashing_trick: 
[14, 7, 14]
[14, 7, 14]
--------------------------------------

Run this a few more times and you will find that the results of one_hot and hashing_trick are not unique (and that different words collide on the same index). You should use Tokenizer to convert the text.
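
For the question's pipeline, that would look roughly like this (a sketch; the vocab_size and oov_token values are choices, not taken from the original code):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

vocab_size = 20000                                   # keep the 20,000 most frequent words
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(df['description'])
encoded_docs = tokenizer.texts_to_sequences(df['description'])
padded_docs = sequence.pad_sequences(encoded_docs, maxlen=max_review_length)

# the Embedding layer must cover every index the Tokenizer can emit, e.g.
# Embedding(vocab_size + 1, embedding_vector_length, input_length=max_review_length)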

I have put together end-to-end code with some inputs from my side and tested that it works on data like this. You can use it with your own data with no or minimal changes, as I have removed specifics and made it generic. At the end, I have also highlighted the points I worked on beyond the code you provided above.

Code

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
from nltk.tokenize import word_tokenize

def load_data_from_arrays(strings, labels, train_test_split=0.9):
    data_size = len(strings)
    test_size = int(data_size - round(data_size * train_test_split))
    print("Test size: {}".format(test_size))

    print("\nTraining set:")
    x_train = strings[test_size:]
    print("\t - x_train: {}".format(len(x_train)))
    y_train = labels[test_size:]
    print("\t - y_train: {}".format(len(y_train)))

    print("\nTesting set:")
    x_test = strings[:test_size]
    print("\t - x_test: {}".format(len(x_test)))
    y_test = labels[:test_size]
    print("\t - y_test: {}".format(len(y_test)))

    return x_train, y_train, x_test, y_test

# estimating the vocab length with the help of nltk
def get_vocab_length(strings):
    vocab = []
    for sent in strings:
        words = word_tokenize(sent)
        vocab.extend(words)
    vocab = list(set(vocab))
    vocab_length = len(vocab)
    return vocab_length

def clean_text(sent):
    
    # <your cleaning code here>
    # clean func 1
    # clean func 2
    # ...
    # clean func n

    return sent

# load input data
df = pd.read_csv('/tmp/input_data.csv')
strings = df['description'].values
labels = df['category'].values

clean_strings = [clean_text(sent) for sent in strings]

vocab_length = get_vocab_length(clean_strings)

# create onehot encodings of strings
encoded_docs = [one_hot(sent, vocab_length) for sent in clean_strings]  # encode the cleaned text, not the raw strings

# create onehot encodings of labels
ohe = OneHotEncoder()
categories = ohe.fit_transform(labels.reshape(-1,1)).toarray()

# split data
X_train, y_train, X_test, y_test = load_data_from_arrays(encoded_docs, categories, train_test_split=0.8)

# assuming max input to be not more than 512 words 
max_input_len = 512

# padding data
X_train = pad_sequences(X_train, maxlen=max_input_len, padding= 'post')
X_test = pad_sequences(X_test, maxlen=max_input_len, padding= 'post')

# setting embedding vector length
embedding_vector_length = 32

model = Sequential()
model.add(Embedding(vocab_length, embedding_vector_length, input_length=max_input_len, name= 'embedding') )
model.add(Flatten())
model.add(Dense(categories.shape[1], activation= 'softmax'))  # output size must equal the number of classes
model.compile('adam', loss= 'categorical_crossentropy', metrics= ['accuracy'])
model.summary()

# training the model
model.fit(X_train, y_train, epochs= 10, batch_size= 128, validation_split= 0.2, verbose= 1)

# evaluating the model
score = model.evaluate(X_test, y_test, verbose=0) 
print("Test Loss:", score[0])
print("Test Acc:", score[1])

Additional areas I have worked on

1. Text Cleaning

I created a function to clean the text. This is extremely important, as it removes unnecessary noise from the data; note that this step depends entirely on the type of data you have. To keep things simple, I left a clean_text placeholder in the code above where you can put your own cleaning code. It should take raw text in and return clean text. Some libraries you may want to look into are re, string, and emoji.
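
One possible clean_text body, as an illustrative sketch only (lowercasing, URL removal, punctuation stripping); replace or extend these steps with whatever your data needs:

import re
import string

def clean_text(sent):
    sent = sent.lower()
    sent = re.sub(r'https?://\S+', ' ', sent)                         # drop URLs
    sent = sent.translate(str.maketrans('', '', string.punctuation))  # drop punctuation
    sent = re.sub(r'\s+', ' ', sent).strip()                          # collapse whitespace
    return sent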

2. Estimating Vocab Size

If you have enough data, it is better to estimate the vocab size than to hard-code some number when passing it to the Keras one_hot function. I have created a basic get_vocab_length function using nltk's word_tokenize. You can use it as is or enhance it further for your data.
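
Note that word_tokenize needs NLTK's punkt tokenizer data. A quick sanity check of the helper, assuming punkt is not yet installed:

import nltk
nltk.download('punkt')   # one-time download of the tokenizer data

print(get_vocab_length(["the quick brown fox", "the lazy dog"]))   # 6 distinct tokens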

What Else?

You can work further on hyperparameter tuning and try a few different neural network designs.
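
For example, one alternative design to try (a sketch, not a tested recommendation) replaces the Flatten layer with a bidirectional LSTM over the embedded sequence:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

model = Sequential()
model.add(Embedding(vocab_length, embedding_vector_length, input_length=max_input_len))
model.add(Bidirectional(LSTM(128)))
model.add(Dropout(0.3))
model.add(Dense(categories.shape[1], activation='softmax'))
model.compile('adam', loss='categorical_crossentropy', metrics=['accuracy'])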

Final Words

It still may not work well, as that depends entirely on the quality and amount of data you have. There is a good chance you will not get good results even after trying everything if the data is of poor quality or there is very little of it.

I would then suggest you try transfer learning with pre-trained models like BERT, RoBERTa, etc. HuggingFace provides good support for state-of-the-art pre-trained models and is a good place to get started.
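
A minimal fine-tuning sketch with the HuggingFace transformers library (the model name, sequence length and training arguments here are placeholder choices, not part of the answer above):

import torch
from sklearn.preprocessing import LabelEncoder
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

labels = LabelEncoder().fit_transform(df['category'])
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased',
                                                           num_labels=len(set(labels)))

enc = tokenizer(list(df['description']), truncation=True, padding=True,
                max_length=128, return_tensors='pt')

class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='bert_out', num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=TextDataset(enc, labels),
)
trainer.train()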
