
Building n-grams for token level text classification

I am trying to classify multiclass data at the token level using scikit-learn. I already have a train and test split. The tokens occur in batches of the same class, e.g. the first 10 tokens belong to class 0, the next 20 to class 4, and so on. The data is in the following tab (\t) separated format:

-----------------
token       tag
-----------------
way          6
to           6
reduce       6
the          6
amount       6
of           6
traffic      6
   ....
public       2
transport    5
is           5
a            5
key          5
factor       5
to           5 
minimize     5
   ....

The data is distributed as follows:

                              Training Data                    Test Data
# Total:                        119490                          29699
# Class 0:                      52631                           13490
# Class 1:                      35116                           8625
# Class 2:                      17968                           4161
# Class 3:                      8658                            2088
# Class 4:                      3002                            800
# Class 5:                      1201                            302
# Class 6:                      592                             153

The code I am trying is:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE

if __name__ == '__main__':
    # reading Files
    train_df = pd.read_csv(TRAINING_DATA_PATH, names=['token', 'tag'], sep='\t').dropna().reset_index(drop=True)
    test_df = pd.read_csv(TEST_DATA_PATH, names=['token', 'tag'], sep='\t')

    # getting training and testing data
    train_X = train_df['token']
    test_X = test_df['token'].astype('U')
    train_y = train_df['tag']
    test_y = test_df['tag']  # keep the tags numeric so they match the training labels and the predictions

    # Naive-Bayes
    nb_pipeline = Pipeline([('vect', CountVectorizer()),        # Counts occurrences of each word
                            ('tfidf', TfidfTransformer()),      # Normalize the counts based on document length
                            ])
    f1_list = []
    cv = KFold(n_splits=5)
    for train_index, test_index in cv.split(train_X):
        train_text = train_X[train_index]
        train_label = train_y[train_index]
        val_text = train_X[test_index]
        val_y = train_y[test_index]
        vectorized_text = nb_pipeline.fit_transform(train_text)
        sm = SMOTE(random_state=42)
        train_text_res, train_y_res = sm.fit_resample(vectorized_text, train_label)  # fit_sample was renamed to fit_resample in imblearn
        print("\nTraining Data Class Distribution:")
        print(train_label.value_counts())
        print("\nRe-sampled Training Data Class Distribution:")
        print(pd.Series(train_y_res).value_counts())  # wrap in a Series in case fit_resample returns an array
        # clf = SVC(kernel='rbf', max_iter=1000, class_weight='balanced', verbose=1)
        clf = MultinomialNB()
        # clf = SGDClassifier(loss='log', penalty='l2', alpha=1e-3, max_iter=100, tol=None,
        #                    n_jobs=-1, verbose=1)
        clf.fit(train_text_res, train_y_res)
        predictions = clf.predict(nb_pipeline.transform(val_text))
        f1 = f1_score(val_y, predictions, average='macro')
        f1_list.append(f1)
    print(f1_list)
    pred = clf.predict(nb_pipeline.transform(test_X))
    print('F1-macro: %s' % f1_score(test_y, pred, average='macro'))  # y_true first, then y_pred

I want to build n-grams and add them as a feature so the model can understand the context better, but I am not sure how that would work, since testing is done at the token level again. How can I build and feed the n-grams to the model and then predict at the token level again for the test data?

Instead of:

nb_pipeline = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer())])

compute the counts and tf-idf for unigrams and bigrams in one step:

from sklearn.feature_extraction.text import TfidfVectorizer
nb_pipeline = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 2)))])

See the TfidfVectorizer docs for more.
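For completeness, here is a minimal sketch of how the single TfidfVectorizer pipeline slots into the existing cross-validation code (it assumes the variables train_text, train_label, val_text and test_X from the loop in the question, and omits the SMOTE step for brevity): fit the vectorizer on the training fold only, then transform the validation and test tokens with the same fitted vocabulary, so prediction still happens one token (one row) at a time.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# One vectorizer that does both counting and tf-idf weighting, on 1- and 2-grams
nb_pipeline = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 2)))])

# Learn the n-gram vocabulary and idf weights from the training fold only
vectorized_text = nb_pipeline.fit_transform(train_text)

clf = MultinomialNB()
clf.fit(vectorized_text, train_label)

# Validation and test tokens are transformed with the already fitted vocabulary,
# so each prediction is still made for a single token (one row of the matrix)
val_predictions = clf.predict(nb_pipeline.transform(val_text))
test_predictions = clf.predict(nb_pipeline.transform(test_X))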
