
Classification with n-grams

I want to train a sklearn classifier on n-gram features. Furthermore, I want to use cross-validation to find the best order of the n-grams. However, I am a bit stuck on how to fit all the pieces together.

For now, I have the following code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

text = ... # This is the input text. A list of strings
labels = ... # These are the labels of each sentence
# Find the optimal order of the ngrams by cross-validation
scores = pd.Series(index=range(1,6), dtype=float)
folds = KFold(n_splits=3)

for n in range(1,6):
    count_vect = CountVectorizer(ngram_range=(n,n), stop_words='english')
    X = count_vect.fit_transform(text)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42)
    clf = MultinomialNB()
    score = cross_val_score(clf, X_train, y_train, cv=folds, n_jobs=-1)
    scores.loc[n] = np.mean(score)

# Evaluate the classifier using the best order found
order = scores.idxmax()
count_vect = CountVectorizer(ngram_range=(order,order), stop_words='english')
X = count_vect.fit_transform(text)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42)
clf = MultinomialNB()
clf = clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print('Accuracy is {}'.format(acc))

However, I feel like this is the wrong way to do it, since I create a train-test split in every iteration of the loop.

If I do a train-test split beforehand and apply the CountVectorizer to both parts separately, then these parts have different shapes, which causes problems when using clf.fit and clf.score.
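For example, this is a minimal sketch of the mismatch I mean (reusing the text variable from above):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Split the raw strings first, then vectorize each part on its own
text_train, text_test = train_test_split(text, test_size=0.33, random_state=42)

count_vect = CountVectorizer(ngram_range=(1, 1), stop_words='english')
X_train = count_vect.fit_transform(text_train)  # vocabulary learned from the train part
X_test = count_vect.fit_transform(text_test)    # refits, so a *different* vocabulary

print(X_train.shape, X_test.shape)  # different numbers of columns -> clf.score fails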

How can I solve this?

EDIT: If I try to build a vocabulary first, I still have to build several vocabularies, since the vocabulary for unigrams is different from that of bigrams, etc.

To give an example:

# unigram vocab (here each sentence is a list of tokens)
vocab = set()
for sentence in text:
    for word in sentence:
        vocab.add(word)
len(vocab) # 47291

# bigram vocab
import nltk

vocab = set()
for sentence in text:
    for bigram in nltk.ngrams(sentence, 2):
        vocab.add(bigram)
len(vocab) # 326044

This again leads me to the same problem of needing to apply the CountVectorizer for every n-gram size.
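CountVectorizer shows the same effect directly on the raw strings from the first snippet; each n-gram size produces its own vocabulary (the sizes will not match my loops exactly because of its built-in tokenization):

from sklearn.feature_extraction.text import CountVectorizer

for n in range(1, 3):
    cv = CountVectorizer(ngram_range=(n, n)).fit(text)
    print(n, len(cv.vocabulary_))  # one vocabulary per n-gram size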

You need to set the vocabulary parameter first. One way or another you have to provide the entire vocabulary, otherwise the dimensions can never match (obviously). If you do the train/test split first, there may be words in one set that are not present in the other, and that is where your dimension mismatch comes from.
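A minimal sketch of that idea, reusing text and labels from the question (the fixed n here is just for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

n = 2  # whichever n-gram order you are testing

# Learn the vocabulary once, from the complete corpus
full_vocab = CountVectorizer(ngram_range=(n, n),
                             stop_words='english').fit(text).vocabulary_

text_train, text_test, y_train, y_test = train_test_split(
    text, labels, test_size=0.33, random_state=42)

# With a fixed vocabulary, both matrices share the same columns
vect = CountVectorizer(ngram_range=(n, n), stop_words='english',
                       vocabulary=full_vocab)
X_train = vect.transform(text_train)
X_test = vect.transform(text_test)
assert X_train.shape[1] == X_test.shape[1]

(Building the vocabulary from the complete corpus does leak test-set terms into the feature space; fitting the vectorizer on the training part only and calling transform on the test part is the stricter alternative, since transform simply ignores unseen terms.)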

The documentation says:

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.

Further down you'll find the description of the vocabulary parameter:

vocabulary : Mapping or iterable, optional
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.
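
Both accepted forms look like this (the terms are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# As a Mapping: term -> column index (indices 0..len-1, no gaps, no repeats)
cv1 = CountVectorizer(vocabulary={'good': 0, 'bad': 1, 'movie': 2})

# As an iterable of terms: indices are assigned in iteration order
cv2 = CountVectorizer(vocabulary=['good', 'bad', 'movie'])

# With a fixed vocabulary, no fitting is needed before transform
print(cv1.transform(['a good movie', 'a bad movie']).toarray())
# [[1 0 1]
#  [0 1 1]]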
