I am trying to classify multiclass data at the token level using scikit-learn. I already have a train and test split. The tokens occur in batches of the same class, e.g. the first 10 tokens belong to class0, the next 20 belong to class4, and so on. The data is in the following tab-separated (\t) format:
-----------------
token tag
-----------------
way 6
to 6
reduce 6
the 6
amount 6
of 6
traffic 6
....
public 2
transport 5
is 5
a 5
key 5
factor 5
to 5
minimize 5
....
The data is distributed as follows:
             Training Data    Test Data
# Total:            119490        29699
# Class 0:           52631        13490
# Class 1:           35116         8625
# Class 2:           17968         4161
# Class 3:            8658         2088
# Class 4:            3002          800
# Class 5:            1201          302
# Class 6:             592          153
The code I am trying is:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
if __name__ == '__main__':
    # reading Files
    train_df = pd.read_csv(TRAINING_DATA_PATH, names=['token', 'tag'], sep='\t').dropna().reset_index(drop=True)
    test_df = pd.read_csv(TEST_DATA_PATH, names=['token', 'tag'], sep='\t')
    # getting training and testing data
    train_X = train_df['token']
    test_X = test_df['token'].astype('U')
    train_y = train_df['tag']
    test_y = test_df['tag'].astype('U')
    # Naive-Bayes
    nb_pipeline = Pipeline([('vect', CountVectorizer()),  # Counts occurrences of each word
                            ('tfidf', TfidfTransformer()),  # Reweights counts by inverse document frequency and L2-normalizes each row
                            ])
    f1_list = []
    cv = KFold(n_splits=5)
    for train_index, test_index in cv.split(train_X):
        train_text = train_X[train_index]
        train_label = train_y[train_index]
        val_text = train_X[test_index]
        val_y = train_y[test_index]
        vectorized_text = nb_pipeline.fit_transform(train_text)
        sm = SMOTE(random_state=42)
        # fit_sample was renamed to fit_resample in later imbalanced-learn releases
        train_text_res, train_y_res = sm.fit_resample(vectorized_text, train_label)
        print("\nTraining Data Class Distribution:")
        print(train_label.value_counts())
        print("\nRe-sampled Training Data Class Distribution:")
        print(train_y_res.value_counts())
        # clf = SVC(kernel='rbf', max_iter=1000, class_weight='balanced', verbose=1)
        clf = MultinomialNB()
        # clf = SGDClassifier(loss='log', penalty='l2', alpha=1e-3, max_iter=100, tol=None,
        #                     n_jobs=-1, verbose=1)
        clf.fit(train_text_res, train_y_res)
        predictions = clf.predict(nb_pipeline.transform(val_text))
        f1 = f1_score(val_y, predictions, average='macro')
        f1_list.append(f1)
    print(f1_list)
    # evaluate the classifier from the last fold on the held-out test data
    pred = clf.predict(nb_pipeline.transform(test_X))
    print('F1-macro: %s' % f1_score(test_y, pred, average='macro'))
I want to build n-grams and add them as features so the model can capture context better, but I am not sure how that would work, since testing would again be done at the token level. How can I build the n-grams, feed them to the model, and then predict at the token level for the test data?
Instead of:
nb_pipeline = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer())])
compute the counts and tf-idf weights for unigrams and bigrams in one step:
from sklearn.feature_extraction.text import TfidfVectorizer
nb_pipeline = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 2)))])
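To see what ngram_range=(1, 2) actually produces, here is a small sketch that fits the vectorizer on two toy inputs. The token list and the grouping into phrases are made up for illustration and are not from the original data. One thing it shows: when each document is a single token, as in the question's file, there is nothing to pair up, so only unigram features appear.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One token per document, as in the question's data layout.
single_tokens = ["public", "transport", "key", "factor"]
vec_tokens = TfidfVectorizer(ngram_range=(1, 2)).fit(single_tokens)
token_features = sorted(vec_tokens.vocabulary_)

# Several tokens per document (hypothetical grouping, for illustration only).
phrases = ["public transport", "key factor"]
vec_phrases = TfidfVectorizer(ngram_range=(1, 2)).fit(phrases)
phrase_features = sorted(vec_phrases.vocabulary_)

print(token_features)   # unigrams only
print(phrase_features)  # unigrams plus bigrams such as "public transport"
```

Bigram features therefore only show up if each row passed to the vectorizer contains more than one token.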
See the scikit-learn docs for more.
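Since the data has one token per row, one possible way to give the vectorizer some context while still predicting one label per token is to replace each token with a short window of its neighbours. The sketch below is an assumption about how that could be done, not part of the answer above; add_context, window and the "PAD" filler are hypothetical names chosen for illustration.

```python
def add_context(tokens, window=1, pad="PAD"):
    """Return one string per token: the token plus `window` neighbours on each side."""
    tokens = list(tokens)
    # Pad both ends so the first and last tokens also get a full window.
    padded = [pad] * window + tokens + [pad] * window
    return [" ".join(padded[i:i + 2 * window + 1]) for i in range(len(tokens))]


tokens = ["way", "to", "reduce", "the", "amount"]
contexts = add_context(tokens, window=1)
for token, context in zip(tokens, contexts):
    print(token, "->", context)
```

The output has the same length and order as the input, so the existing tag column still lines up row by row. The strings could then be passed to TfidfVectorizer(ngram_range=(1, 2)) in place of the raw tokens, applying the same function to the training and test tokens separately. Note that the vectorizer's default token pattern ignores one-character tokens such as "a".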