
How to train and test a simple binary classifier from a CSV file?

I made the CSV file below, which contains tweet bigrams, and I want to train a model to predict the labels. Most of the examples I found on the web use numerical features with additional parameters, which makes them hard to follow. I am asking for a very simple example to understand exactly what should be done in Python (using libraries like scikit-learn) to train and test a classification model (any model) on this very simple CSV dataset.

bigram, label
I love, 0
love you, 0
I hate, 1
hate you, 1
...

I hope this post helps other machine learning beginners as well.

You are trying to solve an NLP problem. Most machine learning algorithms cannot work on raw text, so you need to convert the text into numbers first. Python libraries such as spaCy and NLTK, as well as the text vectorizers in scikit-learn, are designed for this. Normally a vocabulary of words is built and each word is assigned a number. Each input then becomes a list of numbers, and the algorithms can be applied to it.
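
The vocabulary idea can be sketched with scikit-learn's CountVectorizer on the four bigrams from the question. This is only an illustration, not necessarily what the answer had in mind. The token_pattern override is an assumption added here so that one-letter words such as "I" are kept, since the default pattern only keeps tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The four bigrams shown in the question.
bigrams = ["I love", "love you", "I hate", "hate you"]

# Assumption: override token_pattern so single-character words ("I") survive.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(bigrams)

# Mapping from each (lowercased) word to its column index.
vocab = {word: int(idx) for word, idx in vectorizer.vocabulary_.items()}
print(vocab)

# One row per bigram, one column per vocabulary word, values are word counts.
print(X.toarray())
```

Each row of the printed matrix is the numeric form of one bigram, which is what a classifier actually receives.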

Here is some sample code. Note that there is much more to the topic than this.

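import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC  # alternative classifier; can replace MultinomialNB below

# Load the CSV. "bigrams.csv" is a placeholder name for your own file.
# skipinitialspace=True drops the space that follows each comma.
df = pd.read_csv("bigrams.csv", skipinitialspace=True)

# Hold out part of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    df["bigram"], df["label"], test_size=0.25, random_state=42)

# Naïve Bayes: TF-IDF features fed into a multinomial Naive Bayes classifier.
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
])

# Train on the training rows.
text_clf_nb.fit(X_train, y_train)

# Predict labels for the held-out rows.
predictions = text_clf_nb.predict(X_test)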
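
To cover the "test" half of the question, here is a minimal sketch that scores the same kind of pipeline with accuracy_score. It inlines the four rows from the question so that it runs without a file. The two test bigrams ("love it", "hate it") and their labels are made up for illustration and are not part of the original data.

```python
import csv
import io

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# The four rows shown in the question, inlined instead of read from disk.
CSV_TEXT = """bigram, label
I love, 0
love you, 0
I hate, 1
hate you, 1
"""

# skipinitialspace=True removes the space after each comma.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT), skipinitialspace=True))
X_train = [row["bigram"] for row in rows]
y_train = [int(row["label"]) for row in rows]

# Hypothetical held-out examples (not from the question), used only for scoring.
X_test = ["love it", "hate it"]
y_test = [0, 1]

text_clf_nb = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
text_clf_nb.fit(X_train, y_train)

# Compare predicted labels with the expected ones.
predictions = text_clf_nb.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(predictions, accuracy)
```

With a real dataset, the handwritten test rows would normally be replaced by a held-out split of the CSV (for example via train_test_split), and a larger test set is needed before the accuracy figure means much.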
