
SkLearn model for text classification

I have a multiclass classifier trained with the LinearSVC model provided by the Sklearn library. The model provides a decision_function method, which I use together with numpy functions to interpret the result set correctly.
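For reference, my usage looks roughly like this (a minimal sketch; the training texts and labels here are just placeholders for my real data):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Placeholder training data standing in for my real corpus
texts = ["integrals and matrices", "the roman empire", "cpu architectures"]
labels = ["maths", "history", "technology"]

vectorizer = TfidfVectorizer()
clf = LinearSVC().fit(vectorizer.fit_transform(texts), labels)

# decision_function returns one score per class; I pick the highest one
scores = clf.decision_function(vectorizer.transform(["some new text"]))
print(clf.classes_[np.argmax(scores)])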

But I don't understand why this method always distributes the total probability (which in my case is 1) across all of the possible classes.

I expected a different behavior from my classifier.

I mean, for example, suppose that I have a short piece of text like this:

"There are a lot of types of virus and bacterias that cause disease."

But my classifier was trained on three types of texts, say "maths", "history" and "technology".

So I think it makes a lot of sense for each of the three subjects to get a probability very close to zero (and therefore far from summing to 1) when I try to classify that text.

Is there a more appropriate method or model to obtain the results that I just described?

Am I using decision_function the wrong way?

Sometimes you may have text that has nothing to do with any of the subjects the classifier was trained on, or, conversely, text that fits more than one subject with a probability close to 1.

I think I need some light shed on these issues (text classification, non-binary classification, and so on).

Many thanks in advance for any help!

There are multiple parts to your question; I will try to answer as much as I can.

  1. Why does this method always try to distribute the total of probabilities?

That is the nature of most of the ML models out there: a given example has to be put into some class. Every model has some mechanism to compute the probability that a given data point belongs to each class, and the class with the highest probability is the one you predict.

To address your problem, i.e. examples that don't belong to any of the classes, you can always create a pseudo-class called other when you train the model. That way, even if a data point doesn't correspond to any of your actual classes (maths, history and technology in your example), it will be binned into the other class, as sketched at the end of this answer.

  2. Addressing the problem that your data point could possibly belong to multiple classes.

This is typically what multi-label classification is used for.
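Coming back to point 1, here is a minimal sketch of the pseudo-class idea (the training texts are placeholders; in practice the other bin would be filled with real out-of-domain examples):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder training data: the real classes plus an "other" bin
texts = [
    "algebra and calculus problems",         # maths
    "the french revolution of 1789",         # history
    "new smartphone processors announced",   # technology
    "recipes for baking sourdough bread",    # other
    "weather forecast for the weekend",      # other
]
labels = ["maths", "history", "technology", "other", "other"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

# Out-of-domain input now has a class it can be binned into
print(model.predict(["virus and bacteria cause disease"]))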

Hope this helps!

What you are looking for is a multi-label classification model. Refer to the scikit-learn documentation on multiclass and multilabel algorithms to understand multi-label classification and to see the list of models that support the multi-label classification task.

A simple example to demonstrate multi-label classification:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

categories = ['sci.electronics', 'sci.space', 'talk.religion.misc']
newsgroups_train = fetch_20newsgroups(subset='all',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

# One-hot encode the targets so each class becomes an independent
# binary label (on scikit-learn < 1.2 the argument is sparse=False)
X, y = newsgroups_train.data, OneHotEncoder(sparse_output=False)\
    .fit_transform([[newsgroups_train.target_names[i]]
                    for i in newsgroups_train.target])

# One LinearSVC per label, so a text can match any number of classes
model = make_pipeline(TfidfVectorizer(stop_words='english'),
                      MultiOutputClassifier(LinearSVC()))

model.fit(X, y)

print(newsgroups_train.target_names)
# ['sci.electronics', 'sci.space', 'talk.religion.misc']


print(model.predict(['religion followers of jesus']))
# [[0. 0. 1.]]


print(model.predict(['Upper Atmosphere Satellite Research ']))
# [[0. 1. 0.]]


print(model.predict(['There are a lot of types of virus and bacterias that cause disease.']))
# [[0. 0. 0.]]
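Note the last prediction: because each label gets its own independent binary classifier, a text that matches none of the subjects can be rejected by all of them ([[0. 0. 0.]]), which is exactly the behaviour the question asks for.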

A common way of dealing with this is to cast your text sample into some kind of vector space and to measure the "distance" between it and some archetypal positions within that same vector space that represent classifications.

This model of a classifier is convenient because a text sample collapsed into a vector of vocabulary frequencies is almost trivially expressed as a vector, where the dimensions are defined by the number of vocabulary features you choose to track.

By cluster analysis of a wider text corpus, you can determine central points that commonly occur within clusters, and you can describe these in terms of the vector positions at which they are located.

And finally, with a handful of cluster centres defined, you can simply Pythagoras your way to whichever of these topic clusters your chosen sample lies closest to. You also have at your fingertips the relative distances between your sample and all the other cluster centres, so the result is less a probability and more a spatial measure.
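A minimal sketch of that idea, using k-means cluster centres over tf-idf vectors (the corpus and the number of clusters are placeholder assumptions):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus standing in for a wider text collection
corpus = [
    "algebra geometry equations proofs",
    "kings wars empires revolutions",
    "computers software processors networks",
    "theorem integrals matrices calculus",
    "medieval trade routes and treaties",
    "operating systems compilers hardware",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Locate topic-cluster centres in the same vector space
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# kmeans.transform gives the distance from a sample to every centre,
# so you get all the relative distances, not just the nearest cluster
sample = vectorizer.transform(["virus and bacteria cause disease"])
distances = kmeans.transform(sample)[0]
print(distances)             # spatial measure against each topic cluster
print(np.argmin(distances))  # index of the closest cluster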
