
Multiple Output Machine Learning Model - Python

Hello everyone. I've tried searching this topic and haven't been able to find a good answer, so I was hoping someone could help me out. Let's say I am trying to create an ML model using scikit-learn and Python. I have a data set like this:

| Features | Topic   | Sub-Topic        |
|----------|---------|------------------|
| ...      | Science | Space            |
| ...      | Science | Engineering      |
| ...      | History | American History |
| ...      | History | European History |

My features column is just text, such as a small paragraph from an essay. I want to use ML to predict both the topic and the sub-topic of that text.

I know I would need some sort of NLP to analyze the text, such as spaCy. The part I am confused about is having two output variables: topic and sub-topic. I've read that scikit-learn has something called MultiOutputClassifier, but there is also something called multiclass classification, so I'm a little confused about which route to take.

Could someone please point me in the right direction as to which classifier to use or how to achieve this?

So multiclass just means there are multiple classes in one target variable. Multioutput means we have more than one target variable. Here we have a multiclass-multioutput problem: two target variables (topic and sub-topic), each with multiple classes.
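The distinction is easiest to see in the shape of `y`. A small sketch (made-up labels, for illustration only):

```python
import numpy as np

# Multiclass: a single target column holding more than two classes
y_multiclass = np.array(["Science", "History", "Math"])  # shape (3,)

# Multiclass-multioutput: several target columns, each itself multiclass
y_multioutput = np.array([["Science", "Space"],
                          ["History", "American History"],
                          ["Science", "Engineering"]])   # shape (3, 2)

print(y_multiclass.shape)   # (3,)
print(y_multioutput.shape)  # (3, 2)
```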

scikit-learn natively supports multiclass-multioutput for the classifiers below:

sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.neighbors.KNeighborsClassifier
sklearn.neighbors.RadiusNeighborsClassifier
sklearn.ensemble.RandomForestClassifier
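For estimators not on this list, scikit-learn provides the MultiOutputClassifier wrapper, which fits one independent copy of the base estimator per target column. A minimal sketch with random dummy data (LogisticRegression chosen here just as an example of an estimator without native multioutput support):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

np.random.seed(0)
X = np.random.randn(20, 3)
y = np.random.randint(0, 2, size=(20, 2))  # two binary target columns

# One LogisticRegression is fitted per column of y
clf = MultiOutputClassifier(LogisticRegression()).fit(X, y)
pred = clf.predict(X)
print(pred.shape)  # (20, 2) -- one prediction per target column
```

Note this treats the two targets as independent, so it cannot exploit the hierarchy between topic and sub-topic.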

I'd suggest starting with RandomForestClassifier, as it usually gives good results out of the box.

Here is a dummy example to demonstrate the RandomForestClassifier API with multiple targets.

### Dummy example, only to test the API
import numpy as np
from sklearn.ensemble import RandomForestClassifier

np.random.seed(0)
X = np.random.randn(10, 2)
y1 = (X[:, [0]] > .5).astype(int)  # make dummy y1
y2 = (X[:, [1]] < .5).astype(int)  # make dummy y2
y = np.hstack([y1, y2])            # y has 2 columns
print("X = ", X, sep="\n", end="\n\n")
print("y = ", y, sep="\n", end="\n\n")
rfc = RandomForestClassifier().fit(X, y)  # the same fit API works for a multi-column y
out = rfc.predict(X)
print("Output = ", out, sep="\n")

Output

X = 
[[ 1.76405235  0.40015721]
 [ 0.97873798  2.2408932 ]
 [ 1.86755799 -0.97727788]
 [ 0.95008842 -0.15135721]
 [-0.10321885  0.4105985 ]
 [ 0.14404357  1.45427351]
 [ 0.76103773  0.12167502]
 [ 0.44386323  0.33367433]
 [ 1.49407907 -0.20515826]
 [ 0.3130677  -0.85409574]]

y = 
[[1 1]
 [1 0]
 [1 1]
 [1 1]
 [0 1]
 [0 0]
 [1 1]
 [0 1]
 [1 1]
 [0 1]]

Output = 
[[1 1]
 [1 0]
 [1 1]
 [1 1]
 [0 1]
 [0 0]
 [1 1]
 [0 1]
 [1 1]
 [0 1]]
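Tying this back to the original text-classification question: features extracted from text plug into the same API. A hedged sketch using TfidfVectorizer for the NLP step (the corpus below is made up purely for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus mirroring the question's table
texts = [
    "The rocket launched into orbit around the planet",
    "The bridge design used new engineering materials",
    "The revolution changed American political history",
    "The treaty reshaped European borders after the war",
]
topics = ["Science", "Science", "History", "History"]
subtopics = ["Space", "Engineering", "American History", "European History"]
y = np.column_stack([topics, subtopics])  # shape (4, 2): topic + sub-topic

# TF-IDF turns each paragraph into a feature vector;
# RandomForest then predicts both target columns at once
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
model.fit(texts, y)
pred = model.predict(["A new satellite was placed in orbit"])
print(pred)  # one (topic, sub-topic) pair per input document
```

With a real data set you would of course split into train/test sets and evaluate per target column.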

On a side note, since this is an NLP-related model, you could also try Keras's multi-output API to train a neural network, which may give better results.
