
Multi-target prediction with dependent variables that are both classification and regression?

I have two inputs as my independent variables and I want to predict 3 dependent variables based on them.

Of my 3 dependent variables, 2 are multi-class categorical and 1 is continuous. Below are my target variables.

typeid_encoded , reporttype_encoded , log_count

typeid_encoded and reporttype_encoded are categorical, and each has at least 5 different categories.

log_count is a continuous variable.

I have googled a lot; all I found is advice to use two different models, but I couldn't find any example of doing so. Could you please post an example?

Or is there any other approach, such as using neural networks, that makes it possible to do this in one model?

I need an example using scikit-learn. Thanks in advance!

There's nothing in sklearn that's designed for this, but there are a few little tricks you can use to build models like this.

Word of caution: these are not necessarily ideal for your problem; it's very hard to guess what would work for your data.

The two that first came to mind were KNN and Random Forests, but you can essentially adapt any multi-output regression algorithm to do this.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import NearestNeighbors

# Create some data to look like yours
n_samples = 100
n_features = 5

X = np.random.random((n_samples, n_features))
y_classification = np.random.random((n_samples, 2)).round()
y_regression = np.random.random((n_samples))

y = np.hstack((y_classification, y_regression[:, np.newaxis]))

Now I have a data set with two binary target variables and one continuous one.

Start with KNN. You could do this with KNeighborsRegressor as well, but I felt this illustrates the solution better.

# use an odd number to prevent tie-breaks
nn = NearestNeighbors(n_neighbors=5)
nn.fit(X, y)

idxs = nn.kneighbors(X, return_distance=False)
# take the average of the nearest neighbours to get the predictions
y_pred = y[idxs].mean(axis=1)
# all predictions will be continuous, so round the classification columns back to 0/1
y_pred[:, :2] = y_pred[:, :2].round()
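As mentioned above, KNeighborsRegressor handles multi-output targets directly, so a roughly equivalent sketch (assuming the same X and y as above) could be:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# KNeighborsRegressor averages the neighbours' targets for us,
# so fit/predict replaces the manual neighbour lookup above
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, y)

y_pred = knn.predict(X)
# round the two classification columns back to 0/1 labels
y_pred[:, :2] = y_pred[:, :2].round()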

Now our y_pred holds the predictions for both the classification and the regression targets. So now let's look at a Random Forest.

# use an odd number of trees to prevent predictions of 0.5
rf = RandomForestRegressor(n_estimators=11)
rf.fit(X, y)
y_pred = rf.predict(X)

# all predictions will be continuous, so round the classification columns back to 0/1
y_pred[:, :2] = y_pred[:, :2].round()
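If you want to sanity-check the predictions, one option (just a sketch, scoring on the training data purely for illustration) is to evaluate each target with its own metric:

from sklearn.metrics import accuracy_score, mean_squared_error

# accuracy on the two rounded classification columns, MSE on the continuous column
acc_1 = accuracy_score(y[:, 0], y_pred[:, 0])
acc_2 = accuracy_score(y[:, 1], y_pred[:, 1])
mse = mean_squared_error(y[:, 2], y_pred[:, 2])
print(acc_1, acc_2, mse)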

I'd say these 'hacks' are pretty reasonable, because they aren't too far away from how the classification versions of these algorithms work.

If you have a multiclass problem that you have one-hot encoded, then instead of rounding the probability to a binary class, as I have done above, you will need to choose the class with the highest probability. You can do this pretty simply with something like this:

n_classes_class1 = 3
n_classes_class2 = 4
y_pred_class1 = np.argmax(y_pred[:, :n_classes_class1], axis=1)
y_pred_class2 = np.argmax(y_pred[:, n_classes_class1:-1], axis=1)
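For your actual targets, a hedged sketch of building such a one-hot encoded y from label-encoded columns like typeid_encoded and reporttype_encoded (the random placeholder data here is just for illustration) might look like:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

n_samples = 100
# placeholder label-encoded targets with 5 categories each, plus a continuous count
typeid_encoded = np.random.randint(0, 5, size=(n_samples, 1))
reporttype_encoded = np.random.randint(0, 5, size=(n_samples, 1))
log_count = np.random.random(n_samples)

enc = OneHotEncoder(sparse=False)  # use sparse_output=False on newer sklearn versions
y_onehot = enc.fit_transform(np.hstack((typeid_encoded, reporttype_encoded)))

# the final y has all one-hot classification columns first, then the continuous column last
y = np.hstack((y_onehot, log_count[:, np.newaxis]))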
