Dealing with dimensions in scikit-learn tree.DecisionTreeClassifier

Question

I am trying to fit a decision tree using scikit-learn with three-dimensional training data and two-dimensional target data. As a simple example, imagine an RGB image. Let's say my target data is 1s and 0s, where 1 represents the presence of a human face and 0 represents its absence. Take for example:

red         green        blue        face presence  

1000        0001         0011        0000    
0110        0110         0001        0110    
0110        0110         0000        0110

A 3D array of the RGB data would be the training data, and the 2D array would be my target classes (face, no face).

In Python these arrays may look like:

import numpy as np

rgb = np.array([[[1,0,0,0],[0,1,1,0],[0,1,1,0]],
               [[0,0,0,1],[0,1,1,0],[0,1,1,0]],
               [[0,0,1,1],[0,0,0,1],[0,0,0,0]]])

face = np.array([[0,0,0,0],[0,1,1,0],[0,1,1,0]])

Unfortunately, fitting a classifier on them directly doesn't work:

from sklearn import tree
dt_clf = tree.DecisionTreeClassifier()
dt_clf = dt_clf.fit(rgb, face)

This raises the following error:

Found array with dim 3. Expected <= 2

I have tried reshaping and flattening the data in several ways, but then I get a different error:

Number of labels=xxx does not match number of samples

Does anyone know how I can use tree.DecisionTreeClassifier to accomplish this? Thanks.

Answer 1

I think I have figured it out. It's not very pretty. Maybe someone can offer some help cleaning up the code. Basically, I needed to organize the RGB data as an array of twelve 3-element arrays (one [red, green, blue] triple per pixel), i.e. shape (12, 3). For example...

np.hsplit(np.dstack(rgb).flatten(), len(face.flatten()))

I also flatten the face data, so my final fit call becomes...

dt_clf = dt_clf.fit(np.hsplit(np.dstack(rgb).flatten(), len(face.flatten())), 
                    face.flatten())
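
For what it's worth, the same (12, 3) layout can probably be produced without the stack/flatten/split round trip. This is a minimal sketch, assuming the channel axis comes first as in rgb above; the names X, y, and X_split are only illustrative:

```python
import numpy as np

rgb = np.array([[[1,0,0,0],[0,1,1,0],[0,1,1,0]],
                [[0,0,0,1],[0,1,1,0],[0,1,1,0]],
                [[0,0,1,1],[0,0,0,1],[0,0,0,0]]])
face = np.array([[0,0,0,0],[0,1,1,0],[0,1,1,0]])

# Move the channel axis (axis 0) to the end, giving shape (3, 4, 3),
# then collapse rows and columns into one "pixel" axis: shape (12, 3).
X = np.moveaxis(rgb, 0, -1).reshape(-1, 3)

# One label per pixel, in the same row-major order: shape (12,).
y = face.ravel()

# The hsplit-based version from the answer, for comparison.
X_split = np.array(np.hsplit(np.dstack(rgb).flatten(), len(face.flatten())))

print(X.shape, y.shape)            # (12, 3) (12,)
print(np.array_equal(X, X_split))  # True
```

Calling dt_clf.fit(X, y) should then behave the same as the fit call above, since both versions hand scikit-learn one row per pixel and one label per row.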

Now I can test a new dataset and see if it works. In the training data, a face was present exactly where both the red and green pixels were set, so a good test might be...

red         green        blue 

1100        1100         0011  
1100        1100         0001  
0000        0000         0000

or...

predict = np.array([[[1,1,0,0],[1,1,0,0],[0,0,0,0]],
                    [[1,1,0,0],[1,1,0,0],[0,0,0,0]],
                    [[0,0,1,1],[0,0,0,1],[0,0,0,0]]])

so...

predicted = dt_clf.predict(np.hsplit(np.dstack(predict).flatten(),
                           len(face.flatten())))

and to get it back in the proper dimensions...

predicted = np.array(np.hsplit(predicted, face.shape[0]))

which yields us

array([[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 0, 0]])

Wonderful! Now to see if this works on something bigger. Please feel free to offer suggestions to make this cleaner.
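
One possible cleanup, offered as a sketch and not the definitive way to do it: wrap the two reshapes in small helpers. The names to_samples and to_image are hypothetical, and the code assumes the channel axis comes first, as in the arrays above.

```python
import numpy as np
from sklearn import tree

def to_samples(channels):
    """(n_channels, rows, cols) -> (rows * cols, n_channels), one row per pixel."""
    return np.moveaxis(channels, 0, -1).reshape(-1, channels.shape[0])

def to_image(flat, shape):
    """(rows * cols,) -> (rows, cols), undoing the per-pixel flattening."""
    return np.asarray(flat).reshape(shape)

rgb = np.array([[[1,0,0,0],[0,1,1,0],[0,1,1,0]],
                [[0,0,0,1],[0,1,1,0],[0,1,1,0]],
                [[0,0,1,1],[0,0,0,1],[0,0,0,0]]])
face = np.array([[0,0,0,0],[0,1,1,0],[0,1,1,0]])
predict = np.array([[[1,1,0,0],[1,1,0,0],[0,0,0,0]],
                    [[1,1,0,0],[1,1,0,0],[0,0,0,0]],
                    [[0,0,1,1],[0,0,0,1],[0,0,0,0]]])

# Train on one (r, g, b) row per pixel, with one label per pixel.
dt_clf = tree.DecisionTreeClassifier()
dt_clf.fit(to_samples(rgb), face.ravel())

# Predict per pixel, then fold the flat labels back into image shape.
predicted = to_image(dt_clf.predict(to_samples(predict)), face.shape)
print(predicted)
```

On this toy data the result should match the 3x4 array above, because every pixel value in predict also appears in the training set. For larger images the same helpers should apply as long as the channel axis stays first.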

solution1
0 ACCEPTED 2015-06-13 03:57:01
