
Scikit-Learn issues error for RandomForestClassifier for multilabel classification - Jagged arrays

Scikit-Learn RandomForestClassifier throws an error for a multilabel classification problem.

  1. This code fits a RandomForestClassifier on predictors C and multilabel targets out with no error:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

C = np.array([[2,4,6],[4,2,1],[8,3,1]])
out = np.array([[0,1],[0,1],[1,0]])
rf = RandomForestClassifier(n_estimators=100, oob_score=True)
rf.fit(C,out)
  2. If I modify the multilabels so that every element at a certain index is the same (here, the first component of every multilabel equals zero):
out = np.array([[0,1],[0,1],[0,0]])
rf.fit(C,out)

I get a warning, followed by an error and traceback:

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a 
list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. 
If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  y_pred = np.array(y_pred, copy=False)

raise ValueError(
    507             "The type of target cannot be used to compute OOB "
    508             f"estimates. Got {y_type} while only the following are "
    509             "supported: continuous, continuous-multioutput, binary, "
    510             "multiclass, multilabel-indicator."
    511         )
ValueError: could not broadcast input array from shape (2,1) into shape (2,)
  3. Not requesting OOB predictions does not result in an error:
rf_err = RandomForestClassifier(n_estimators=100, oob_score=False)
rf_err.fit(C,out)

I cannot figure out why requesting OOB predictions triggers this error when the n-th component of every multilabel is the same.

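In your setup, out_err = np.array([[0,1],[0,1],[0,0]]), the first label column is all zeros. That output has no examples of a second class, so it contains elements of only one class.

That means the 'class label' dimension for that output has length 1, while the other output's has length 2, so the per-output predictions no longer line up into one regular array. That appears to be why the warning mentions ragged sequences and the error mentions the shapes (2,1) and (2,).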
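The mismatch can be inspected directly. Below is a minimal sketch, assuming scikit-learn's multi-output behavior in which classes_ and predict_proba hold one array per label column. It fits without OOB scoring (which the question reports as working) and prints the per-output classes and probability shapes. The values n_estimators=10 and random_state=0 are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

C = np.array([[2, 4, 6], [4, 2, 1], [8, 3, 1]])
out_err = np.array([[0, 1], [0, 1], [0, 0]])  # first label column is all zeros

# Fit without OOB scoring, which the question reports as working.
rf = RandomForestClassifier(n_estimators=10, oob_score=False, random_state=0)
rf.fit(C, out_err)

# For multi-output targets, classes_ is a list with one array per label column.
classes_per_output = [c.tolist() for c in rf.classes_]

# predict_proba returns one (n_samples, n_classes_k) array per label column.
proba_shapes = [p.shape for p in rf.predict_proba(C)]

print(classes_per_output)  # [[0], [0, 1]]
print(proba_shapes)        # [(3, 1), (3, 2)]
```

The two probability arrays have different widths, so stacking them into a single ndarray gives a ragged result.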

Please describe your initial intent: why do you need to set a particular position in the labels to 0? If you want to work with N-1 classes instead of N, I suggest removing that position from the labels and removing the samples of that class from the dataset, rather than filling the position with zeros:

out=[[1,0,0],[0,1,0],[0,1,0],[0,0,1],[1,0,0]]  # 3 classes
# remove the second class:
out=[[1,0],[0,1],[1,0]]  # 2 classes
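Applied to the data from the question, one way to do this is to drop every label column that holds a single value before fitting. This is only a sketch under assumptions: drop_constant_label_columns is a hypothetical helper, not part of scikit-learn, and random_state=0 is an arbitrary choice. Here no sample carries the removed label, so no rows need to be removed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def drop_constant_label_columns(labels):
    """Hypothetical helper: keep only label columns with more than one distinct value."""
    labels = np.asarray(labels)
    keep = np.array(
        [np.unique(labels[:, k]).size > 1 for k in range(labels.shape[1])]
    )
    return labels[:, keep], keep


C = np.array([[2, 4, 6], [4, 2, 1], [8, 3, 1]])
out_err = np.array([[0, 1], [0, 1], [0, 0]])

out_reduced, keep = drop_constant_label_columns(out_err)

# If a single column remains, pass a 1-D target to avoid a column-vector warning.
target = out_reduced.ravel() if out_reduced.shape[1] == 1 else out_reduced

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(C, target)
print(keep, rf.oob_score_)
```

The keep mask records which label positions survived, so predictions can be mapped back to the original label layout. The sketch does not handle the case where every column is constant.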


 