sklearn: Classifier to predict on a MaskedArray

I am trying to figure out how to run a classifier prediction on a NumPy masked array (instead of a regular NumPy array). Here is my code:

# My masked array on which to perform the prediction
>>> type(patch)
    numpy.ma.core.MaskedArray
>>> patch.shape
    (3,3,14)
# This is what the first layer along the third dimension looks like.
>>> patch[:,:,0]
    masked_array(
  data=[[90, 28, 16],
        [79, 32, --],
        [41, --, --]],
  mask=[[False, False, False],
        [False, False,  True],
        [False,  True,  True]],
  fill_value=999999,
  dtype=uint16)

The code above shows the first layer along the third dimension. There are 14 layers, as you can see from patch.shape, and in each of them the positions (1,2), (2,1) and (2,2) are masked.

Now I use a pre-trained RandomForest classifier, cl, to assign each position of the patch one of the class ids 1, 4 or 6. I would like the classifier to ignore the masked values during classification, but after doing:

>>> class_pred = cl.predict(patch.reshape(-1, patch.shape[2]))
>>> class_pred = class_pred.reshape(patch[:,:,0].shape)

I get:

>>> class_pred 
    array([[4, 4, 4],
           [4, 4, 1],
           [4, 1, 1]])

So the positions (1,2), (2,1) and (2,2) are no longer masked; they were classified along with everything else.

Is there a way to force the classifier to ignore the masked values during classification, in order to obtain something like this:

masked_array(
  data=[[4, 4, 4],
        [4, 4, --],
        [4, --, --]],
  mask=[[False, False, False],
        [False, False,  True],
        [False,  True,  True]],
  fill_value=999999,
  dtype=uint16)

As far as I know, the answer right now is that scikit-learn ignores the mask on any data you pass in. Whatever value sits underneath a masked entry is used by the classifier for both fit and predict, which is why you get a class value at those positions.
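One possible workaround is to do the masking yourself: predict only on the unmasked positions and write the results back into a masked array. The following is a minimal sketch, not a built-in scikit-learn feature. The helper name predict_masked, the randomly generated training data and the stand-in patch are all made up for illustration, and it assumes a position should be skipped if any of its 14 layers is masked.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def predict_masked(clf, patch):
    """Predict only on the unmasked positions of a (rows, cols, bands) masked array.

    Hypothetical helper: scikit-learn itself has no such function.
    """
    rows, cols, bands = patch.shape
    # Treat a position as invalid if any of its bands is masked (assumption).
    pixel_mask = np.ma.getmaskarray(patch).any(axis=2)

    # Feature matrix holding only the valid positions: shape (n_valid, bands).
    X_valid = np.ma.getdata(patch)[~pixel_mask]

    # Placeholder output; masked cells keep the placeholder but stay hidden.
    pred = np.zeros((rows, cols), dtype=clf.classes_.dtype)
    if X_valid.shape[0] > 0:
        pred[~pixel_mask] = clf.predict(X_valid)
    return np.ma.masked_array(pred, mask=pixel_mask)


# Made-up training data standing in for the pre-trained classifier `cl`.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 100, size=(60, 14))
y_train = rng.choice([1, 4, 6], size=60)
cl = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# Made-up patch with the same shape and mask layout as in the question.
data = rng.integers(0, 100, size=(3, 3, 14)).astype(np.uint16)
mask2d = np.array([[False, False, False],
                   [False, False, True],
                   [False, True, True]])
mask = np.repeat(mask2d[:, :, None], 14, axis=2)
patch = np.ma.masked_array(data, mask=mask)

class_pred = predict_masked(cl, patch)
print(class_pred)
```

The result has the same 2-D mask as the input patch, so (1,2), (2,1) and (2,2) stay masked and the classifier never sees the values underneath them.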

For your specific case: how important is it that the input has a matrix structure? If the same positions are always masked (e.g. because the inputs are triangular arrays), you might want to unravel them into vectors. People do that even for full square matrices such as images (think of a ConvNet, for example).
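As a rough illustration of the unravelling idea, the sketch below keeps only the unmasked positions of a patch and flattens them into one fixed-length vector. The data is made up, and it assumes the mask layout is the same for every patch, so the vector length never changes.

```python
import numpy as np

# Made-up patch with the mask layout from the question.
mask2d = np.array([[False, False, False],
                   [False, False, True],
                   [False, True, True]])
data = np.arange(3 * 3 * 14).reshape(3, 3, 14)
mask = np.repeat(mask2d[:, :, None], 14, axis=2)
patch = np.ma.masked_array(data, mask=mask)

# Boolean (3, 3) array marking positions with no masked band.
valid = ~np.ma.getmaskarray(patch).any(axis=2)

# Select the 6 valid positions (each with 14 bands) and flatten them.
features = np.ma.getdata(patch)[valid].ravel()
print(features.shape)  # (84,)
```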

In a broader sense, if what you are doing is representing missing values, then I must say that this kind of ML is still at an embryonic stage (though advancing quickly). I can recommend the book "Statistical Analysis with Missing Data", which covers quite a few algorithms.
