
Handling missing (NaN) values in sklearn.preprocessing

I'm trying to normalize data with missing (i.e. NaN) values before processing it, using scikit-learn's preprocessing module.

Apparently, some scalers (e.g. StandardScaler) handle the missing values the way I want, by which I mean they normalize the existing values while keeping the NaNs, while others (e.g. Normalizer) just raise an error.

I've looked around and haven't found an answer: how can I use the Normalizer with missing values, or replicate its behavior some other way (with norm='l1' and norm='l2'; I need to test several normalization options)?

from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np

data = np.array([0,1,2,np.nan, 3,4])

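# StandardScaler ignores NaNs when fitting and keeps them in the output
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data.reshape(-1,1))

# Normalizer raises a ValueError because the input contains NaN
normalizer = Normalizer(norm='l2')
normalizer.fit_transform(data.reshape(-1,1))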

The problem with your request is that Normalizer works as follows, according to the documentation:

Normalize samples individually to unit norm.

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one (source: the scikit-learn documentation for Normalizer)

That means each row is rescaled to have unit norm. How should a missing value be handled? Ideally, it seems you want it left out of the norm so that the row is normalized from its remaining values, but the internal function check_array prevents this by raising an error.
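To make the row-wise behavior concrete, here is a minimal sketch. The array values are made up for illustration, and the ValueError reflects the behavior of the scikit-learn versions I am aware of, so treat it as an assumption for your installed version.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two samples (rows) with two features each; values are illustrative only.
X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

l2_rows = Normalizer(norm='l2').fit_transform(X)
l1_rows = Normalizer(norm='l1').fit_transform(X)

# Each row is rescaled on its own, so every row ends up with unit norm.
l2_norms = np.linalg.norm(l2_rows, axis=1)
l1_norms = np.abs(l1_rows).sum(axis=1)

# A single NaN anywhere in the input is rejected during input validation.
X_nan = np.array([[3.0, np.nan],
                  [1.0, 1.0]])
try:
    Normalizer(norm='l2').fit_transform(X_nan)
    raised = False
except ValueError:
    raised = True
```

Note that the rows are independent of each other: changing the second row does not affect how the first one is scaled.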

You need to work around this. A reasonable way to do it is to:

  1. first create a mask recording which elements of your array are missing
  2. create a result array filled with missing values
  3. apply the Normalizer to your array after selecting only the valid entries
  4. write the normalized values into the result array at their original positions

Here is some code detailing the process, based on your example:

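from sklearn.preprocessing import Normalizer
import numpy as np

data = np.array([0, 1, 2, np.nan, 3, 4])

# record which entries are missing and which are valid
nan_mask = np.isnan(data)
valid_mask = ~nan_mask

normalizer = Normalizer(norm='l2')

# create a result array pre-filled with NaN
result = np.full(data.shape, np.nan)

# normalize only the valid entries and write them back at their original positions
# note: reshape(-1, 1) makes every value its own one-element sample (row);
# use reshape(1, -1) instead to normalize the valid values together as one sample
valid = data[valid_mask]
result[valid_mask] = normalizer.fit_transform(valid.reshape(-1, 1)).reshape(valid.shape)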
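If your real data is a 2-D matrix with NaNs scattered across rows, the masking approach above gets awkward, because each row keeps a different number of valid entries. A possible alternative is to compute the row norms yourself with NumPy's NaN-aware reductions. The sketch below is my own; `nan_normalize` is a hypothetical helper, not part of scikit-learn, and it only covers the 'l1' and 'l2' norms mentioned in the question.

```python
import numpy as np

def nan_normalize(X, norm='l2'):
    """Row-wise normalization that ignores NaNs (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if norm == 'l1':
        norms = np.nansum(np.abs(X), axis=1, keepdims=True)
    elif norm == 'l2':
        norms = np.sqrt(np.nansum(X ** 2, axis=1, keepdims=True))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    # Rows with zero norm are left unchanged, as Normalizer does for all-zero rows.
    norms[norms == 0] = 1.0
    # NaN entries stay NaN after the division.
    return X / norms

data = np.array([0, 1, 2, np.nan, 3, 4])

# Treat the whole vector as ONE sample (one row), not as six one-element samples.
result_l2 = nan_normalize(data.reshape(1, -1), norm='l2').ravel()
result_l1 = nan_normalize(data.reshape(1, -1), norm='l1').ravel()
```

Here the valid values of each row are scaled so that their norm is one, and the NaN positions are preserved without any masking or copying back.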
