I'm new to Python and pandas. I'm trying to preprocess a large DataFrame that contains both numerical and categorical features, and some columns contain NaN values. First I try to get the feature matrix, and then I use Imputer to replace the NaN values with the mean or median.
This is the DataFrame:
MSSubClass MSZoning LotFrontage LotArea Street LotShape LandContour \
0 60 RL 65.0 8450 Pave Reg Lvl
1 20 RL 80.0 9600 Pave Reg Lvl
2 60 RL 68.0 11250 Pave IR1 Lvl
3 70 RL 60.0 9550 Pave IR1 Lvl
4 60 RL 84.0 14260 Pave IR1 Lvl
5 50 RL 85.0 14115 Pave IR1 Lvl
6 20 RL 75.0 10084 Pave Reg Lvl
7 60 RL NaN 10382 Pave IR1 Lvl
8 50 RM 51.0 6120 Pave Reg Lvl
9 190 RL 50.0 7420 Pave Reg Lvl
10 20 RL 70.0 11200 Pave Reg Lvl
11 60 RL 85.0 11924 Pave IR1 Lvl
Code (just to replace the NaN values in LotFrontage, column index 2, with the mean of the column):
imputer = Imputer(missing_values='Nan',strategy="mean",axis=0)
features = reduced_data.iloc[:,:-1].values
imputer.fit(features[:,2])
When I run this, I get the following error:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
First: is my approach correct? Second: how do I handle the error?
Thanks.
Note the difference between Nan and NaN (capital N at the end): you have used Nan.
imputer = Imputer(missing_values='NaN',strategy="mean",axis=0)
Replace 'Nan' with 'NaN' and you won't get this error.
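To see why the misspelling matters, here is a minimal sketch that uses only NumPy. It assumes (the post does not confirm this) that the old Imputer ends up passing any missing_values string other than 'NaN' to np.isnan, which would explain the TypeError in the question. The helper name isnan_error is made up for illustration.

```python
import numpy as np

def isnan_error(value):
    """Return the TypeError message np.isnan raises for value, or None."""
    try:
        np.isnan(value)
    except TypeError as exc:
        return str(exc)
    return None

# A string is not a numeric type, so np.isnan cannot handle it.
print(isnan_error('Nan'))

# A real float NaN is fine, so no error message comes back.
print(isnan_error(np.nan))
```

Passing np.nan itself, rather than a string, avoids depending on the exact spelling.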
Try this; it is an example of working code:
import numpy as np
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values=np.nan, strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
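Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; its replacement is sklearn.impute.SimpleImputer. For newer versions, a rough equivalent might look like the sketch below. The small array X is made-up sample data, not taken from the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix; columns 1 and 2 contain missing values.
X = np.array([
    [1.0, 65.0, 8450.0],
    [2.0, np.nan, 9600.0],
    [3.0, 68.0, np.nan],
    [4.0, 60.0, 9550.0],
])

# SimpleImputer always works column-wise, so there is no axis argument.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit on the two columns and write the imputed values back in place.
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
print(X)
```

The input is a 2-D slice (X[:, 1:3]), which is what scikit-learn estimators expect; a 1-D slice such as X[:, 1] would be rejected.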
I guess that, because of the string 'Nan', your LotFrontage column is stored with the object data type. You can check with the line below; it will most probably print object.
print(reduced_data.LotFrontage.values.dtype)
Imputer only works on numeric (float) data.
1st Approach:
You can do the following: 1) convert the column to float, 2) find the mean of the LotFrontage column, 3) use the pandas fillna function to fill the NaNs in that column.
reduced_data.LotFrontage = pd.to_numeric(reduced_data.LotFrontage, errors='coerce')
m = reduced_data.LotFrontage.mean(skipna=True)
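reduced_data.LotFrontage = reduced_data.LotFrontage.fillna(m)
The code above fills the NaNs in the LotFrontage column with the column mean. Note that fillna returns a new object, so the result has to be assigned back.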
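Put together, the first approach could be sketched as follows on a small sample. The DataFrame is rebuilt by hand from a few rows of the question's data, and the string 'Nan' entry is an assumption about how the missing value might have ended up in the column.

```python
import pandas as pd

# Hand-built sample; the string 'Nan' is assumed here to force the object dtype.
reduced_data = pd.DataFrame({
    'MSZoning': ['RL', 'RL', 'RL', 'RM'],
    'LotFrontage': [65.0, 80.0, 'Nan', 51.0],
    'LotArea': [8450, 9600, 10382, 6120],
})
print(reduced_data['LotFrontage'].dtype)  # object, because of the string

# Convert to float; anything that cannot be parsed becomes NaN.
reduced_data['LotFrontage'] = pd.to_numeric(
    reduced_data['LotFrontage'], errors='coerce'
)

# mean() skips NaN values by default.
m = reduced_data['LotFrontage'].mean()

# Fill only this column and assign the result back.
reduced_data['LotFrontage'] = reduced_data['LotFrontage'].fillna(m)
print(reduced_data)
```

Filling only the one column keeps the LotFrontage mean from leaking into other columns that happen to contain NaN.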
2nd Approach:
reduced_data.LotFrontage = pd.to_numeric(reduced_data.LotFrontage, errors='coerce')
imputer = Imputer()
features = reduced_data[['LotFrontage']].values  # 2-D array with a single numeric column
imputer.fit(features)
In the missing_values parameter, use 'NaN' instead of 'Nan': imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
This should work:
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(df.iloc[:, 2:3])