I want to evaluate categorical data in Python with a decision tree. I want to use the categorical data and use binning to create categorical labels. Do I have to? The problem is that get_dummies returns a dataframe with a different length than the values I passed in: it is two rows shorter than the original data. Previously I tried to use LabelEncoder but could not get it to work, so I tried get_dummies from pandas, which seemed easier to me. I checked the reference for the get_dummies function and searched for the problem, but could not find out why the result is shorter.
Doing the binning:
est = bine(n_bins=50, encode='ordinal', strategy='kmeans')
cat_labels = est.fit_transform(np.array(quant_labels).reshape(-1, 1))
Extract the categorical data (do I have to?):
category = rd.select_dtypes(exclude=['number']).astype("category")
category = category.replace(math.nan, "None")
category = category.replace(0, "None")
Prepare the split:
one_hot_features = pd.get_dummies(category[1:-1])
X_train, X_test, y_train, y_test = train_test_split(one_hot_features, cat_labels, test_size = 0.6, random_state = None)
The error is:
ValueError: Found input variables with inconsistent number of samples: [1458, 1460]
The correct number of samples is 1460. The one-hot encoded data is two samples short. Why is that?
When you encode your data you use category[1:-1]. This encodes only the elements from the second through the second-to-last.
Explanation:
1) Indexes are zero-based, so 1 is the index of the second item.
2) An index of -1 refers to the last element, and the end of a slice is exclusive, so a slice ending at -1 stops at the second-to-last element.
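The two points above can be checked with a plain Python list. This is only a sketch: the list `rows` is a made-up stand-in with the same length as the data in the question, not the real dataframe.

```python
# Hypothetical stand-in: a list with as many entries as the original data (1460).
rows = list(range(1460))

trimmed = rows[1:-1]  # starts at index 1 (second item), stops before the last item
full = rows[:]        # no bounds given: every item is kept

print(len(rows), len(trimmed), len(full))  # 1460 1458 1460
print(trimmed[0], trimmed[-1])             # 1 1458 (first and last items were dropped)
```

The slice drops one element at each end, which accounts for the difference between 1458 and 1460 in the error message.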
Solution: Change your line to one_hot_features = pd.get_dummies(category[:]), or simply one_hot_features = pd.get_dummies(category).
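A small sketch of the difference, assuming a toy dataframe in place of the real one. The column names, values, and placeholder labels below are invented for illustration; only `pd.get_dummies` and the `[1:-1]` slice come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the non-numeric columns selected from `rd`.
category = pd.DataFrame({
    "colour": ["red", "green", "blue", "green", "red"],
    "shape": ["round", "square", "round", "round", "square"],
})

# Placeholder labels, one per row, shaped like the output of the binning step.
cat_labels = np.arange(len(category)).reshape(-1, 1)

sliced = pd.get_dummies(category[1:-1])  # first and last rows are dropped
fixed = pd.get_dummies(category)         # all rows are kept

print(len(sliced), len(fixed), len(cat_labels))  # 3 5 5
```

With the full dataframe the number of encoded rows matches the number of labels, so train_test_split should no longer report inconsistent sample counts.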