Categorical Data evaluation in Python with get_dummies

Question

I want to evaluate categorical data in Python with a decision tree. I want to use the categorical data as features and use binning to create categorical labels. Do I have to? The problem is that get_dummies returns a dataframe with a different length than the values that were given. It is two rows shorter than the original data. Previously I tried to use the label encoder, but couldn't get it to work. I then tried get_dummies from pandas, which seemed easier to me.

I checked the reference for the get_dummies function and searched for the problem but could not find why the length is shorter.

Doing the binning:

est = bine(n_bins=50, encode='ordinal', strategy='kmeans')
cat_labels = est.fit_transform(np.array(quant_labels).reshape(-1, 1))

Extract the categorical data (do I have to?):

category = rd.select_dtypes(exclude=['number']).astype("category")
category = category.replace(math.nan, "None")
category = category.replace(0, "None")

Prepare the split:

one_hot_features = pd.get_dummies(category[1:-1])
X_train, X_test, y_train, y_test = train_test_split(one_hot_features, cat_labels, test_size = 0.6, random_state = None)

The error is:

ValueError: Found input variables with inconsistent number of samples: [1458, 1460]

The correct number of samples is 1460. The one-hot encoded data is two samples short. Why is that?

Answer 1

When you encode your data you use category[1:-1]. This slice keeps only the elements from the second through the second-to-last, so the first and the last rows are dropped.

Explanation:

1) Indexes are zero-based, so 1 is the index of the second item.
2) An index of -1 refers to the last element, but the end of a slice is exclusive, so the slice stops at the second-to-last element.
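
The two points above can be checked with a plain Python list. This is only a minimal sketch: the list rows is a hypothetical stand-in for the 1460 samples mentioned in the question, not the asker's actual data.

```python
# Hypothetical stand-in for the 1460 samples in the question.
rows = list(range(1460))

# Start at index 1 (the second item) and stop before index -1 (the last item).
trimmed = rows[1:-1]

# A slice without bounds keeps every element.
full = rows[:]

# The trimmed slice is two elements shorter than the original.
print(len(rows), len(trimmed), len(full))

# First and last surviving values: the original first (0) and last (1459) are gone.
print(trimmed[0], trimmed[-1])
```

The difference of two matches the 1458 versus 1460 reported in the ValueError.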

Solution: Change your line to one_hot_features = pd.get_dummies(category[:]), or simply pd.get_dummies(category), so that every row is encoded.
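
The same effect can be reproduced on a small DataFrame. The sketch below uses made-up column names and values (street, alley, labels) purely for illustration; they are assumptions and do not come from the question's dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the `category` frame in the question.
category = pd.DataFrame({
    "street": ["Pave", "Grvl", "Pave", "Pave", "Grvl"],
    "alley": ["None", "Pave", "None", "Grvl", "None"],
})
labels = [0, 1, 0, 2, 1]  # hypothetical labels, one per row

# Slicing with [1:-1] drops the first and the last row before encoding.
sliced = pd.get_dummies(category[1:-1])

# Encoding the whole frame keeps one encoded row per original row.
fixed = pd.get_dummies(category)

# Only the unsliced version lines up with the labels.
print(len(sliced), len(fixed), len(labels))
```

Comparing len(one_hot_features) with len(cat_labels) before calling train_test_split is a cheap way to catch this kind of mismatch early.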

solution1
1 2019-04-12 19:42:17