
Categorical Data evaluation in Python with get_dummies

I want to evaluate categorical data in Python with a decision tree, using binning to create categorical labels. Do I have to? The problem is that get_dummies returns a DataFrame with a different length than the values that were given: it is two rows shorter than the original data. Previously I tried LabelEncoder, but didn't get it working. Then I tried get_dummies from pandas, which seemed easier to me.

I checked the reference for the get_dummies function and searched for the problem, but could not find why the result is shorter.

Doing the binning:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

est = KBinsDiscretizer(n_bins=50, encode='ordinal', strategy='kmeans')
cat_labels = est.fit_transform(np.array(quant_labels).reshape(-1, 1))
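The binning step can be sketched on synthetic data (the price-like values and the bin count below are made up; scikit-learn's KBinsDiscretizer matches the arguments used here):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic continuous target standing in for quant_labels (illustrative only)
rng = np.random.default_rng(0)
quant_labels = rng.normal(loc=200_000, scale=50_000, size=300)

# Bin the continuous values into 5 ordinal categories via k-means clustering
est = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')
cat_labels = est.fit_transform(quant_labels.reshape(-1, 1))

print(cat_labels.shape)  # one row per input sample: (300, 1)
```

Note that fit_transform keeps one output row per input row, so the length mismatch in the question cannot come from this step.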

Extract the categorical data (do I have to?):

import math

category = rd.select_dtypes(exclude=['number']).astype("category")
category = category.replace(math.nan, "None")
category = category.replace(0, "None")
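The column-selection step can be demonstrated on a small frame (the frame below is invented to stand in for the asker's rd; fillna is used instead of replace, so the "None" placeholder is inserted before the cast to the category dtype):

```python
import math
import pandas as pd

# Illustrative frame standing in for the asker's `rd`
rd = pd.DataFrame({
    "Street":  ["Pave", "Grvl", math.nan],
    "Alley":   [math.nan, "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

# Keep only the non-numeric columns
category = rd.select_dtypes(exclude=['number'])

# Fill missing values *before* casting, so "None" does not have to be
# added as a new category afterwards
category = category.fillna("None").astype("category")

print(category)  # LotArea is gone; the NaNs now read "None"
```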

Prepare the split:

one_hot_features = pd.get_dummies(category[1:-1])
X_train, X_test, y_train, y_test = train_test_split(one_hot_features, cat_labels, test_size = 0.6, random_state = None)

The error is:

ValueError: Found input variables with inconsistent number of samples: [1458, 1460]

The correct number of samples is 1460. The one-hot encoded data is two samples short. Why is that?

When you encode your data you use category[1:-1]. This slice takes only the rows from the second through the second-to-last, dropping the first and last rows — hence the two missing samples.

Explanation:

1) Indexes are zero-based, so 1 is the index of the second element; starting the slice at 1 skips the first row.
2) A slice stop of -1 is exclusive, so the slice ends at the second-to-last element; the last row is skipped.

Solution: change the line to one_hot_features = pd.get_dummies(category) (category[:] is equivalent).
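The off-by-two can be reproduced directly on a toy frame: slicing with [1:-1] drops the first and last rows, so the dummy matrix comes out two rows short, while passing the full frame keeps every row:

```python
import pandas as pd

# Toy categorical column (illustrative data)
category = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red"]})

trimmed = pd.get_dummies(category[1:-1])  # rows 1..3 only
full    = pd.get_dummies(category)        # all 5 rows

print(len(category), len(trimmed), len(full))  # 5 3 5
```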
