Categorical data evaluation in Python with get_dummies

I want to evaluate categorical data in Python with a decision tree. I want to use the categorical data and use binning to create categorical labels. Do I have to? The problem is that get_dummies returns a dataframe with a different length than the values that were given. It is two rows shorter than the original data. Previously I tried to use LabelEncoder, but could not get it to work. I then tried get_dummies from pandas, which seemed easier to me.

I checked the reference for the get_dummies function and searched for the problem, but could not find out why the result is shorter.

Doing the binning:

est = bine(n_bins=50, encode='ordinal', strategy='kmeans')
cat_labels = est.fit_transform(np.array(quant_labels).reshape(-1, 1))

Extract the categorical data (do I have to?):

category = rd.select_dtypes(exclude=['number']).astype("category")
category = category.replace(math.nan, "None")
category = category.replace(0, "None")

Prepare the split:

one_hot_features = pd.get_dummies(category[1:-1])
X_train, X_test, y_train, y_test = train_test_split(one_hot_features, cat_labels, test_size = 0.6, random_state = None)

The error is:

ValueError: Found input variables with inconsistent number of samples: [1458, 1460]

The correct number of samples is 1460. The one-hot encoded data is two samples short. Why is that?

When you encode your data you use category[1:-1]. This selects the elements from the second through the second-to-last, so the first and the last row are dropped before encoding.

Explanation:

1) Indexes are zero-based, so 1 is the index of the second item.
2) An index of -1 refers to the last element, and the end of a slice is exclusive, so the slice stops at the second-to-last element.
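The two points above can be checked with a plain Python list. This is only a small sketch; the list `rows` is a made-up stand-in and not the asker's data.

```python
# Hypothetical five-element list standing in for the rows of a dataset.
rows = ["a", "b", "c", "d", "e"]

inner = rows[1:-1]  # starts at index 1, stops before the last element
full = rows[:]      # copy containing every element

print(inner)                   # ['b', 'c', 'd']
print(len(rows) - len(inner))  # 2
```

The slice is two elements shorter than the original, which matches the gap between 1458 and 1460 in the error message.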

Solution: Change your line to one_hot_features = pd.get_dummies(category[:]) (or simply pd.get_dummies(category), which encodes the same rows).
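As a rough illustration of the fix, the sketch below compares the sliced and unsliced calls on a tiny made-up frame. The frame `rd`, its column names, and `labels` are assumptions chosen for the example; only `select_dtypes`, `fillna`, and `get_dummies` are real pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data: 6 rows, one non-numeric column.
rd = pd.DataFrame({
    "price": [10.0, 12.5, 9.0, 30.0, 28.0, 11.0],
    "colour": ["red", "blue", np.nan, "red", "green", "blue"],
})
labels = np.arange(len(rd))  # one label per row, as the binning step would give

# Keep the non-numeric columns and replace missing values with a placeholder.
category = rd.select_dtypes(exclude=["number"]).fillna("None")

sliced = pd.get_dummies(category[1:-1])  # drops the first and last row
fixed = pd.get_dummies(category)         # keeps every row

print(len(sliced), len(fixed), len(labels))  # 4 6 6
```

With the slice removed, the encoded features and the labels have the same number of rows, so a later call such as train_test_split would no longer see inconsistent sample counts.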
