Pandas: get_dummies vs categorical
I have a dataset which has a few columns with categorical data.
I've been using pd.Categorical to replace the categorical values with numerical ones.
data[column] = pd.Categorical(data[column]).codes
I recently ran across the pandas.get_dummies function. Are these interchangeable?
Is there an advantage to using one over the other?
Why are you converting the categorical data to integers? If the goal is to save memory, I don't believe you gain anything: a categorical column already stores its values as compact integer codes.
df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})
df2 = pd.DataFrame({'cat': [1, 1, 1, 2, 2, 3]})
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat 6 non-null category
dtypes: category(1)
memory usage: 78.0 bytes
>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat 6 non-null int64
dtypes: int64(1)
memory usage: 96.0 bytes
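The same comparison can be made on a larger column with Series.memory_usage. This is only a sketch: the 100,000 random labels and the variable names are made up for illustration, and the exact byte counts depend on the pandas version, so only the relative sizes are checked.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 labels drawn from three categories.
rng = np.random.default_rng(0)
labels = rng.choice(['a', 'b', 'c'], size=100_000)

# The same values stored as a categorical column and as plain int64 codes.
as_category = pd.Series(labels, dtype='category')
as_int64 = pd.Series(as_category.cat.codes, dtype='int64')

# deep=True also counts the memory held by the category labels themselves.
category_bytes = as_category.memory_usage(index=False, deep=True)
int64_bytes = as_int64.memory_usage(index=False, deep=True)

print(as_category.cat.codes.dtype)  # with three categories the codes fit in int8
print(category_bytes, int64_bytes)
```

With only a handful of distinct values, the categorical column should come out smaller than the int64 one, since each row needs one byte for its code rather than eight.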
The categorical codes are just integer values for the unique items in the given category. By contrast, get_dummies returns a new column for each unique item. The value in the column indicates whether or not the record has that attribute.
>>> pd.get_dummies(df, dtype=int)
cat_a cat_b cat_c
0 1 0 0
1 1 0 0
2 1 0 0
3 0 1 0
4 0 1 0
5 0 0 1
To get the codes directly, you can use:
df['codes'] = df['cat'].cat.codes
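To make the difference concrete, here is a minimal sketch that builds both encodings from the same column. It reuses the example frame from above; the variable names (codes, dummies, labels) are illustrative, and dtype=int is passed only so that the indicator columns hold 0/1 rather than booleans on newer pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})

# Integer codes: a single column with one integer per category.
codes = df['cat'].cat.codes

# Indicator columns: one 0/1 column per category.
dummies = pd.get_dummies(df['cat'], prefix='cat', dtype=int)

# Codes can be mapped back to the labels through the categories index.
labels = df['cat'].cat.categories.take(codes.to_numpy())

print(codes.tolist())         # [0, 0, 0, 1, 1, 2]
print(list(dummies.columns))  # ['cat_a', 'cat_b', 'cat_c']
print(labels.tolist())        # ['a', 'a', 'a', 'b', 'b', 'c']
```

So the two are not interchangeable. The codes keep everything in one column but impose an arbitrary numeric order (a < b < c), which some models will treat as meaningful. The indicator columns avoid that at the cost of one extra column per category.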