How to create dummy variables using pandas with reference to one value?
import pandas as pd

test = {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
test = pd.DataFrame(test)
dummy = pd.get_dummies(test['ngrp'], drop_first=True)
This gives me:
   Brooklyn  Manhattan  Queens  Staten Island
0         0          1       0              0
1         1          0       0              0
2         0          0       1              0
3         0          0       0              1
4         0          0       0              0
I get Bronx as my reference level (because that is the column that gets dropped). How do I change this so that Manhattan is the reference level instead? My expected output is:
   Brooklyn  Queens  Staten Island  Bronx
0         0       0              0      0
1         1       0              0      0
2         0       1              0      0
3         0       0              1      0
4         0       0              0      1
get_dummies sorts your values (lexicographically) and then creates the dummies. That's why you don't see "Bronx" in your initial result: it was the first value in sorted order, so it was the column that drop_first=True removed.
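To see the sorting for yourself, here is a minimal sketch that rebuilds the question's data and compares the columns with and without drop_first. The names all_dummies and dropped are only illustrative, and dtype=int is passed just so the output is 0/1 on any pandas version.

```python
import pandas as pd

test = pd.DataFrame(
    {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
)

# Without drop_first, every level gets a column, in sorted order.
all_dummies = pd.get_dummies(test['ngrp'], dtype=int)
print(list(all_dummies.columns))

# drop_first=True removes the first of those sorted columns.
dropped = pd.get_dummies(test['ngrp'], drop_first=True, dtype=int)
print(list(dropped.columns))
```

The first list starts with Bronx, and that is the only column missing from the second.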
To avoid the behavior you see, enforce the ordering on a "first-seen" basis (i.e., convert the column to an ordered categorical).
pd.get_dummies(
pd.Categorical(test['ngrp'], categories=test['ngrp'].unique(), ordered=True),
drop_first=True)
   Brooklyn  Queens  Staten Island  Bronx
0         0       0              0      0
1         1       0              0      0
2         0       1              0      0
3         0       0              1      0
4         0       0              0      1
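Note that the first-seen approach gives Manhattan as the reference only because Manhattan happens to be the first row. If that may not hold for your data, one possible variation (a sketch, not part of the original answer) is to build the category list yourself with the reference level first. The names reference and levels are hypothetical.

```python
import pandas as pd

test = pd.DataFrame(
    {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
)

reference = 'Manhattan'  # hypothetical name: the level to use as baseline

# Put the reference level first, then the rest in first-seen order.
levels = [reference] + [v for v in test['ngrp'].unique() if v != reference]

dummies = pd.get_dummies(
    pd.Categorical(test['ngrp'], categories=levels),
    drop_first=True,  # drops the first category, i.e. the reference
    dtype=int,
)
print(dummies)
```

Changing reference to another borough makes that borough the all-zeros row instead.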
Of course, this has the side effect of returning dummies whose column names are categorical rather than plain strings, but that's almost never an issue.
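If the categorical column names do get in the way (for example, when adding further columns later), a small sketch of one way to turn them back into ordinary labels follows. It assumes the columns come back as a categorical index, as described above. The name plain is only illustrative.

```python
import pandas as pd

test = pd.DataFrame(
    {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
)

dummies = pd.get_dummies(
    pd.Categorical(test['ngrp'], categories=test['ngrp'].unique(), ordered=True),
    drop_first=True,
)

# Copy the frame and replace the column index with plain string labels.
plain = dummies.copy()
plain.columns = plain.columns.astype(str)
print(type(plain.columns).__name__)
```

The values are unchanged; only the type of the column index differs.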