How to create dummy variables using pandas with reference to one value?

import pandas as pd

test = {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
test = pd.DataFrame(test)
dummy = pd.get_dummies(test['ngrp'], drop_first=True)

This gives me:

   Brooklyn  Manhattan  Queens  Staten Island
0         0          1       0              0
1         1          0       0              0
2         0          0       1              0
3         0          0       0              1
4         0          0       0              0

I get Bronx as my reference level (because that is the level that gets dropped). How do I specify that Manhattan should be my reference level instead? My expected output is:

   Brooklyn  Queens  Staten Island  Bronx
0         0       0              0      0
1         1       0              0      0
2         0       1              0      0
3         0       0              1      0
4         0       0              0      1

get_dummies sorts your values (lexicographically) and then creates dummies. That's why you don't see "Bronx" in your initial result: it was the first sorted value in your column, so drop_first=True dropped it.
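The sorting described above can be checked directly. This is a minimal sketch using the question's data; `dtype=int` is added only so the dummies print as 0/1 regardless of pandas version, and the variable names `levels`, `full`, and `dropped` are illustrative.

```python
import pandas as pd

test = pd.DataFrame(
    {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
)

# Levels in lexicographic order; the first one is what drop_first removes.
levels = sorted(test['ngrp'].unique())

# Dummies without and with drop_first, to compare the column sets.
full = pd.get_dummies(test['ngrp'], dtype=int)
dropped = pd.get_dummies(test['ngrp'], drop_first=True, dtype=int)

print(levels[0])              # Bronx
print(list(full.columns))     # all five levels, sorted
print(list(dropped.columns))  # the same list without the first level
```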

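To avoid the behavior you see, enforce the ordering to be on a "first-seen" basis (i.e., convert the column to an ordered categorical whose categories are listed in order of appearance). Manhattan is the first value in the column, so it becomes the first category and is the one that drop_first=True removes.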

pd.get_dummies(
    pd.Categorical(test['ngrp'], categories=test['ngrp'].unique(), ordered=True),
    drop_first=True)

   Brooklyn  Queens  Staten Island  Bronx
0         0       0              0      0
1         1       0              0      0
2         0       1              0      0
3         0       0              1      0
4         0       0              0      1

Of course, this has the side effect of returning dummies with categorical column names (the columns form a CategoricalIndex), but that's almost never an issue.
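The approach above relies on Manhattan happening to be the first value in the column. A possible generalization, sketched under that observation, is to put the desired reference level first in the category list explicitly. The helper name `dummies_with_reference` and its parameters are hypothetical, not part of pandas; it also converts the resulting column labels back to plain values, and uses `dtype=int` so the output is 0/1.

```python
import pandas as pd


def dummies_with_reference(values: pd.Series, reference) -> pd.DataFrame:
    """Hypothetical helper: dummy-encode `values`, omitting `reference`."""
    seen = list(values.dropna().unique())  # levels in first-seen order
    if reference not in seen:
        raise ValueError(f"{reference!r} does not occur in the data")
    # Put the reference level first so that drop_first removes it.
    levels = [reference] + [v for v in seen if v != reference]
    as_cat = values.astype(pd.CategoricalDtype(categories=levels))
    dummies = pd.get_dummies(as_cat, drop_first=True, dtype=int)
    # Replace the CategoricalIndex with ordinary column labels.
    dummies.columns = list(dummies.columns)
    return dummies


test = pd.DataFrame(
    {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
)
by_manhattan = dummies_with_reference(test['ngrp'], 'Manhattan')
by_queens = dummies_with_reference(test['ngrp'], 'Queens')
print(by_manhattan)
print(by_queens)
```

With `'Manhattan'` as the reference, the columns should match the expected output in the question; with `'Queens'`, the Queens row is all zeros instead.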
