
How can one create dummy variables in pandas that do not have multicollinearity?

Using Anaconda, Python 2.7.11, pandas 0.17.1, and Mac OS X 10.11 (El Capitan), how do you drop one dummy variable from each column you are encoding, so as to avoid multicollinearity (the dummy variable trap) when fitting a statistical model?

If one enters:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)

The result is:

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0

I want to drop one of the a, b, or c columns to avoid multicollinearity.
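To make the problem concrete, here is a minimal sketch (not part of the original post, and assuming NumPy is installed alongside pandas). It checks that the three indicator columns always sum to 1, so a design matrix that also contains an intercept is rank deficient. It then drops one column by hand with `iloc`, which is one possible workaround on a pandas release that has no built-in option for this.

```python
import numpy as np
import pandas as pd

s = pd.Series(list('abca'))
# Cast to int so the arithmetic below works regardless of the default dummy dtype.
dummies = pd.get_dummies(s).astype(int)

# Every row contains exactly one 1, so a + b + c equals the intercept column.
assert (dummies.sum(axis=1) == 1).all()

# Design matrix with an intercept: 4 columns but only rank 3 (the dummy variable trap).
intercept = np.ones((len(dummies), 1))
full = np.hstack([intercept, dummies.values])
print(full.shape[1], np.linalg.matrix_rank(full))        # 4 3

# Manual workaround: drop the first indicator column by position.
reduced_dummies = dummies.iloc[:, 1:]
reduced = np.hstack([intercept, reduced_dummies.values])
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 3
```

After dropping `a`, the matrix has full column rank, and `a` becomes the baseline category that the remaining coefficients are measured against.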

This functionality will be added in pandas 0.18.0 (the current release is 0.17.1). If you would like it sooner, you will have to build the pandas library from source. The following instructions show how to do this.

First, in a terminal, uninstall pandas by typing:

conda uninstall pandas

Then, navigate to site-packages, where Python stores its libraries:

cd /Users/[username]/anaconda/lib/python2.7/site-packages

where [username] is your username. More generally, the root of this path is wherever your currently activated Python environment is located, which need not be Anaconda. To reveal where your activated version of Python is located, type:

which python
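As an alternative to reading the path off `which python`, the same information can be asked of the interpreter itself. This is a small sketch rather than part of the original answer; it relies only on the standard-library `sys` and `sysconfig` modules, and the printed paths will differ from machine to machine.

```python
import sys
import sysconfig

# Path of the interpreter that is currently running (comparable to `which python`).
print(sys.executable)

# Directory where pure-Python third-party packages are installed (site-packages).
site_packages = sysconfig.get_paths()["purelib"]
print(site_packages)
```

Running this inside the activated environment gives the directory to `cd` into, without having to guess the Anaconda layout.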

Enter these commands in the terminal to clone a repo in which someone has added the extra functionality to pandas.get_dummies, then build and install it:

git clone https://github.com/BranYang/pandas
cd pandas
python setup.py build_ext --inplace --force
python setup.py install

Then, open Python (or IPython):

ipython

and enter:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s, drop_first=True)

This will be displayed:

   b  c
0  0  0
1  1  0
2  0  1
3  0  0

Thus, pd.get_dummies has dropped the first column (a), and you have avoided the dummy variable trap!
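The question asks about dropping one dummy from each column being encoded. The sketch below illustrates that case on a DataFrame, assuming a pandas build whose `get_dummies` accepts `drop_first` (0.18.0 or later, or the patched build above). The column names and values are made-up example data, not from the original post.

```python
import pandas as pd

# Hypothetical example data: two categorical columns and one numeric column.
df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red'],
    'size': ['S', 'M', 'S', 'L'],
    'price': [1.0, 2.0, 3.0, 4.0],
})

# Encode both categorical columns, dropping the first level of each.
encoded = pd.get_dummies(df, columns=['color', 'size'], drop_first=True)

# 'color_blue' and 'size_L' (the first levels in sorted order) are omitted.
print(list(encoded.columns))
```

Each encoded column loses exactly one level, so the baseline is `blue` for color and `L` for size, while `price` passes through unchanged.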
