One-hot encoding a data frame in R
Consider a data frame df1 similar to the one shown:
ID  EDUCATION  OCCUPATION      BINARY_VAR
1   Undergrad  Student         1
2   Grad       Business Owner  1
3   Undergrad  Unemployed      0
4   PhD        Other           1
You may create your own random df1 using the R code below:
ID <- 1:4
EDUCATION <- sample(c('Undergrad', 'Grad', 'PhD'), 4, replace = TRUE)
OCCUPATION <- sample(c('Student', 'Business Owner', 'Unemployed', 'Other'), 4, replace = FALSE)
BINARY_VAR <- sample(c(0, 1), 4, replace = TRUE)
df1 <- data.frame(ID, EDUCATION, OCCUPATION, BINARY_VAR)
# Convert every column to a factor
df1[] <- lapply(df1, factor)
From this, I need to derive another data frame df2 that would look like this:
ID  Undergrad  Grad  PhD  Student  Business Owner  Unemployed  Other  BINARY_VAR
1   1          0     0    1        0               0           0      1
2   1          1     0    0        1               0           0      1
3   1          0     0    0        0               1           0      0
4   1          1     1    0        0               0           1      1
You must have noticed that for level PhD, the other factor levels under EDUCATION also hold true, since EDUCATION records the highest education level for that ID. That, however, is the secondary objective.
I can't seem to figure out a way to obtain a data frame in which each column gives the truth value for an individual factor level in its parent data frame. Is there a package in R that could help? Or maybe a way to code this?
Can I do this using melt?
I read through previously asked questions that looked similar, but they deal with frequencies of occurrence.
Edit:
As recommended by Sumedh, one way to do this is using dummyVars from caret.
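library(caret)
# Build dummy (one-hot) columns for every predictor; ID is only the left-hand side of the formula
dummies <- dummyVars(ID ~ ., data = df1)
df2 <- data.frame(predict(dummies, newdata = df1))
# Keep the first 7 columns (the EDUCATION and OCCUPATION dummies for the df1 shown above)
df2 <- df2[1:7]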
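If installing caret is not an option, base R's model.matrix() might be enough for the plain one-hot part. The following is only a sketch under a few assumptions: it hard-codes the df1 shown at the top of the question, the names edu, occ and one_hot are made up for illustration, and it does not address the secondary objective (PhD implying the lower levels). Each factor is expanded in its own call because, with several factors in one formula, model.matrix() gives full dummy columns only to the first one.

```r
# Fixed copy of the example data (values taken from the df1 shown in the question)
df1 <- data.frame(
  ID = 1:4,
  EDUCATION = factor(c("Undergrad", "Grad", "Undergrad", "PhD"),
                     levels = c("Undergrad", "Grad", "PhD")),
  OCCUPATION = factor(c("Student", "Business Owner", "Unemployed", "Other"),
                      levels = c("Student", "Business Owner", "Unemployed", "Other")),
  BINARY_VAR = c(1, 1, 0, 1)
)

# "- 1" drops the intercept so every level of the factor gets its own 0/1 column
edu <- model.matrix(~ EDUCATION - 1, data = df1)
occ <- model.matrix(~ OCCUPATION - 1, data = df1)

# Strip the factor-name prefix that model.matrix() puts in front of each level
colnames(edu) <- sub("^EDUCATION", "", colnames(edu))
colnames(occ) <- sub("^OCCUPATION", "", colnames(occ))

# check.names = FALSE keeps the space in "Business Owner"
one_hot <- data.frame(ID = df1$ID, edu, occ, BINARY_VAR = df1$BINARY_VAR,
                      check.names = FALSE)
one_hot
```

The column order follows the factor levels given in the data.frame() call, so changing the levels argument changes the layout of one_hot.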
tidyr and dplyr combined with the base table() function should work:
library(dplyr)
library(tidyr)

ID <- 1:4
EDUCATION <- c('Undergrad', 'Grad', 'PhD', 'Undergrad')
OCCUPATION <- c('Student', 'Business Owner', 'Unemployed', 'Other')
BINARY_VAR <- sample(c(0, 1), 4, replace = TRUE)
df1 <- data.frame(ID, EDUCATION, OCCUPATION, BINARY_VAR)
# Convert every column to a factor
df1[] <- lapply(df1, factor)

# Cross-tabulate ID against EDUCATION, then spread the levels into 0/1 columns
edu <- as.data.frame(table(df1[, 1:2])) %>% spread(EDUCATION, Freq)
# A PhD implies the lower education levels as well
for (i in 1:nrow(edu)) {
  if (edu$PhD[i] == 1) {
    edu$Undergrad[i] <- edu$Grad[i] <- 1
  }
}
# Do the same for OCCUPATION and join the two tables on ID
occ <- as.data.frame(table(df1[, c(1, 3)])) %>% spread(OCCUPATION, Freq)
truth_table <- merge(edu, occ, by = "ID")
truth_table$BINARY_VAR <- df1$BINARY_VAR
truth_table
ID  Grad  PhD  Undergrad  Business Owner  Other  Student  Unemployed  BINARY_VAR
1   0     0    1          0               0      1        0           1
2   1     0    0          1               0      0        0           1
3   1     1    1          0               0      0        1           0
4   0     0    1          0               1      0        0           1
Edit: added a for loop to update the education levels beneath PhD, inspired by @Sumedh's earlier suggestion.
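The loop above only fills in the lower levels for PhD rows, so a Grad row still shows 0 under Undergrad, unlike the df2 requested in the question. One possible loop-free variant is sketched below, assuming the levels are ordered Undergrad < Grad < PhD: each row's level is ranked and compared against the rank of every level. The names edu_levels, edu_rank and occ_levels are illustrative, and BINARY_VAR is fixed here (an assumption) so that the result is reproducible.

```r
# Same EDUCATION and OCCUPATION values as in the answer above;
# BINARY_VAR is fixed instead of sampled (assumption, for reproducibility)
df1 <- data.frame(
  ID = 1:4,
  EDUCATION = c("Undergrad", "Grad", "PhD", "Undergrad"),
  OCCUPATION = c("Student", "Business Owner", "Unemployed", "Other"),
  BINARY_VAR = c(1, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Assumed ordering of the education levels, lowest to highest
edu_levels <- c("Undergrad", "Grad", "PhD")
edu_rank <- match(df1$EDUCATION, edu_levels)

# A level is set to 1 for a row when the row's own rank is at least that level's rank
edu <- sapply(seq_along(edu_levels), function(k) as.integer(edu_rank >= k))
colnames(edu) <- edu_levels

# Plain one-hot columns for OCCUPATION: one equality test per distinct value
occ_levels <- unique(df1$OCCUPATION)
occ <- sapply(occ_levels, function(lv) as.integer(df1$OCCUPATION == lv))
colnames(occ) <- occ_levels

# check.names = FALSE keeps the space in "Business Owner"
truth_table <- data.frame(ID = df1$ID, edu, occ, BINARY_VAR = df1$BINARY_VAR,
                          check.names = FALSE)
truth_table
```

Because the comparison is driven by edu_levels, adding another level (say, a hypothetical 'Postdoc' at the end) would only require extending that vector.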