One-hot encoding a data frame in R

Question

Consider a data frame df1 similar to the one shown below:

ID EDUCATION   OCCUPATION      BINARY_VAR
1  Undergrad   Student              1
2  Grad        Business Owner       1
3  Undergrad   Unemployed           0
4  PhD         Other                1

You can create your own random df1 using the R code below:

ID <- 1:4
EDUCATION <- sample(c('Undergrad', 'Grad', 'PhD'), 4, replace = TRUE)
OCCUPATION <- sample(c('Student', 'Business Owner', 'Unemployed', 'Other'), 4, replace = FALSE)
BINARY_VAR <- sample(c(0, 1), 4, replace = TRUE)
df1 <- data.frame(ID, EDUCATION, OCCUPATION, BINARY_VAR)

# Convert to factor
df1[, names(df1)] <- lapply(df1[, names(df1)] , factor)

From this, I need to derive another data frame df2 that looks like this:

ID Undergrad Grad PhD Student Business Owner Unemployed Other BINARY_VAR
1      1      0    0     1           0           0        0       1
2      1      1    0     0           1           0        0       1
3      1      0    0     0           0           1        0       0
4      1      1    1     0           0           0        1       1

Notice that for the level PhD, the lower levels under EDUCATION are also set to 1, since EDUCATION records the highest education level for that ID. That, however, is a secondary objective.

I can't figure out how to obtain a data frame in which each column gives the truth value for an individual factor level of the parent data frame. Is there a package in R that could help? Or maybe a way to code this?

Can I do this using melt?

I read through previously asked questions that looked similar, but they deal with frequencies of occurrence.

Edit:

As recommended by Sumedh, one way to do this is to use dummyVars from caret.

library(caret)

dummies <- dummyVars(ID ~ ., data = df1)
df2 <- data.frame(predict(dummies, newdata = df1))
df2 <- df2[1:7]
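
For comparison with dummyVars, here is a minimal base R sketch that builds the indicator columns with model.matrix. The data is a hypothetical copy of the example table in the question, and the explicit level order and the step that renames the columns are choices made for this illustration, not part of the original post.

```r
# Hypothetical example data mirroring the table in the question
df1 <- data.frame(
  ID = 1:4,
  EDUCATION = factor(c("Undergrad", "Grad", "Undergrad", "PhD"),
                     levels = c("Undergrad", "Grad", "PhD")),
  OCCUPATION = factor(c("Student", "Business Owner", "Unemployed", "Other"),
                      levels = c("Student", "Business Owner", "Unemployed", "Other")),
  BINARY_VAR = c(1, 1, 0, 1)
)

# "~ x - 1" drops the intercept, so every level of x gets its own 0/1 column
edu <- model.matrix(~ EDUCATION - 1, data = df1)
occ <- model.matrix(~ OCCUPATION - 1, data = df1)

# Replace the prefixed names (e.g. "EDUCATIONGrad") with the bare level names
colnames(edu) <- levels(df1$EDUCATION)
colnames(occ) <- levels(df1$OCCUPATION)

# check.names = FALSE keeps the space in "Business Owner"
df2 <- data.frame(ID = df1$ID, edu, occ, BINARY_VAR = df1$BINARY_VAR,
                  check.names = FALSE)
df2
```

Each factor is encoded in its own model.matrix call because, in a combined formula such as ~ EDUCATION + OCCUPATION - 1, only the first factor keeps all of its levels; the second would lose one to the default treatment contrasts. This sketch gives a plain one-hot encoding and does not set the lower education levels to 1.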

Answer 1

tidyr and dplyr combined with the base table() function should work:

ID <- c(1:4)
EDUCATION <- c('Undergrad', 'Grad', 'PhD', 'Undergrad')
OCCUPATION <- c('Student', 'Business Owner', 'Unemployed', 'Other')
BINARY_VAR <- sample(c(0, 1), 4, replace = TRUE)
df1 <- data.frame(ID, EDUCATION, OCCUPATION, BINARY_VAR)

# Convert to factor
df1[, names(df1)] <- lapply(df1[, names(df1)] , factor)

library(dplyr)
library(tidyr)

edu<-as.data.frame(table(df1[,1:2])) %>% spread(EDUCATION, Freq)

for(i in 1:nrow(edu))
  if(edu[i,]$PhD == 1) 
    edu[i,]$Undergrad <-edu[i,]$Grad <-1

truth_table<-merge(edu,
      as.data.frame(table(df1[,c(1,3)])) %>% spread(OCCUPATION, Freq),
      by = "ID")

truth_table$BINARY_VAR<-df1$BINARY_VAR
truth_table

ID Grad PhD Undergrad Business Owner Other Student Unemployed BINARY_VAR
1    0   0         1              0     0       1          0          1
2    1   0         0              1     0       0          0          1
3    1   1         1              0     0       0          1          0
4    0   0         1              0     1       0          0          1

Edit: added a for loop to update the education levels beneath PhD, inspired by @Sumedh's earlier suggestion.
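
The loop only propagates PhD downward: in the output above, ID 2 (Grad) still has Undergrad = 0, whereas the df2 shown in the question sets it to 1. A minimal sketch of a cumulative encoding that handles every level, assuming the ordering Undergrad < Grad < PhD (implied by the question but never declared as an ordered factor). The names edu_levels, edu_rank and edu_cum are hypothetical and used only for this illustration.

```r
# Hypothetical data matching the EDUCATION column used in this answer
EDUCATION <- c("Undergrad", "Grad", "PhD", "Undergrad")

# Assumed ordering of the levels, lowest to highest
edu_levels <- c("Undergrad", "Grad", "PhD")

# Position of each person's highest level within that ordering
edu_rank <- match(EDUCATION, edu_levels)

# Entry [i, j] is 1 when person i's highest level is at or above level j
edu_cum <- outer(edu_rank, seq_along(edu_levels), ">=") * 1L
colnames(edu_cum) <- edu_levels
edu_cum
```

The resulting matrix could then stand in for the EDUCATION columns of truth_table, for example via cbind, provided the rows are in the same ID order.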
