Split each Row in a column with Multiple Unique words into Multiple Columns

I want to write a function that looks down a column, finds the unique words within each row, and creates dummy columns from them. For example:

ID Letters
1  A, B, C
2  C, D
3  A
4  B, D
5  Z
6  A

The expected result would be:

ID Letters  Letter_A Letter_B Letter_C Letter_D Letter_Z
1  A, B, C   1          1        1        0       0
2  C, D      0          0        1        1       0
3  A         1          0        0        0       0
4  B, D      0          1        0        1       0
5  Z         0          0        0        0       1
6  A         1          0        0        0       0

I found this code:

uniq <- unique(unlist(strsplit(as.character(df$values),', ')))
m <- matrix(0, nrow(df), length(uniq), dimnames = list(NULL, paste0("Letter_", uniq)))

for (i in seq_along(df$values)) {
  k <- match(df$values[i], uniq, 0)
  m[i,k] <- 1
}

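Here uniq holds every unique comma-separated word, and a new column (Letter_A and so on) is created for each one. However, the for loop only checks the first letter in the column, so the current result looks like this, with the other letters never set to 1: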

ID Letters  Letter_A Letter_B Letter_C Letter_D Letter_Z
1  A, B, C   1          0        0        0       0
2  C, D      0          0        1        0       0
3  A         1          0        0        0       0
4  B, D      0          1        0        0       0
5  Z         0          0        0        0       1
6  A         1          0        0        0       0

Here is one approach:

DF = data.frame(ID = seq_len(6L),
                Letters = c("A, B, C", "C, D", "A", "B, D", "Z", "A"))

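# Split each row once and keep the per-row pieces as a list
spl_letters = strsplit(as.character(DF[["Letters"]]), ", ", fixed = TRUE)
# use.names belongs to unlist(), not unique()
uniq = unique(unlist(spl_letters, use.names = FALSE))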

data.frame(DF,
           setNames(data.frame(t(vapply(spl_letters, function(x) +(uniq %in% x), seq_along(uniq)))), paste0("Letter_", uniq))
)

  ID Letters Letter_A Letter_B Letter_C Letter_D Letter_Z
1  1 A, B, C        1        1        1        0        0
2  2    C, D        0        0        1        1        0
3  3       A        1        0        0        0        0
4  4    B, D        0        1        0        1        0
5  5       Z        0        0        0        0        1
6  6       A        1        0        0        0        0

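Basically, the for loop is replaced by vapply, and instead of working from the unlisted vector, the original strsplit result is kept so that each row's pieces can be matched against uniq.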
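If you would rather keep the for loop from the question, the same idea can be sketched as below. This is only a minimal sketch, assuming the column is called Letters as in the example data (the question's snippet refers to df$values); the only real change is that match() is given the split pieces of each row instead of the raw string.

```r
# Example data, same as in the answer above
DF <- data.frame(ID = seq_len(6L),
                 Letters = c("A, B, C", "C, D", "A", "B, D", "Z", "A"),
                 stringsAsFactors = FALSE)

# Split once and keep the per-row pieces as a list
spl_letters <- strsplit(DF$Letters, ", ", fixed = TRUE)
uniq <- unique(unlist(spl_letters, use.names = FALSE))

# One row per ID, one column per unique letter, initialised to 0
m <- matrix(0L, nrow(DF), length(uniq),
            dimnames = list(NULL, paste0("Letter_", uniq)))

for (i in seq_along(spl_letters)) {
  # match() sees the split pieces of row i, not the whole "A, B, C" string
  k <- match(spl_letters[[i]], uniq, nomatch = 0L)
  m[i, k] <- 1L
}

result <- cbind(DF, m)
```

Matching the unsplit string against uniq can only succeed for rows that hold a single letter, which is why the multi-letter rows in the question were not filled in.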

Code:

library(data.table)
setDT(df)
dcast(data = df[, strsplit(Letters, split = ","), by = .(ID, Letters)][, V1 := trimws(V1)][],
      formula = ID + Letters ~ V1, 
      fun.aggregate = length, 
      value.var = "V1")
#    ID Letters A B C D Z
# 1:  1 A, B, C 1 1 1 0 0
# 2:  2    C, D 0 0 1 1 0
# 3:  3       A 1 0 0 0 0
# 4:  4    B, D 0 1 0 1 0
# 5:  5       Z 0 0 0 0 1
# 6:  6       A 1 0 0 0 0

Data:

df <- read.table(text='ID Letters
1  "A, B, C"
                 2  "C, D"
                 3  "A"
                 4  "B, D"
                 5  "Z"
                 6  "A"', header = TRUE, stringsAsFactors = FALSE)

You can use mtabulate from the qdapTools library:

library(qdapTools)
library(dplyr)

x <- "
ID Letters
1  'A, B, C'
2  'C, D'
3  A
4  'B, D'
5  Z
6  A
"

df <- read.table(text = x, header = TRUE, stringsAsFactors = FALSE)

encoded_df <- cbind(df, mtabulate(strsplit(df$Letters, ", "))) %>% 
              rename_at(vars(!colnames(df)), ~paste0("Letter_", .))

This applies one-hot encoding to the letters, then adds the Letter_ prefix to all of the newly created columns.

An option using stats::xtabs, with DF taken from Cole's solution:

l <- strsplit(DF$Letters, ", ") 
tab <- data.frame(ID=rep(seq_along(l), lengths(l)), Letters=unlist(l), V=1L)
cbind(DF, as.data.frame.matrix(xtabs(V ~ ID + Letters, tab)))

output:

  ID Letters A B C D Z
1  1 A, B, C 1 1 1 0 0
2  2    C, D 0 0 1 1 0
3  3       A 1 0 0 0 0
4  4    B, D 0 1 0 1 0
5  5       Z 0 0 0 0 1
6  6       A 1 0 0 0 0
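The columns here come out as plain A, B, C, D, Z rather than the Letter_A style asked for in the question. One possible way to add the prefix is to rename the wide table before binding it back on. This is a sketch only; the names wide and prefixed are made up for illustration.

```r
DF <- data.frame(ID = seq_len(6L),
                 Letters = c("A, B, C", "C, D", "A", "B, D", "Z", "A"),
                 stringsAsFactors = FALSE)

# Long format: one row per (row index, letter) pair
l <- strsplit(DF$Letters, ", ", fixed = TRUE)
tab <- data.frame(ID = rep(seq_along(l), lengths(l)),
                  Letters = unlist(l),
                  V = 1L)

# Cross-tabulate, then prefix the letter columns before binding to DF
wide <- as.data.frame.matrix(xtabs(V ~ ID + Letters, tab))
names(wide) <- paste0("Letter_", names(wide))
prefixed <- cbind(DF, wide)
```

The same paste0 renaming should work on the dcast result above as well, as long as it is applied only to the letter columns.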
