Split each Row in a column with Multiple Unique words into Multiple Columns
I want to write a function that checks each row of a column for its unique words and creates dummy columns based on them. For example:
ID Letters
1 A, B, C
2 C, D
3 A
4 B, D
5 Z
6 A
The expected result would be:
ID Letters Letter_A Letter_B Letter_C Letter_D Letter_Z
1 A, B, C 1 1 1 0 0
2 C, D 0 0 1 1 0
3 A 1 0 0 0 0
4 B, D 0 1 0 1 0
5 Z 0 0 0 0 1
6 A 1 0 0 0 0
I found this code:
uniq <- unique(unlist(strsplit(as.character(df$values),', ')))
m <- matrix(0, nrow(df), length(uniq), dimnames = list(NULL, paste0("Letter_", uniq)))
for (i in seq_along(df$values)) {
k <- match(df$values[i], uniq, 0)
m[i,k] <- 1
}
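Here uniq holds every unique comma-separated word, and a new column (Letter_A and so on) is created for each one. However, the for loop only checks the first letter of each entry in that column, so the current result looks like this, with the other letters never set to 1: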
ID Letters Letter_A Letter_B Letter_C Letter_D Letter_Z
1 A, B, C 1 0 0 0 0
2 C, D 0 0 1 0 0
3 A 1 0 0 0 0
4 B, D 0 1 0 0 0
5 Z 0 0 0 0 1
6 A 1 0 0 0 0
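In the loop as posted, match() receives the whole entry of row i rather than its individual words, so it can never mark several columns for one row. A minimal sketch of a fix, assuming the column is named Letters as in the example table (the posted code calls it df$values) and that the separator is always a comma followed by a space:

```r
df <- data.frame(ID = 1:6,
                 Letters = c("A, B, C", "C, D", "A", "B, D", "Z", "A"),
                 stringsAsFactors = FALSE)

# Split every row once; spl is a list with one character vector per row
spl <- strsplit(as.character(df$Letters), ", ", fixed = TRUE)
uniq <- unique(unlist(spl))

m <- matrix(0L, nrow(df), length(uniq),
            dimnames = list(NULL, paste0("Letter_", uniq)))
for (i in seq_along(spl)) {
  # match() now returns the column positions of all words in row i
  k <- match(spl[[i]], uniq)
  m[i, k] <- 1L
}

# Attach the indicator columns to the original data
result <- cbind(df, m)
```

The only changes from the posted loop are that the split result is kept as a list and that match() is applied to the pieces of each row.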
Here is one approach:
DF = data.frame(ID = seq_len(6L),
Letters = c("A, B, C", "C, D", "A", "B, D", "Z", "A"))
spl_letters = strsplit(as.character(DF[["Letters"]]), ", ", fixed = TRUE)
uniq = unique(unlist(spl_letters), use.names = FALSE)
data.frame(DF,
setNames(data.frame(t(vapply(spl_letters, function(x) +(uniq %in% x), seq_along(uniq)))), paste0("Letter_", uniq))
)
ID Letters Letter_A Letter_B Letter_C Letter_D Letter_Z
1 1 A, B, C 1 1 1 0 0
2 2 C, D 0 0 1 1 0
3 3 A 1 0 0 0 0
4 4 B, D 0 1 0 1 0
5 5 Z 0 0 0 0 1
6 6 A 1 0 0 0 0
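Basically, the for loop is replaced with vapply, and instead of only keeping the unlist-ed values, the original strsplit result is kept so that each row can be matched against uniq.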
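To see what each vapply call contributes, the per-row step can be run on its own. This is only an illustration; row_flags is a hypothetical helper name that does not appear in the answer, and uniq is typed in by hand here:

```r
uniq <- c("A", "B", "C", "D", "Z")

# uniq %in% x is a logical vector as long as uniq;
# the unary + converts TRUE/FALSE to 1/0 integers
row_flags <- function(x) +(uniq %in% x)

# Flags for a row whose split entry is c("B", "D")
flags <- row_flags(c("B", "D"))
```

vapply collects one such vector per row as the columns of a matrix, which is why the answer transposes the result with t() before turning it into a data frame.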
Code:
library(data.table)
setDT(df)
dcast(data = df[, strsplit(Letters, split = ","), by = .(ID, Letters)][, V1 := trimws(V1)][],
formula = ID + Letters ~ V1,
fun.aggregate = length,
value.var = "V1")
# ID Letters A B C D Z
# 1: 1 A, B, C 1 1 1 0 0
# 2: 2 C, D 0 0 1 1 0
# 3: 3 A 1 0 0 0 0
# 4: 4 B, D 0 1 0 1 0
# 5: 5 Z 0 0 0 0 1
# 6: 6 A 1 0 0 0 0
Data:
df <- read.table(text='ID Letters
1 "A, B, C"
2 "C, D"
3 "A"
4 "B, D"
5 "Z"
6 "A"', header = TRUE, stringsAsFactors = FALSE)
You can use mtabulate from the qdapTools package.
library(qdapTools)
library(dplyr)
x <- "
ID Letters
1 'A, B, C'
2 'C, D'
3 A
4 'B, D'
5 Z
6 A
"
df <- read.table(text = x, header = TRUE, stringsAsFactors = FALSE)
encoded_df <- cbind(df, mtabulate(strsplit(df$Letters, ", "))) %>%
rename_at(vars(!colnames(df)), ~paste0("Letter_", .))
This applies one-hot encoding to the letters and then adds the Letter_ prefix to all of the newly created columns.
An option using stats::xtabs, with DF taken from Cole's solution:
l <- strsplit(DF$Letters, ", ")
tab <- data.frame(ID=rep(seq_along(l), lengths(l)), Letters=unlist(l), V=1L)
cbind(DF, as.data.frame.matrix(xtabs(V ~ ID + Letters, tab)))
output:
ID Letters A B C D Z
1 1 A, B, C 1 1 1 0 0
2 2 C, D 0 0 1 1 0
3 3 A 1 0 0 0 0
4 4 B, D 0 1 0 1 0
5 5 Z 0 0 0 0 1
6 6 A 1 0 0 0 0
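The xtabs result names its columns A, B, ... rather than Letter_A, Letter_B, ... as in the expected output. A sketch of one way to add the prefix, assuming every row contains at least one letter so that the table has one row per row of DF (the names wide and out are introduced here only for illustration):

```r
DF <- data.frame(ID = 1:6,
                 Letters = c("A, B, C", "C, D", "A", "B, D", "Z", "A"),
                 stringsAsFactors = FALSE)

l <- strsplit(DF$Letters, ", ", fixed = TRUE)

# Long format: one line per (row position, letter) pair
tab <- data.frame(ID = rep(seq_along(l), lengths(l)),
                  Letters = unlist(l),
                  V = 1L)

# Cross-tabulate, then rename the letter columns before binding
wide <- as.data.frame.matrix(xtabs(V ~ ID + Letters, tab))
names(wide) <- paste0("Letter_", names(wide))
out <- cbind(DF, wide)
```

Because the prefix is applied to wide before cbind, the original ID and Letters columns keep their names.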