How to replace certain data frame values with their unknown column names?
I have a large data frame with unknown column names and numeric values 1, 2, 3, or 4. I want to replace every 4 with the name of the column it sits in, and every 1, 2 and 3 with an empty value.
Of course I can write a loop of some kind, like this:
df <- data.frame(id=1:8,unknownvarname1=c(1:4,1:4),unknownvarname2=c(4:1,4:1))
for (i in 2:length(df)){
df[,i] <- as.character(df[,i])
df[,i] <- mgsub::mgsub(df[,i],c(1,2,3,4),c("","","",names(df)[i]))
}
This would be the result:
  id unknownvarname1 unknownvarname2
1  1                 unknownvarname2
2  2
3  3
4  4 unknownvarname1
5  5                 unknownvarname2
6  6
7  7
8  8 unknownvarname1
For a data frame this size that's no problem at all. But when I try this loop on large data frames with up to 30k rows and up to 40 unknown variables, the loop takes ages to complete.
Does anyone know of a faster way to do this? I tried functions like mutate() from the dplyr package, but I could not get them to work.
Many thanks in advance!
One way using base R:
#Replace all the values with 1:3 with blank
df[-1][sapply(df[-1], `%in%`, 1:3)] <- ""
#Get the row/column indices where value is 4
mat <- which(df == 4, arr.ind = TRUE)
#Exclude values from first column
mat <- mat[mat[, 2] != 1, ]
#Replace remaining entries with their corresponding column names
df[mat] <- names(df)[mat[, 2]]
df
#  id unknownvarname1 unknownvarname2
#1  1                 unknownvarname2
#2  2
#3  3
#4  4 unknownvarname1
#5  5                 unknownvarname2
#6  6
#7  7
#8  8 unknownvarname1
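To make the indexing step in the answer above easier to follow, here is a minimal sketch of what `which(..., arr.ind = TRUE)` returns and how the resulting two-column matrix can be used as an index. The objects `m`, `pos` and `labels` are made up for illustration and are not part of the original answer.

```r
# Small logical matrix standing in for the result of `df == 4`
m <- matrix(c(FALSE, TRUE, FALSE,
              TRUE, FALSE, FALSE),
            nrow = 3,
            dimnames = list(NULL, c("a", "b")))

# One row per TRUE cell, with columns named "row" and "col"
pos <- which(m, arr.ind = TRUE)

# A two-column index matrix addresses individual cells,
# so each cell can receive the name of its own column
labels <- matrix("", nrow = 3, ncol = 2)
labels[pos] <- colnames(m)[pos[, "col"]]
labels
```

The same idea carries over to the data frame case: the second column of the index matrix says which column name belongs in each cell.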
Just to give another option with switch (though, as this function is not vectorized, it needs a sapply nested within a lapply, which makes it neither "pretty" nor efficient...):
Basically, switch works with a numeric first argument as switch(myNumberToTest, caseIfOne, caseIfTwo, ...).
So what you need is:
df[, 2:3] <- lapply(2:3, function(x) sapply(df[, x], switch, "", "", "", names(df)[x]))
df
#  id unknownvarname1 unknownvarname2
#1  1                 unknownvarname2
#2  2
#3  3
#4  4 unknownvarname1
#5  5                 unknownvarname2
#6  6
#7  7
#8  8 unknownvarname1
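As a quick illustration of the numeric form of `switch` used above, the sketch below shows how it picks an alternative by position and why it has to be wrapped in an apply-style function to handle a whole column. The variable names and the placeholder `"colname"` are hypothetical.

```r
# With a number as first argument, switch() returns the alternative at that position
picked <- switch(2, "one", "two", "three")

# If the number is larger than the number of alternatives, NULL is returned
missing_case <- switch(4, "one", "two", "three")

# switch() takes a single value, so a vector needs an explicit loop;
# vapply() also guarantees a character result
values <- c(1, 4, 2)
out <- vapply(values,
              function(v) switch(v, "", "", "", "colname"),
              character(1))
out
```

One consequence worth keeping in mind: if a column could contain values outside 1 to 4, `switch` would return NULL for them, so the `sapply` version in the answer assumes the data only holds those four values.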
Yet another base R option, using ifelse within lapply (still looping over the columns, but vectorized within each column):
df <- data.frame(id=1:8,unknownvarname1=c(1:4,1:4),unknownvarname2=c(4:1,4:1))
df[,2:3] <- lapply(2:3, function(x) { ifelse(df[,x] < 4, "", colnames(df)[x]) })
gives
  id unknownvarname1 unknownvarname2
1  1                 unknownvarname2
2  2
3  3
4  4 unknownvarname1
5  5                 unknownvarname2
6  6
7  7
8  8 unknownvarname1
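Since the question says the column names (and their number) are not known in advance, the hard-coded `2:3` above could be replaced by something derived from the data. A minimal sketch, assuming the only column to leave untouched is called `id`; `value_cols` is a name introduced here for illustration:

```r
df <- data.frame(id = 1:8,
                 unknownvarname1 = c(1:4, 1:4),
                 unknownvarname2 = c(4:1, 4:1))

# Every column except the identifier (assumed to be named "id")
value_cols <- setdiff(names(df), "id")

# Map() walks over the columns and their names in parallel;
# ifelse() stays vectorized within each column
df[value_cols] <- Map(function(x, nm) ifelse(x == 4, nm, ""),
                      df[value_cols], value_cols)
df
```

This keeps the same per-column `ifelse` logic and should work unchanged for 40 columns as well as for 2.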
Another base R possibility using sweep:
idx <- df[, -1] == 4
sw <- sweep(idx, 2, 1:2, FUN = '*') + 1
df[, -1] <- c("", colnames(df[, -1]))[sw]
which gives:
> df
  id unknownvarname1 unknownvarname2
1  1                 unknownvarname2
2  2
3  3
4  4 unknownvarname1
5  5                 unknownvarname2
6  6
7  7
8  8 unknownvarname1
This could be shortened to:
sw <- sweep(df[, -1] == 4, 2, 1:2, FUN = '*') + 1
df[, -1] <- c("", colnames(df[, -1]))[sw]
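The `1:2` in the sweep call is tied to the two value columns of the example. A sketch of the same idea for an arbitrary number of columns, assuming the first column is the identifier; `lookup` is a name added here for readability:

```r
df <- data.frame(id = 1:8,
                 unknownvarname1 = c(1:4, 1:4),
                 unknownvarname2 = c(4:1, 4:1))

# TRUE where a value column holds a 4
idx <- as.matrix(df[-1]) == 4

# Multiply column j by j: 0 for "not a 4", j for "a 4 in column j";
# adding 1 turns that into a position in the lookup vector
sw <- sweep(idx, 2, seq_len(ncol(idx)), FUN = "*") + 1

# Position 1 is the empty string, position j + 1 is the name of column j
lookup <- c("", colnames(idx))
df[-1] <- lookup[sw]
df
```

Using `seq_len(ncol(idx))` instead of a literal range means nothing needs to change when the data frame has 40 value columns.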
A somewhat inefficient tidyverse option. This is inefficient because we need to manually select the columns later:
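library(dplyr)
library(purrr)

to_use <- names(df)[-1]
# Mark every 4 as NA and blank out everything else
new_df <- df %>%
  mutate_at(vars(contains("unknown")), list(~ ifelse(. == 4, NA, "")))
# Fill each NA with the name of its column
new_df[-1] <- map2(new_df[-1], to_use, function(x, y) replace(x, is.na(x), y))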
A less manual approach, which has the disadvantage of being non-specific (it is applied to every column, including id):
df %>%
map2(.,names(.), function(x, y) ifelse( x==4, y,"")) %>%
as.data.frame() %>%
mutate(id=row.names(.)) # might be a way around with `.id`
  id unknownvarname1 unknownvarname2
1  1                 unknownvarname2
2  2
3  3
4  4 unknownvarname1
5  5                 unknownvarname2
6  6
7  7
8  8 unknownvarname1
Result for approach 1:
new_df
  id unknownvarname1 unknownvarname2
1  1                 unknownvarname2
2  2
3  3
4  4 unknownvarname1
5  5                 unknownvarname2
6  6
7  7
8  8 unknownvarname1
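For what it's worth, newer dplyr versions (1.0.0 and later) offer `across()` together with `cur_column()`, which may be closer to what the question was trying to do with `mutate()`. This is a sketch under the assumption that such a dplyr version is installed and that the identifier column is named `id`; it is not taken from the answers above.

```r
library(dplyr)

df <- data.frame(id = 1:8,
                 unknownvarname1 = c(1:4, 1:4),
                 unknownvarname2 = c(4:1, 4:1))

# across(-id, ...) applies the function to every column except `id`;
# cur_column() gives the name of the column currently being processed
new_df <- df %>%
  mutate(across(-id, ~ ifelse(.x == 4, cur_column(), "")))
new_df
```

No column names have to be typed out, and `id` keeps its original type.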
Yet another option using col to line up the names and values:
sel <- df[-1] == 4
df[-1] <- ""
df[-1][sel] <- names(df[-1])[col(df[-1])[sel]]
#  id unknownvarname1 unknownvarname2
#1  1                 unknownvarname2
#2  2
#3  3
#4  4 unknownvarname1
#5  5                 unknownvarname2
#6  6
#7  7
#8  8 unknownvarname1
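Because the question is mainly about speed, it may help to time candidate approaches on data of roughly the size mentioned (30k rows, 40 value columns). The sketch below uses synthetic data and two helper functions, `by_column` and `by_matrix`, which are names invented here; they wrap the per-column `ifelse` idea and the `col()` idea from the answers. Timings depend on the machine, so none are quoted.

```r
set.seed(1)
n_rows <- 30000
n_vars <- 40

# Synthetic stand-in for the real data: an id plus 40 columns of 1..4
big <- data.frame(id = seq_len(n_rows),
                  matrix(sample(1:4, n_rows * n_vars, replace = TRUE),
                         nrow = n_rows))

# Per-column ifelse(), looping over columns and their names with Map()
by_column <- function(d) {
  d[-1] <- Map(function(x, nm) ifelse(x == 4, nm, ""), d[-1], names(d)[-1])
  d
}

# Whole-table approach: blank everything, then fill the 4s via col()
by_matrix <- function(d) {
  sel <- d[-1] == 4
  d[-1] <- ""
  d[-1][sel] <- names(d)[-1][col(sel)[sel]]
  d
}

res_column <- by_column(big)
res_matrix <- by_matrix(big)

# Both approaches should agree before their timings are compared
stopifnot(isTRUE(all.equal(res_column, res_matrix)))

print(system.time(by_column(big)))
print(system.time(by_matrix(big)))
```

For more careful measurements a dedicated benchmarking package would be a better fit than `system.time`, but this is enough to see whether an approach finishes in seconds rather than "ages".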