Replace values in a dataset based off an index of values in another using base R

df <- structure(list(ID = c(123, 5345, 234, 453, 3656, 345), diagnosis_1 = c("B657", 
"B658", "B659", "B660", "B661", "B662"), diagnosis_2 = c("F8827", 
"G432", NA, "B657", NA, "H8940"), diagnosis_3 = c(NA, "B657", 
NA, NA, NA, "G432"), diagnosis_4 = c(NA, NA, NA, NA, NA, "B657"
), diagnosis_5 = c(NA, NA, NA, NA, NA, NA), diagnosis_6 = c(NA, 
NA, NA, NA, NA, NA), diagnosis_7 = c(NA, NA, NA, NA, NA, NA), 
    diagnosis_8 = c(NA, NA, NA, NA, NA, NA), diagnosis_9 = c(NA, 
    NA, NA, NA, NA, NA), diagnosis_10 = c(NA, NA, NA, NA, NA, 
    NA), diagnosis_11 = c(NA, NA, NA, NA, NA, NA), diagnosis_12 = c(NA, 
    NA, NA, NA, NA, NA), diagnosis_13 = c(NA, NA, NA, NA, NA, 
    NA), age = c(54, 65, 23, 22, 33, 77)), row.names = c(NA, 
-6L), class = "data.frame")

I would like to replace the values in the diagnosis columns with the values from this index:

B657    1
B658    2
B659    3
B660    4
B661    5
B662    1
F8827   3
G432    3
H8940   4

In reality the table has thousands of rows, and I deal with other tables that have a variable number of diagnosis columns, so a solution that is agnostic to the number of columns would be ideal. The index is also up to a few hundred entries long.

If the index table were divided like this:

1 B657, B662
2 B658
3 B659, F8827, G432 
4 B660 H8940    
5 B661

Would that make a difference to the way it is coded?

The desired output would look like this:

[image: desired output]

Many thanks

You can try this:

df_replace <- read.table(text="B657    1
B658    2
B659    3
B660    4
B661    5
B662    1
F8827   3
G432    3
H8940   4", stringsAsFactors = FALSE)

vec_repl <-  as.character(df_replace$V2)
names(vec_repl) <- df_replace$V1

library(tidyverse)
df %>% 
  mutate_at(vars(starts_with("diag")), ~str_replace_all(., vec_repl))
    ID diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4 diagnosis_5 diagnosis_6 diagnosis_7 diagnosis_8 diagnosis_9 diagnosis_10 diagnosis_11 diagnosis_12 diagnosis_13
1  123           1           3        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         <NA>         <NA>         <NA>         <NA>
2 5345           2           3           1        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         <NA>         <NA>         <NA>         <NA>
3  234           3        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         <NA>         <NA>         <NA>         <NA>
4  453           4           1        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         <NA>         <NA>         <NA>         <NA>
5 3656           5        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         <NA>         <NA>         <NA>         <NA>
6  345           1           4           3           1        <NA>        <NA>        <NA>        <NA>        <NA>         <NA>         <NA>         <NA>         <NA>
  age
1  54
2  65
3  23
4  22
5  33
6  77

In base R, with the additional package stringr, you can try this:

library(stringr)  # str_replace_all() comes from stringr
df2 <- df
# use -c(1, ncol(df)) to drop the first (ID) and last (age) columns, leaving only the diagnosis columns
df2[,-c(1,ncol(df))] <- lapply(df[,-c(1,ncol(df))], function(x) str_replace_all(x, vec_repl))
head(df2)
    ID diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4 diagnosis_5 diagnosis_6 diagnosis_7 diagnosis_8 diagnosis_9 diagnosis_10 diagnosis_11 diagnosis_12 diagnosis_13
1  123           1           3        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         <NA>         <NA>         <NA>         <NA>
2 5345           2           3           1        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         <NA>         <NA>         <NA>         <NA>
3  234           3        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         <NA>         <NA>         <NA>         <NA>
4  453           4           1        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         <NA>         <NA>         <NA>         <NA>
5 3656           5        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         <NA>         <NA>         <NA>         <NA>
6  345           1           4           3           1        <NA>        <NA>        <NA>        <NA>        <NA>         <NA>         <NA>         <NA>         <NA>
  age
1  54
2  65
3  23
4  22
5  33
6  77
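
If loading stringr is not an option, the same kind of named vector can serve as a plain lookup in base R. The following is only a minimal sketch and not part of the answer above: `demo`, `lookup` and `diag_cols` are hypothetical names, and the tiny data frame stands in for `df`. Unlike `str_replace_all`, which treats the names as regular expressions, indexing by name matches whole codes only, and codes missing from the lookup become NA.

```r
# Small stand-in for df (hypothetical data, same layout as in the question)
demo <- data.frame(ID = c(1, 2),
                   diagnosis_1 = c("B657", "B658"),
                   diagnosis_2 = c("G432", NA),
                   age = c(54, 65),
                   stringsAsFactors = FALSE)

# Named vector: names are the old codes, values are the replacements
lookup <- c(B657 = 1, B658 = 2, G432 = 3)

# Select the diagnosis columns by name, so any number of them works
diag_cols <- grep("^diagnosis_", names(demo))

# Index the lookup by code, column by column; unname() drops the code names.
# as.character() matters for all-NA columns, which R stores as logical.
demo[diag_cols] <- lapply(demo[diag_cols],
                          function(col) unname(lookup[as.character(col)]))
demo
```

Selecting columns with `grep` on the name rather than by position keeps the code independent of where the ID and age columns sit.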

One possible solution is to first construct a vector tab_vec with the old values as names and the new values as the actual values. Afterwards, we can use the recode function from the package dplyr (version >= 1.0.0) and apply it across the variables whose names start with the string "diagnosis".

tab <- read.table(text="B657    1
B658    2
B659    3
B660    4
B661    5
B662    1
F8827   3
G432    3
H8940   4", header=FALSE)

# create vector of replacements
tab_vec <- as.numeric(tab$V2)
names(tab_vec) <- tab$V1
tab_vec 

# substitute the replacement values in the data frame df
library(dplyr)  # needed so that recode() is found inside mutate()
mutate(df, across(starts_with("diagnosis"), ~recode(as.character(.x), !!!tab_vec)))

Output:

    ID diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4 diagnosis_5 diagnosis_6 diagnosis_7 diagnosis_8 diagnosis_9 diagnosis_10 diagnosis_11 diagnosis_12 diagnosis_13 age
1  123           1           3          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  54
2 5345           2           3           1          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  65
3  234           3          NA          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  23
4  453           4           1          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  22
5 3656           5          NA          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  33
6  345           1           4           3           1          NA          NA          NA          NA          NA           NA           NA           NA           NA  77

You can use match to change content using a lookup table.

i <- startsWith(colnames(x), "diagnosis_")
x[,i] <- y[match(unlist(x[,i]), y[,1]),2]
x
#    ID diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4 diagnosis_5 diagnosis_6 diagnosis_7 diagnosis_8 diagnosis_9 diagnosis_10 diagnosis_11 diagnosis_12 diagnosis_13 age
#1  123           1           3          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  54
#2 5345           2           3           1          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  65
#3  234           3          NA          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  23
#4  453           4           1          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  22
#5 3656           5          NA          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  33
#6  345           1           4           3           1          NA          NA          NA          NA          NA           NA           NA           NA           NA  77
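
To see why this works, it may help to look at `match` on a small vector. The following is a sketch with made-up values (`codes`, `lookup` and `pos` are hypothetical names, not from the answer): `match` returns, for each code, its row position in the lookup, and that position is then used to pull the replacement.

```r
# Hypothetical lookup table with the same two-column layout as y
lookup <- data.frame(V1 = c("B657", "B658", "G432"),
                     V2 = c(1L, 2L, 3L),
                     stringsAsFactors = FALSE)

codes <- c("G432", NA, "B657", "X999")   # X999 is not in the lookup

# Position of each code in the first lookup column (NA when absent)
pos <- match(codes, lookup$V1)
pos
# [1]  3 NA  1 NA

# Use the positions to fetch the replacement values
lookup$V2[pos]
# [1]  3 NA  1 NA
```

Both missing values and codes that do not appear in the lookup end up as NA, which is why the all-NA diagnosis columns stay NA in the output above.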

And in case the lookup has the other structure given in the question:

zz <- strsplit(z, "[, ]+")
zz <- setNames(rep(seq_along(zz), lengths(zz)), unlist(zz))
i <- startsWith(colnames(x), "diagnosis_")
x[,i] <- zz[unlist(x[,i])]
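
The intermediate steps can be inspected on a shorter input. This is a sketch with a hypothetical two-line `z_demo` (the names `parts` and `lookup` are also made up for illustration): each line number becomes the replacement value for every code found on that line.

```r
# Hypothetical grouped index: line 1 holds the codes for value 1, line 2 for value 2
z_demo <- c("B657, B662", "B658")

# Split each line on commas and/or spaces
parts <- strsplit(z_demo, "[, ]+")

# Repeat each line number once per code on that line, and name it by the code
lookup <- setNames(rep(seq_along(parts), lengths(parts)), unlist(parts))
lookup
# B657 B662 B658
#    1    1    2

# Look up by name; NA and unknown codes give NA
unname(lookup[c("B662", NA, "B658")])
# [1]  1 NA  2
```

Because the result is an ordinary named vector, the grouped layout only changes how the lookup is built, not how it is applied.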

In case some codes are not found in the lookup and you don't want to set them to NA:

i <- startsWith(colnames(x), "diagnosis_")
j <- match(unlist(x[,i]), y[,1])
k <- !is.na(j)
tt <- unlist(x[,i])
tt[k] <- y[j[k],2]
x[,i] <- tt
rm(i, j, k, tt)
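
Since the question asks for something that works regardless of the number of diagnosis columns, the steps above could be wrapped in a small helper. This is a sketch only: `recode_columns` and its arguments are hypothetical names, and it assumes the lookup has the codes in column 1 and the replacements in column 2, as `y` does.

```r
# Hypothetical helper: replace codes in all columns whose names start with
# `prefix`, using a two-column lookup (codes in column 1, values in column 2).
# With keep_unmatched = TRUE, codes absent from the lookup are left as they are.
recode_columns <- function(data, lookup, prefix = "diagnosis_",
                           keep_unmatched = FALSE) {
  i <- startsWith(colnames(data), prefix)
  old <- unlist(data[, i, drop = FALSE], use.names = FALSE)
  j <- match(old, lookup[, 1])
  if (keep_unmatched) {
    new <- as.character(old)
    k <- !is.na(j)
    new[k] <- as.character(lookup[j[k], 2])
  } else {
    new <- lookup[j, 2]
  }
  # Assigning a vector fills the selected columns in column-major order
  data[, i] <- new
  data
}

demo <- data.frame(ID = 1:2,
                   diagnosis_1 = c("B657", "X999"),
                   diagnosis_2 = c("G432", NA),
                   stringsAsFactors = FALSE)
lookup <- data.frame(V1 = c("B657", "G432"), V2 = c(1L, 3L),
                     stringsAsFactors = FALSE)

recode_columns(demo, lookup)                         # X999 becomes NA
recode_columns(demo, lookup, keep_unmatched = TRUE)  # X999 is kept
```

Note that with `keep_unmatched = TRUE` the columns stay character, because replaced and unreplaced codes have to share one type.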

Data:

x <- structure(list(ID = c(123, 5345, 234, 453, 3656, 345), diagnosis_1 = c("B657", 
"B658", "B659", "B660", "B661", "B662"), diagnosis_2 = c("F8827", 
"G432", NA, "B657", NA, "H8940"), diagnosis_3 = c(NA, "B657", 
NA, NA, NA, "G432"), diagnosis_4 = c(NA, NA, NA, NA, NA, "B657"
), diagnosis_5 = c(NA, NA, NA, NA, NA, NA), diagnosis_6 = c(NA, 
NA, NA, NA, NA, NA), diagnosis_7 = c(NA, NA, NA, NA, NA, NA), 
    diagnosis_8 = c(NA, NA, NA, NA, NA, NA), diagnosis_9 = c(NA, 
    NA, NA, NA, NA, NA), diagnosis_10 = c(NA, NA, NA, NA, NA, 
    NA), diagnosis_11 = c(NA, NA, NA, NA, NA, NA), diagnosis_12 = c(NA, 
    NA, NA, NA, NA, NA), diagnosis_13 = c(NA, NA, NA, NA, NA, 
    NA), age = c(54, 65, 23, 22, 33, 77)), row.names = c(NA, 
-6L), class = "data.frame")
y <- read.table(text="B657    1
B658    2
B659    3
B660    4
B661    5
B662    1
F8827   3
G432    3
H8940   4")
z <- readLines(con=textConnection("B657, B662
B658
B659, F8827, G432
B660 H8940
B661"))
