Find values from one column in another column according to ID in R
I have a data frame with multiple entries for each ID. Each entry has a reference number (NEW_REF) and an old reference number (OLD_REF). I need to find the most recent reference number for each ID, meaning the reference number that does not appear in the old reference number column.
ID <- c(1,2,3,4,1,3,5,2,4,1,3,4)
NEW_REF <- c("TS101","TS253","TS565","TS789","TD123","TS101","TD367","TH152","TD123","TF908","TD256","TS898")
OLD_REF <- c("TD123","TH152","TS101","TD123","TF908","TD256","TG232","TR142","TS898","TR268","TB496","TD969")
DF <- data.frame(ID,NEW_REF ,OLD_REF )
DF$Active_ind <- NA
DF$Active_ind[which(DF$NEW_REF %in% DF$OLD_REF )] <-"N" #if a reference number is in the old reference number column it is not active or not the most recent
DF$Active_ind[which(!(DF$NEW_REF %in% DF$OLD_REF ))] <-"Y" #if a reference number is not in the old reference number column it is active or the most recent
ID NEW_REF OLD_REF Active_ind
1 1 TS101 TD123 N
2 2 TS253 TH152 Y
3 3 TS565 TS101 Y
4 4 TS789 TD123 Y
5 1 TD123 TF908 N
6 3 TS101 TD256 N
7 5 TD367 TG232 Y
8 2 TH152 TR142 N
9 4 TD123 TS898 N
10 1 TF908 TR268 N
11 3 TD256 TB496 N
12 4 TS898 TD969 N
My problem is that ID 1 has a new reference TS101 (row 1) and ID 3 has an old reference TS101 (row 3). How do I check which reference number is the most recent per ID if the reference numbers are not unique across IDs?
I would like row 1 to have a Y in the Active_ind column:
ID NEW_REF OLD_REF Active_ind
1 1 TS101 TD123 Y
2 2 TS253 TH152 Y
3 3 TS565 TS101 Y
4 4 TS789 TD123 Y
5 1 TD123 TF908 N
6 3 TS101 TD256 N
7 5 TD367 TG232 Y
8 2 TH152 TR142 N
9 4 TD123 TS898 N
10 1 TF908 TR268 N
11 3 TD256 TB496 N
12 4 TS898 TD969 N
I know it is possible with a for loop, but I would like to avoid one: my data set has over 40,000 different IDs, and a loop becomes very time-consuming.
We can use dplyr to group the rows by ID, check within each group whether the value in NEW_REF is present in OLD_REF, and assign the indicator accordingly.
library(dplyr)
DF %>%
group_by(ID) %>%
mutate(Active_Ind = ifelse(NEW_REF %in% OLD_REF, "N", "Y"))
# ID NEW_REF OLD_REF Active_Ind
# <dbl> <fctr> <fctr> <chr>
# 1 TS101 TD123 Y
# 2 TS253 TH152 Y
# 3 TS565 TS101 Y
# 4 TS789 TD123 Y
# 1 TD123 TF908 N
# 3 TS101 TD256 N
# 5 TD367 TG232 Y
# 2 TH152 TR142 N
# 4 TD123 TS898 N
# 1 TF908 TR268 N
# 3 TD256 TB496 N
# 4 TS898 TD969 N
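If you would rather not depend on dplyr, the same per-ID check can be sketched in base R by pasting the ID and the reference number into a single key, so that a match only counts when it occurs under the same ID. This is a minimal sketch, not a benchmarked solution: the names `new_key`, `old_key` and `active` are made up for illustration, and it assumes the separator `"|"` never occurs in an ID or a reference number.

```r
# Rebuild the example data from the question.
DF <- data.frame(
  ID = c(1, 2, 3, 4, 1, 3, 5, 2, 4, 1, 3, 4),
  NEW_REF = c("TS101", "TS253", "TS565", "TS789", "TD123", "TS101",
              "TD367", "TH152", "TD123", "TF908", "TD256", "TS898"),
  OLD_REF = c("TD123", "TH152", "TS101", "TD123", "TF908", "TD256",
              "TG232", "TR142", "TS898", "TR268", "TB496", "TD969"),
  stringsAsFactors = FALSE
)

# Combine ID and reference into one key per row, so that TS101 under ID 1
# and TS101 under ID 3 become different values.
# Assumption: "|" does not appear in any ID or reference number.
new_key <- paste(DF$ID, DF$NEW_REF, sep = "|")
old_key <- paste(DF$ID, DF$OLD_REF, sep = "|")

# A new reference is inactive ("N") only if the same ID lists it as an old reference.
DF$Active_ind <- ifelse(new_key %in% old_key, "N", "Y")

# The active (most recent) reference number for each ID.
active <- DF[DF$Active_ind == "Y", c("ID", "NEW_REF")]
```

Because `paste()` and `%in%` are both vectorised, there is no explicit loop over IDs; on the example data, `active` holds rows 1, 2, 3, 4 and 7.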