My dataset has two columns. POINT contains only two categorical values, 'Random' and 'Current', repeated throughout the dataset. ID contains 5-digit numeric values associated with the values in POINT. Some of the values in ID are repeated.
I cannot figure out R code that removes ONLY the 'Random' rows whose ID value also appears in a 'Current' row. So I would like the dataset below:
POINT | ID |
---|---|
Current | 45905 |
Current | 40817 |
Current | 55936 |
Current | 66608 |
Current | 66608 |
Random | 45905 |
Random | 40817 |
Random | 55936 |
Random | 66608 |
Random | 44456 |
to look like this:
POINT | ID |
---|---|
Current | 45905 |
Current | 40817 |
Current | 55936 |
Current | 66608 |
Current | 66608 |
Random | 44456 |
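Making use of dplyr, this could be achieved like so: split the data by POINT, then use anti_join to keep only the Random rows whose ID has no match among the Current rows.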
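library(dplyr)

d <- data.frame(
  stringsAsFactors = FALSE,
  POINT = c("Current","Current","Current",
            "Current","Current","Random","Random","Random",
            "Random","Random"),
  ID = c(45905L,40817L,55936L,66608L,
         66608L,45905L,40817L,55936L,66608L,44456L)
)

# One data frame per POINT value: d_split$Current and d_split$Random
d_split <- split(d, d$POINT)

# Keep only the Random rows whose ID does not occur among the Current rows
random_keep <- dplyr::anti_join(d_split$Random, d_split$Current, by = "ID")

# Put the untouched Current rows back on top of the remaining Random rows
d_final <- dplyr::bind_rows(d_split$Current, random_keep)
head(d_final)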
#> POINT ID
#> 1 Current 45905
#> 2 Current 40817
#> 3 Current 55936
#> 4 Current 66608
#> 5 Current 66608
#> 6 Random 44456
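The same idea can also be written without split(), by filtering the two groups directly. This is only a sketch of a variant, not part of the answer above; the name current_rows is introduced here for illustration.

```r
library(dplyr)

# Same example data as above
d <- data.frame(
  POINT = c(rep("Current", 5), rep("Random", 5)),
  ID = c(45905L, 40817L, 55936L, 66608L, 66608L,
         45905L, 40817L, 55936L, 66608L, 44456L),
  stringsAsFactors = FALSE
)

# All Current rows are kept as they are (including repeated IDs)
current_rows <- filter(d, POINT == "Current")

# Random rows survive only if their ID has no match among the Current rows
random_keep <- d %>%
  filter(POINT == "Random") %>%
  anti_join(current_rows, by = "ID")

d_final <- bind_rows(current_rows, random_keep)
```

With the example data, d_final should hold the five Current rows followed by the single Random row with ID 44456.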
If I understood you right, you could use dplyr to do this:
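library(dplyr)

# The formula interface of split() needs R >= 4.1.0;
# on older versions use split(your_data, your_data$POINT)
split_data <- split(your_data, ~ POINT)

# After the join, POINT.x is NA only for IDs that occur solely in Random
full_join(split_data$Current, split_data$Random, by = "ID") %>%
  transmute(POINT = coalesce(POINT.x, "Random"), ID)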
Returns:
# A tibble: 6 x 2
POINT ID
<chr> <int>
1 Current 45905
2 Current 40817
3 Current 55936
4 Current 66608
5 Current 66608
6 Random 44456
(Data used:)
your_data <- structure(list(POINT = c("Current", "Current", "Current", "Current", "Current", "Random", "Random", "Random", "Random", "Random"), ID = c(45905L, 40817L, 55936L, 66608L, 66608L, 45905L, 40817L, 55936L, 66608L, 44456L)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
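One caveat with the full_join approach, which does not affect the example data: if an ID occurs more than once among the Random rows and also occurs in Current, the join repeats the matching Current row once per match. The sketch below uses small made-up data (toy is hypothetical, not from the question) to compare this with anti_join.

```r
library(dplyr)

# Hypothetical data: ID 1 occurs twice among the Random rows
toy <- tibble(
  POINT = c("Current", "Current", "Random", "Random", "Random"),
  ID    = c(1L, 2L, 1L, 1L, 3L)
)
toy_split <- split(toy, toy$POINT)

# full_join: the Current row with ID 1 matches two Random rows,
# so it appears twice in the result
via_full_join <- full_join(toy_split$Current, toy_split$Random, by = "ID") %>%
  transmute(POINT = coalesce(POINT.x, "Random"), ID)

# anti_join: the Current rows are passed through untouched
via_anti_join <- bind_rows(
  toy_split$Current,
  anti_join(toy_split$Random, toy_split$Current, by = "ID")
)

nrow(via_full_join)  # one more row than via_anti_join
nrow(via_anti_join)
```

If repeated IDs within Random are possible in the real data, the anti_join or %in% variants in the other answers may be the safer choice.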
You can use a negated %in% to exclude the rows with POINT == "Random" whose ID also occurs in a row with POINT == "Current".
i <- D$POINT=="Current"      # TRUE for the Current rows
D[i | !D$ID %in% D$ID[i],]   # keep Current rows, plus rows whose ID is not among the Current IDs
# POINT ID
#1 Current 45905
#2 Current 40817
#3 Current 55936
#4 Current 66608
#5 Current 66608
#10 Random 44456
Data:
D <- data.frame(POINT = c("Current","Current","Current","Current","Current"
,"Random","Random","Random","Random","Random")
, ID = c(45905L,40817L,55936L,66608L,66608L,45905L,40817L,55936L,66608L,44456L))
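If the filter is needed in more than one place, the %in% idea can be wrapped in a small helper. This is a minimal sketch: the function name drop_matched and its keep_label argument are made up for illustration, and the columns are assumed to be called POINT and ID as in the question.

```r
# Hypothetical helper: keep every row labelled keep_label, and drop any other
# row whose ID also occurs among the keep_label rows.
drop_matched <- function(df, keep_label = "Current") {
  is_keep <- df$POINT == keep_label
  df[is_keep | !(df$ID %in% df$ID[is_keep]), , drop = FALSE]
}

D <- data.frame(
  POINT = c(rep("Current", 5), rep("Random", 5)),
  ID = c(45905L, 40817L, 55936L, 66608L, 66608L,
         45905L, 40817L, 55936L, 66608L, 44456L)
)

result <- drop_matched(D)
```

Because this is plain row indexing, the original row names are preserved, so the remaining Random row keeps row name 10 as in the output above.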