I have a dataset with multiple rows and columns, and I want to extract the unique rows, ignoring NAs in one column in some cases and keeping them in others. Details are below.
dataset_A
e_id age fn ln custom_id
e1234 23 sur bab 1344789
e1234 23 sur bab 1344789
e1234 23 sur bab 1617
e1234 23 sur bab NA
e2345 22 nav kum NA
e2345 22 nav kum 52109
e2345 22 nav kum NA
e3456 21 ash kuma NA
e3456 21 ash kuma NA
e4567 23 anu kot NA
Expected_output
e_id age fn ln custom_id
e1234 23 sur bab 1344789
e1234 23 sur bab 1617
e2345 22 nav kum 52109
e3456 21 ash kuma NA
e4567 23 anu kot NA
Basically, I want to drop the rows where custom_id is NA if that e_id has at least one non-NA custom_id. If an e_id has only NA values in the custom_id column, I want to keep one row and drop the others.
Tried:
final_output = dataset_A[order(dataset_A$custom_id),]
final_output = final_output[!duplicated(final_output[,c(1:4)]),]
With the code above, some rows I need are lost, for example the row with custom_id 1617 for e_id e1234. Any help finding a solution would be appreciated.
We could use slice from dplyr: group by e_id and return only the first row if all values of custom_id are NA, else return all the non-NA rows, and then apply distinct to get the unique rows.
library(dplyr)
df %>%
  group_by(e_id) %>%
  slice(if (all(is.na(custom_id))) 1 else which(!is.na(custom_id))) %>%
  distinct()
# e_id age fn ln custom_id
# <fct> <int> <fct> <fct> <int>
#1 e1234 23 sur bab 1344789
#2 e1234 23 sur bab 1617
#3 e2345 22 nav kum 52109
#4 e3456 21 ash kuma NA
#5 e4567 23 anu kot NA
Maybe I have over-complicated the base R approach, but one using ave would be:
unique(df[with(df, ave(is.na(custom_id), e_id, FUN = function(x)
  if (all(x)) c(TRUE, rep(FALSE, length(x) - 1)) else
    replace(rep(TRUE, length(x)), x, FALSE))), ])
# e_id age fn ln custom_id
#1 e1234 23 sur bab 1344789
#3 e1234 23 sur bab 1617
#6 e2345 22 nav kum 52109
#8 e3456 21 ash kuma NA
#10 e4567 23 anu kot NA
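For comparison, here is a shorter base R sketch of the same idea (a variation, not the code above verbatim): keep a row if its custom_id is non-NA, or if every custom_id in its e_id group is NA, and let unique() collapse the duplicates. The sample data is rebuilt with read.table() so the snippet runs on its own; the names all_na, keep and result are only illustrative.

```r
# Rebuild the sample data from the question (column names as posted).
df <- read.table(text = "
e_id age fn ln custom_id
e1234 23 sur bab 1344789
e1234 23 sur bab 1344789
e1234 23 sur bab 1617
e1234 23 sur bab NA
e2345 22 nav kum NA
e2345 22 nav kum 52109
e2345 22 nav kum NA
e3456 21 ash kuma NA
e3456 21 ash kuma NA
e4567 23 anu kot NA
", header = TRUE, stringsAsFactors = FALSE)

# TRUE for every row whose e_id group contains only NA custom_id values.
all_na <- as.logical(ave(is.na(df$custom_id), df$e_id, FUN = all))

# Keep the non-NA rows, plus every row of the all-NA groups.
keep <- !is.na(df$custom_id) | all_na

# unique() drops exact duplicate rows, so each all-NA group ends up with one row.
result <- unique(df[keep, ])
print(result)
```

On the sample data this should return the same five rows as the output shown above. It assumes the rows within an all-NA group are otherwise identical, as they are in the question; if they were not, unique() would keep more than one of them.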
If I understood you correctly, you can use dplyr as follows:
library(dplyr)
data %>% filter(!is.na(custom_id)) %>% distinct()
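If you want to keep the NAs for ids that have no other value, you can add an if-else inside slice:
# use if/else, not ifelse(): ifelse() would return only the first matching index
Book2 %>% group_by(e_id) %>%
  slice(if (all(is.na(custom_id))) 1 else which(!is.na(custom_id))) %>%
  distinct()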
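One thing to check when an ifelse() call sits inside slice(): ifelse() returns a result shaped like its test, so with a length-one test such as all(is.na(custom_id)) it yields a single index even when several rows qualify. A minimal base R sketch of the difference, where the vector idx is made up for illustration:

```r
# Hypothetical positions of the non-NA rows within one group.
idx <- c(2L, 3L, 5L)

# ifelse() is vectorised over the test, so a length-one test gives a length-one result.
from_ifelse <- ifelse(FALSE, 1L, idx)

# if ... else simply returns whichever branch is chosen, at full length.
from_if <- if (FALSE) 1L else idx

print(from_ifelse)
print(from_if)
```

With ifelse(), a group like e1234 would therefore contribute only its first non-NA row, and the 1617 row would be lost again.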
Edit: Someone was faster than me so please go to the previous answer