
How to extract unique rows by ignoring NA's in R

I have a dataset with multiple rows and columns, and I want to extract the unique rows, ignoring NAs in one column in some cases and keeping them in others. Details below.

dataset_A

e_id      age    fn    ln     custom_id
e1234     23     sur   bab    1344789
e1234     23     sur   bab    1344789
e1234     23     sur   bab    1617
e1234     23     sur   bab    NA
e2345     22     nav   kum    NA
e2345     22     nav   kum    52109
e2345     22     nav   kum    NA
e3456     21     ash   kuma   NA
e3456     21     ash   kuma   NA
e4567     23     anu   kot    NA

Expected_output

e_id      age    fn    ln     custom_id
e1234     23     sur   bab    1344789
e1234     23     sur   bab    1617
e2345     22     nav   kum    52109
e3456     21     ash   kuma   NA
e4567     23     anu   kot    NA

Basically, I want to ignore rows with NA in custom_id if a non-NA custom_id is present for that e_id. If an e_id has only NA values in the custom_id column, I want to keep one row and drop the rest.
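For anyone who wants to reproduce this, the sample can be read in roughly as follows. This is only a sketch: the column types (integer age and custom_id, character for the rest) are an assumption, since the question does not state them.

```r
# Rebuild the sample from the question; "NA" is parsed as a missing value by default
dataset_A <- read.table(text = "
e_id   age  fn   ln    custom_id
e1234  23   sur  bab   1344789
e1234  23   sur  bab   1344789
e1234  23   sur  bab   1617
e1234  23   sur  bab   NA
e2345  22   nav  kum   NA
e2345  22   nav  kum   52109
e2345  22   nav  kum   NA
e3456  21   ash  kuma  NA
e3456  21   ash  kuma  NA
e4567  23   anu  kot   NA
", header = TRUE, stringsAsFactors = FALSE)

str(dataset_A)
```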

Tried:

final_output = dataset_A[order(dataset_A$custom_id),]
final_output = final_output[!duplicated(final_output[,c(1:4)]),]

With the code above, I lose some rows from my dataset, such as custom_id 1617 for e_id e1234. Any help finding a solution would be appreciated.
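A likely reason, sketched on the sample data (column types are an assumption): duplicated() on columns 1 to 4 only looks at e_id, age, fn and ln, so at most one row per person is kept, and e1234 cannot retain both of its custom_id values. Which one survives depends on how custom_id sorts.

```r
# Sample data from the question (integer custom_id is assumed)
dataset_A <- data.frame(
  e_id = c(rep("e1234", 4), rep("e2345", 3), rep("e3456", 2), "e4567"),
  age  = c(rep(23L, 4), rep(22L, 3), rep(21L, 2), 23L),
  fn   = c(rep("sur", 4), rep("nav", 3), rep("ash", 2), "anu"),
  ln   = c(rep("bab", 4), rep("kum", 3), rep("kuma", 2), "kot"),
  custom_id = c(1344789L, 1344789L, 1617L, NA, NA, 52109L, NA, NA, NA, NA)
)

# The attempt from the question: sort by custom_id (NAs go last),
# then drop rows whose first four columns were already seen
final_output <- dataset_A[order(dataset_A$custom_id), ]
final_output <- final_output[!duplicated(final_output[, 1:4]), ]

nrow(final_output)          # one row per person, not one per distinct custom_id
table(final_output$e_id)
```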

We could use slice from dplyr, grouping by e_id: return only the first row if all values of custom_id are NA, otherwise return all the non-NA rows, and then apply distinct to get the unique rows.

library(dplyr)
df %>%
  group_by(e_id) %>%
  slice(if(all(is.na(custom_id))) 1 else which(!is.na(custom_id))) %>%
  distinct()

#   e_id    age fn    ln    custom_id
#  <fct> <int> <fct> <fct>     <int>
#1 e1234    23 sur   bab     1344789
#2 e1234    23 sur   bab        1617
#3 e2345    22 nav   kum       52109
#4 e3456    21 ash   kuma         NA
#5 e4567    23 anu   kot          NA

And maybe I have over-complicated the base R approach but one using ave would be

# Per e_id: if every custom_id is NA keep only the first row, otherwise keep the non-NA rows
unique(df[with(df, ave(is.na(custom_id), e_id, FUN = function(x)
  if (all(x)) c(TRUE, rep(FALSE, length(x) - 1)) else
    replace(rep(TRUE, length(x)), x, FALSE))), ])


#    e_id age  fn   ln custom_id
#1  e1234  23 sur  bab   1344789
#3  e1234  23 sur  bab      1617
#6  e2345  22 nav  kum     52109
#8  e3456  21 ash kuma        NA
#10 e4567  23 anu  kot        NA
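A possibly easier-to-read base R variant, offered as a sketch rather than a replacement for the above: flag each e_id that has at least one non-NA custom_id, keep the non-NA rows plus the rows of any e_id with no custom_id at all, then de-duplicate. The names has_id and keep are just illustrative, and df is rebuilt here from the question's sample with assumed column types.

```r
# Sample data from the question (integer custom_id is assumed)
df <- data.frame(
  e_id = c(rep("e1234", 4), rep("e2345", 3), rep("e3456", 2), "e4567"),
  age  = c(rep(23L, 4), rep(22L, 3), rep(21L, 2), 23L),
  fn   = c(rep("sur", 4), rep("nav", 3), rep("ash", 2), "anu"),
  ln   = c(rep("bab", 4), rep("kum", 3), rep("kuma", 2), "kot"),
  custom_id = c(1344789L, 1344789L, 1617L, NA, NA, 52109L, NA, NA, NA, NA)
)

# TRUE on every row whose e_id has at least one non-NA custom_id
has_id <- ave(!is.na(df$custom_id), df$e_id, FUN = any)

# Keep non-NA rows, plus all rows of e_ids that only have NAs
keep <- !is.na(df$custom_id) | !has_id

# unique() collapses exact duplicates, including the repeated all-NA rows
result <- unique(df[keep, ])
result
```

Because the all-NA groups consist of identical rows in this sample, unique() reduces each of them to a single row. If the other columns could differ within an e_id, that step would need more care.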

If I understood you correctly, you can use dplyr as follows:

library(dplyr)
# Drops every row with a missing custom_id, then removes exact duplicates
data %>% filter(!is.na(custom_id)) %>% distinct()

If you want to keep the NAs, you can add if ... else to the slice command (a plain if/else, not ifelse, which would return only one row position per group):

Book2 %>%
  group_by(e_id) %>%
  slice(if (all(is.na(custom_id))) 1 else which(!is.na(custom_id))) %>%
  distinct()
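To see why a plain if/else is needed inside slice() rather than ifelse(), here is a small base R sketch. The vectors are made up for illustration. ifelse() returns a result as long as its test, so a length-one test yields a single value.

```r
idx    <- c(2L, 3L)   # hypothetical positions of the non-NA rows in one group
all_na <- FALSE       # length-one test, like all(is.na(custom_id))

via_ifelse <- ifelse(all_na, 1L, idx)   # cut down to length(all_na), i.e. one value
via_if     <- if (all_na) 1L else idx   # returns idx unchanged

via_ifelse
via_if
```

With ifelse(), a group such as e1234 would therefore contribute only its first non-NA row.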

Edit: Someone was faster than me so please go to the previous answer
