
Filtering duplicated rows conditionally

What would be a good tidyverse approach to this type of problem? Within each group, I want to drop the duplicated rows that contain an NA (keeping the row that has values for both var1 and var2), but keep a group's row when that group has no duplicates, even if the row contains an NA. dat is the raw example, and expected_output shows what I'd hope to get.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)

dat <- tibble::tribble(
  ~group, ~var1, ~var2,
  "A", "foo", NA,
  "A", "foo", "bar",
  "B", "foo", NA,
  "C", NA, "bar",
  "C", "foo", "bar",
  "D", NA, "bar",
  "E", "foo", "bar",
  "E", NA, "bar"
)



expected_output <- tibble::tribble(
  ~group, ~var1, ~var2,
  "A", "foo", "bar",
  "B", "foo", NA,
  "C", "foo", "bar",
  "D", NA, "bar",
  "E", "foo", "bar"
)
expected_output
#> # A tibble: 5 x 3
#>   group var1  var2 
#>   <chr> <chr> <chr>
#> 1 A     foo   bar  
#> 2 B     foo   <NA> 
#> 3 C     foo   bar  
#> 4 D     <NA>  bar  
#> 5 E     foo   bar

Any suggestions or ideas?

Solution 1 - if the rows to drop can appear at any position within a group (e.g. first, last, or somewhere in between)

dat %>%
  arrange(group,var1,var2) %>% 
  group_by(group) %>% 
  slice_head() %>% 
  ungroup()

Output:

# A tibble: 5 x 3
  group var1  var2 
  <chr> <chr> <chr>
1 A     foo   bar  
2 B     foo   NA   
3 C     foo   bar  
4 D     NA    bar  
5 E     foo   bar  
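
The sort above works because arrange() places NA last, so the first row in each group has a non-missing var1 (and then var2) whenever one exists. Because it also sorts on the values themselves, an incomplete row whose var1 sorts alphabetically before a complete row's var1 would still come first. A minimal sketch that sorts on an explicit count of missing values instead, assuming "most complete" means the fewest NAs across var1 and var2; n_missing and result are names introduced here for illustration:

```r
library(dplyr)
library(tibble)

dat <- tribble(
  ~group, ~var1, ~var2,
  "A", "foo", NA,
  "A", "foo", "bar",
  "B", "foo", NA,
  "C", NA, "bar",
  "C", "foo", "bar",
  "D", NA, "bar",
  "E", "foo", "bar",
  "E", NA, "bar"
)

result <- dat %>%
  # count the missing values per row (n_missing is a hypothetical helper column)
  mutate(n_missing = is.na(var1) + is.na(var2)) %>%
  # within each group, put the most complete row first
  arrange(group, n_missing) %>%
  # keep only the first row encountered for each group
  distinct(group, .keep_all = TRUE) %>%
  select(-n_missing)

result
```

On dat this should return the same five rows as expected_output, without depending on how the values of var1 and var2 happen to sort.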

Solution 2 - if the row to keep is always the last row of its group

You can use duplicated() with fromLast = TRUE so that, within each group, only the last row is not flagged as a duplicate. Negating the resulting logical vector and using it as a row index drops the earlier rows:

dat[!duplicated(dat$group, fromLast = TRUE), ]

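This gives the requested output for groups A to D, where the complete row comes last (group E in dat, whose complete row comes first, would not be handled correctly):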

# A tibble: 4 x 3
  group var1  var2 
  <chr> <chr> <chr>
1 A     foo   bar  
2 B     foo   NA   
3 C     foo   bar  
4 D     NA    bar
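
If the complete row is not guaranteed to come last, one way around it is to order the rows by completeness before calling duplicated(). A sketch in base R under that assumption; n_missing, ordered and result are names introduced here, not part of the original answer:

```r
library(tibble)

dat <- tribble(
  ~group, ~var1, ~var2,
  "A", "foo", NA,
  "A", "foo", "bar",
  "B", "foo", NA,
  "C", NA, "bar",
  "C", "foo", "bar",
  "D", NA, "bar",
  "E", "foo", "bar",
  "E", NA, "bar"
)

# number of missing values in each row across var1 and var2
n_missing <- rowSums(is.na(dat[c("var1", "var2")]))

# sort by group, then by completeness, so the most complete row comes first
ordered <- dat[order(dat$group, n_missing), ]

# keep the first occurrence of each group
result <- ordered[!duplicated(ordered$group), ]
result
```

Here the default fromLast = FALSE is used, because after sorting the row to keep is the first one in each group rather than the last.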

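Another option is to keep, within each group, the row with the most non-missing values (note that n has to be passed by name):

dat %>% 
 group_by(group) %>%
 slice_max(rowSums(!is.na(across(c(var1, var2)))), n = 1)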

  group var1  var2 
  <chr> <chr> <chr>
1 A     foo   bar  
2 B     foo   <NA> 
3 C     foo   bar  
4 D     <NA>  bar  
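
By default slice_max() keeps ties, so a group whose rows are equally complete would return more than one row. A minimal sketch of forcing a single row per group with with_ties = FALSE; the tied tibble and its group "F" are made-up illustration data, not part of the question:

```r
library(dplyr)
library(tibble)

# hypothetical group in which both rows have exactly one non-missing value
tied <- tribble(
  ~group, ~var1, ~var2,
  "F", "foo", NA,
  "F", NA, "bar"
)

# default behaviour: tied rows are all returned
with_ties_kept <- tied %>%
  group_by(group) %>%
  slice_max((!is.na(var1)) + (!is.na(var2)), n = 1) %>%
  ungroup()

# with_ties = FALSE: at most n rows per group, even when scores are tied
one_per_group <- tied %>%
  group_by(group) %>%
  slice_max((!is.na(var1)) + (!is.na(var2)), n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties_kept)  # 2
nrow(one_per_group)   # 1
```

Whether a tie should yield one row or several depends on what the downstream analysis expects, so this is a choice to make deliberately rather than a correction.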
