
Removing rows where the data isn't sequential in R (dplyr)

I have a data frame where I am trying to remove rows where the year is not sequential.

Here is a sample of my data frame:

           Name Year Position Year_diff  FBv  ind1 velo_diff
1 Aaron Heilman 2005       RP         2 90.1  TRUE       0.0
2 Aaron Heilman 2003       SP        NA 89.4    NA       0.0
3  Aaron Laffey 2010       RP         1 86.8  TRUE      -0.6
4  Aaron Laffey 2009       SP        NA 87.4    NA       0.0
5  Alexi Ogando 2015       RP         2 94.5  TRUE       0.0
6  Alexi Ogando 2013       SP        NA 93.4 FALSE       0.0
7  Alexi Ogando 2012       RP         1 97.0  TRUE       1.9
8  Alexi Ogando 2011       SP        NA 95.1    NA       0.0

The expected output should be:

          Name Year Position Year_diff  FBv ind1 velo_diff
3 Aaron Laffey 2010       RP         1 86.8 TRUE      -0.6
4 Aaron Laffey 2009       SP        NA 87.4   NA       0.0
7 Alexi Ogando 2012       RP         1 97.0 TRUE       1.9
8 Alexi Ogando 2011       SP        NA 95.1   NA       0.0

Alexi Ogando's 2011-2012 rows are kept because his SP-to-RP switch happened in consecutive years. His 2013-2015 SP-to-RP switch did not happen in consecutive years, so those rows should be dropped.

One thing that might help: for every SP-to-RP sequence where the years aren't consecutive, velo_diff is 0.0.

Would anybody know how to do this? All help is appreciated.

You can do a grouped filter, checking whether the next or previous year exists for that player and whether the Position matches accordingly:

library(dplyr)

df <- read.table(text = 'Name       Year Position Year_diff  FBv     ind1  velo_diff
1     "Aaron Heilman" 2005       RP         2  90.1    TRUE      0.0
2     "Aaron Heilman" 2003       SP         NA 89.4      NA      0.0 
3     "Aaron Laffey"  2010       RP         1  86.8    TRUE     -0.6 
4     "Aaron Laffey"  2009       SP         NA 87.4      NA      0.0
5     "Alexi Ogando"  2015       RP         2  94.5    TRUE      0.0
6     "Alexi Ogando"  2013       SP         NA 93.4   FALSE      0.0
7     "Alexi Ogando"  2012       RP         1  97.0    TRUE      1.9
8     "Alexi Ogando"  2011       SP         NA 95.1      NA      0.0', header = TRUE)

df %>% group_by(Name) %>% 
    filter(((Year - 1) %in% Year & Position == 'RP') | 
           ((Year + 1) %in% Year & Position == 'SP'))

#> Source: local data frame [4 x 7]
#> Groups: Name [2]
#> 
#>           Name  Year Position Year_diff   FBv  ind1 velo_diff
#>         <fctr> <int>   <fctr>     <int> <dbl> <lgl>     <dbl>
#> 1 Aaron Laffey  2010       RP         1  86.8  TRUE      -0.6
#> 2 Aaron Laffey  2009       SP        NA  87.4    NA       0.0
#> 3 Alexi Ogando  2012       RP         1  97.0  TRUE       1.9
#> 4 Alexi Ogando  2011       SP        NA  95.1    NA       0.0
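
To see why the filter condition keeps only the 2011-2012 pair for Alexi Ogando, it can help to evaluate it by hand on one player's values. The following is a minimal base R sketch, not part of the original answer; the vector names (year, position, has_prev, has_next, keep) are made up for illustration.

```r
# Alexi Ogando's rows from the sample data, as plain vectors
# (illustrative names, not taken from the answer above).
year     <- c(2015, 2013, 2012, 2011)
position <- c("RP", "SP", "RP", "SP")

# Does this player also have a row for the previous / next year?
has_prev <- (year - 1) %in% year
has_next <- (year + 1) %in% year

# An RP row needs a row for the year before it;
# an SP row needs a row for the year after it.
keep <- (has_prev & position == "RP") | (has_next & position == "SP")
keep
```

Note that this condition only asks whether an adjacent year exists for the player. It does not confirm that the adjacent row has the other position. That holds in the sample data, but it is an assumption for any other data set.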

We can use data.table (df1 here is the data frame from the question):

library(data.table)
setDT(df1)[df1[, .I[abs(diff(Year))==1], .(Name, grp  = cumsum(Position == "RP"))]$V1]
#           Name Year Position Year_diff  FBv ind1 velo_diff
#1: Aaron Laffey 2010       RP         1 86.8 TRUE      -0.6
#2: Aaron Laffey 2009       SP        NA 87.4   NA       0.0
#3: Alexi Ogando 2012       RP         1 97.0 TRUE       1.9
#4: Alexi Ogando 2011       SP        NA 95.1   NA       0.0

Or, using the same approach with dplyr:

library(dplyr)
df1 %>%
   group_by(Name, grp = cumsum(Position == "RP")) %>%  
   filter(abs(diff(Year)) == 1) %>%  # the two steps below may not be needed
   ungroup() %>%
   select(-grp)
# A tibble: 4 × 7
#           Name  Year Position Year_diff   FBv  ind1 velo_diff
#          <chr> <int>    <chr>     <int> <dbl> <lgl>     <dbl>
#1 Aaron Laffey  2010       RP         1  86.8  TRUE      -0.6
#2 Aaron Laffey  2009       SP        NA  87.4    NA       0.0
#3 Alexi Ogando  2012       RP         1  97.0  TRUE       1.9
#4 Alexi Ogando  2011       SP        NA  95.1    NA       0.0
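
The grouping trick in the data.table and dplyr versions above can be easier to follow on plain vectors. Here is a small base R sketch, assuming (as in the sample data) that each RP row is immediately followed by its matching SP row; the names position, year, grp, gap and keep are illustrative only.

```r
# Toy vectors mirroring Alexi Ogando's four rows in the sample data.
position <- c("RP", "SP", "RP", "SP")
year     <- c(2015, 2013, 2012, 2011)

# Every "RP" starts a new block, so the running count labels each RP/SP pair.
grp <- cumsum(position == "RP")

# Absolute year gap within each pair (assumes exactly two rows per pair).
gap <- tapply(year, grp, function(y) abs(diff(y)))

# Keep the rows whose pair is exactly one year apart.
keep <- as.vector(gap[as.character(grp)] == 1)
keep
```

If a block could contain a single row or more than two rows, diff() would return zero or several values per block, so the one-value-per-pair assumption would need an explicit check before filtering.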
