I have a data frame where I am trying to remove rows where the year is not sequential.
Here is a sample of my data frame:
Name Year Position Year_diff FBv ind1 velo_diff
1 Aaron Heilman 2005 RP 2 90.1 TRUE 0.0
2 Aaron Heilman 2003 SP NA 89.4 NA 0.0
3 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
4 Aaron Laffey 2009 SP NA 87.4 NA 0.0
5 Alexi Ogando 2015 RP 2 94.5 TRUE 0.0
6 Alexi Ogando 2013 SP NA 93.4 FALSE 0.0
7 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
8 Alexi Ogando 2011 SP NA 95.1 NA 0.0
The expected output should be:
Name Year Position Year_diff FBv ind1 velo_diff
3 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
4 Aaron Laffey 2009 SP NA 87.4 NA 0.0
7 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
8 Alexi Ogando 2011 SP NA 95.1 NA 0.0
Alexi Ogando's 2011-2012 rows are kept because his SP to RP sequence occurs in consecutive years. His 2013-2015 SP to RP sequence does not occur in consecutive years, so those rows are removed.
One thing that might help: in every sequence where the years are not sequential, velo_diff is 0.0.
Would anybody know how to do this? All help is appreciated.
You can do a grouped filter, checking whether the subsequent or previous year exists and whether the Position matches accordingly:
library(dplyr)
df <- read.table(text = 'Name Year Position Year_diff FBv ind1 velo_diff
1 "Aaron Heilman" 2005 RP 2 90.1 TRUE 0.0
2 "Aaron Heilman" 2003 SP NA 89.4 NA 0.0
3 "Aaron Laffey" 2010 RP 1 86.8 TRUE -0.6
4 "Aaron Laffey" 2009 SP NA 87.4 NA 0.0
5 "Alexi Ogando" 2015 RP 2 94.5 TRUE 0.0
6 "Alexi Ogando" 2013 SP NA 93.4 FALSE 0.0
7 "Alexi Ogando" 2012 RP 1 97.0 TRUE 1.9
8 "Alexi Ogando" 2011 SP NA 95.1 NA 0.0', header = TRUE)
df %>% group_by(Name) %>%
    filter(((Year - 1) %in% Year & Position == 'RP') |
           ((Year + 1) %in% Year & Position == 'SP'))
#> Source: local data frame [4 x 7]
#> Groups: Name [2]
#>
#> Name Year Position Year_diff FBv ind1 velo_diff
#> <fctr> <int> <fctr> <int> <dbl> <lgl> <dbl>
#> 1 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
#> 2 Aaron Laffey 2009 SP NA 87.4 NA 0.0
#> 3 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
#> 4 Alexi Ogando 2011 SP NA 95.1 NA 0.0
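For readers without dplyr, the same check can be sketched in base R. This is only an illustration, assuming that a Name and Year pair identifies a single row. The helper names key, prev_exists, next_exists and keep are made up for this example, and only the three columns the filter needs are rebuilt here.

```r
# Minimal copy of the sample data (only the columns the filter needs).
df <- data.frame(
  Name = rep(c("Aaron Heilman", "Aaron Laffey", "Alexi Ogando"), times = c(2, 2, 4)),
  Year = c(2005L, 2003L, 2010L, 2009L, 2015L, 2013L, 2012L, 2011L),
  Position = c("RP", "SP", "RP", "SP", "RP", "SP", "RP", "SP"),
  stringsAsFactors = FALSE
)

# Build "Name Year" keys so the year lookup stays within one player.
key <- paste(df$Name, df$Year)
prev_exists <- paste(df$Name, df$Year - 1) %in% key  # same player has a row for Year - 1
next_exists <- paste(df$Name, df$Year + 1) %in% key  # same player has a row for Year + 1

# An RP row needs the previous year, an SP row needs the following year.
keep <- (prev_exists & df$Position == "RP") |
        (next_exists & df$Position == "SP")

result <- df[keep, ]
print(result)  # rows 3, 4, 7 and 8 remain
```

Like the dplyr version, this only checks that a neighbouring year exists for the player; it does not check which Position that neighbouring row has.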
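We can use data.table (df1 is the data frame from the question):
library(data.table)
# Start a new group at every "RP" row, then keep the row indices (.I) of groups whose years differ by exactly 1
setDT(df1)[df1[, .I[abs(diff(Year))==1], .(Name, grp = cumsum(Position == "RP"))]$V1]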
# Name Year Position Year_diff FBv ind1 velo_diff
#1: Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
#2: Aaron Laffey 2009 SP NA 87.4 NA 0.0
#3: Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
#4: Alexi Ogando 2011 SP NA 95.1 NA 0.0
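To see what the grouping does, here is a small base R sketch of the intermediate values. It assumes, as in the sample, that each RP row is immediately followed by its SP row; the names grp, gap and good are only used for illustration.

```r
# Minimal copy of the sample data (only the columns the grouping needs).
df1 <- data.frame(
  Name = rep(c("Aaron Heilman", "Aaron Laffey", "Alexi Ogando"), times = c(2, 2, 4)),
  Year = c(2005L, 2003L, 2010L, 2009L, 2015L, 2013L, 2012L, 2011L),
  Position = c("RP", "SP", "RP", "SP", "RP", "SP", "RP", "SP"),
  stringsAsFactors = FALSE
)

# Every "RP" row bumps the counter, so an RP row and the SP row after it share one id.
df1$grp <- cumsum(df1$Position == "RP")
print(df1$grp)  # 1 1 2 2 3 3 4 4

# Absolute year gap inside each two-row block.
gap <- tapply(df1$Year, df1$grp, function(y) abs(diff(y)))
print(gap)      # gaps of 2, 1, 2, 1 for blocks 1 to 4

# Keep only the blocks whose gap is exactly one year.
good <- as.integer(names(gap)[gap == 1])
result <- df1[df1$grp %in% good, ]
print(result)
```

Because the counter depends on row order, the data would need to be sorted so that each RP row precedes its SP row before applying this idea to other data.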
Or, using the same approach with dplyr:
library(dplyr)
df1 %>%
    group_by(Name, grp = cumsum(Position == "RP")) %>%
    filter(abs(diff(Year)) == 1) %>%
    # the two steps below may not be needed
    ungroup() %>%
    select(-grp)
# A tibble: 4 × 7
# Name Year Position Year_diff FBv ind1 velo_diff
# <chr> <int> <chr> <int> <dbl> <lgl> <dbl>
#1 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
#2 Aaron Laffey 2009 SP NA 87.4 NA 0.0
#3 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
#4 Alexi Ogando 2011 SP NA 95.1 NA 0.0
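The filter above relies on every group having exactly two rows, so that diff(Year) returns a single value. If the real data might contain groups of another size, a slightly more defensive variant could look like the sketch below. The ninth row ("Made Up Pitcher") is hypothetical and was added only to create a one-row group; it is not part of the original data.

```r
library(dplyr)

# Sample data plus one hypothetical single-row player (not in the original data).
df1 <- data.frame(
  Name = c(rep(c("Aaron Heilman", "Aaron Laffey", "Alexi Ogando"), times = c(2, 2, 4)),
           "Made Up Pitcher"),
  Year = c(2005L, 2003L, 2010L, 2009L, 2015L, 2013L, 2012L, 2011L, 2014L),
  Position = c("RP", "SP", "RP", "SP", "RP", "SP", "RP", "SP", "RP"),
  stringsAsFactors = FALSE
)

res <- df1 %>%
  group_by(Name, grp = cumsum(Position == "RP")) %>%
  # n() == 2 drops groups that are not an RP/SP pair;
  # all() collapses the year check to a single TRUE/FALSE per group.
  filter(n() == 2, all(abs(diff(Year)) == 1)) %>%
  ungroup() %>%
  select(-grp)

print(res)
```

On the original eight rows this should select the same four rows as the version above; the extra conditions only matter when a group does not have exactly two rows.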