I am trying to perform a comparison between items in subsequent groups in a dataframe - I guess this is pretty easy when you know what you are doing...
My data set can be represented as follows:
set.seed(1)
data <- data.frame(
date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15)),
id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)
Which yields a dataframe that looks like:
date id
1/02/2015 1008
1/02/2015 1009
1/02/2015 1011
1/02/2015 1015
1/02/2015 1008
1/02/2015 1014
1/02/2015 1015
1/02/2015 1012
1/02/2015 1012
1/02/2015 1006
1/02/2015 1008
1/02/2015 1007
1/02/2015 1012
1/02/2015 1009
1/02/2015 1013
2/02/2015 1010
2/02/2015 1013
2/02/2015 1015
2/02/2015 1009
2/02/2015 1013
2/02/2015 1015
2/02/2015 1008
2/02/2015 1012
2/02/2015 1007
2/02/2015 1008
2/02/2015 1009
2/02/2015 1006
2/02/2015 1009
2/02/2015 1014
2/02/2015 1009
2/02/2015 1010
3/02/2015 1011
3/02/2015 1010
3/02/2015 1007
3/02/2015 1014
3/02/2015 1012
3/02/2015 1013
3/02/2015 1007
3/02/2015 1013
3/02/2015 1010
Then I want to group the data by date (group_by) and then filter out duplicates (distinct) before comparing between the groups. What I want to do is determine from day to day which new id's are added and which id's leave. So day 1 and day 2 would be compared to determine the id's in day 2 that were not in day 1 and the id's that were in day 1 but not present in day 2, then do the same comparisons between day 2 and day 3 etc.
The comparison can be done very easily using an anti_join (dplyr) but I don't know how to reference individual groups in the dataset.
My attempt (or one of my attempts) looks like:
data %>%
group_by(date) %>%
distinct(id) %>%
do(lost = anti_join(., lag(.), by="id"))
But of course this does not work, I just get:
Error in anti_join_impl(x, y, by$x, by$y) : Can't join on 'id' x 'id' because of incompatible types (factor / logical)
Is what I am attempting to do even possible or should I be looking at writing a clunky function to do it?
I'm sure I don't get to vote for my own answer but I must say that I like mine the best. I was hoping to get an answer that used the dplyr tools to solve the problem so I kept researching and I think I now have a (semi) elegant solution (apart from the for loop in my function).
Generating the sample data set the same way but with more data to make it more interesting:
set.seed(1)
data <- data.frame(
date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15), rep('2015-02-04',15), rep('2015-02-05',15)),
id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)
Searching through the interweb I found the dplyr function 'nest()' which looked to solve all my grouping issues. The nest() function takes the groups created by group_by() and rolls them into a list of data frames so you end up with one entry for each variable you have grouped on and then a data frame for all of the remaining variables that fit into that group - here it is:
dataNested <- data %>%
group_by(date) %>%
distinct(id) %>%
nest()
Which yields a fairly strange dataframe that looks like:
date data
1 2015-02-01 list(id = c(3, 4, 6, 10, 9, 7, 1, 2, 8))
2 2015-02-02 list(id = c(5, 8, 10, 4, 3, 7, 2, 1, 9))
3 2015-02-03 list(id = c(6, 5, 2, 9, 7, 8))
4 2015-02-04 list(id = c(1, 5, 8, 7, 9, 3, 4, 6, 10))
5 2015-02-05 list(id = c(3, 5, 4, 7, 8, 1, 9))
Whereby the indexes in the lists reference a list of the id's (strange but true).
This now allows us to reference the groups by index number viz:
dataNested$data[[2]]
returns:
# A tibble: 9 × 1
id
<fctr>
1 1010
2 1013
3 1015
4 1009
5 1008
6 1012
7 1007
8 1006
From here it's a simple matter of writing a function that will do the anti_join to leave us with just the differences between each subsequent group (though this is the part I'm not proud of and really starts to show my lack of R skills - please feel free to suggest improvements):
## Function departed() - returns the id's that were dropped from each subsequent time period
departed <- function(groups) {
tempList <- vector("list", nrow(groups))
# Loop through the groups and do an anti_join between each
for (i in seq(1, nrow(groups) - 1)) {
tempList[[i + 1]] <-
anti_join(data.frame(groups$data[[i]]), data.frame(groups$data[[i + 1]]), by = "id")
}
return(tempList)
}
Applying this function to our nested data yields the list of lists of departed id's:
> departedIDs <- dataNested %>% departed()
> departedIDs
[[1]]
NULL
[[2]]
id
1 1011
[[3]]
id
1 1006
2 1008
3 1009
4 1015
[[4]]
id
1 1007
[[5]]
id
1 1011
2 1015
I hope this answer will help others who's brain works the same way as mine.
Just add the input stringsAsFactors = FALSE
to your dataframe. This would make your code run: Although am not sure whether the outputted result is the one you are looking for. To view the whole result, pipe it into a data.frame and see whether it is what you are looking for. Hope this helps.
set.seed(1)
data <- data.frame(
date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-3',15)),
id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE))),stringsAsFactors = FALSE)
data %>%
group_by(date) %>%
distinct(id) %>%
do(lost = anti_join(., lag(.), by="id"))%>%data.frame()
some manipulation on data and a merge might do what you want. Something like this
df <- unique(data)
df$date <- as.Date(df$date)
df$leftdate <- df$date + 1
df$prevdate <- df$date - 1
df2 <- cbind(df[,c("date","id")],flag = 1)
# merge the dataframe so that each day would attempt to join the next day
dfleft <- merge(df,df2,by.x = c("leftdate","id"),by.y = c("date","id"),all.x= TRUE)
# if there is no common id between a day and the next day, the merge returns NA, which is the desired results for those who left
dfleft <- dfleft[is.na(dfleft$flag),c("leftdate","id")]
# Here, you reverse the logic to find those who show up today but weren't there yesterday
dfnew <- merge(df,df2,by.x = c("prevdate","id"),by.y = c("date","id"),all.x= TRUE)
dfnew <- dfnew[is.na(dfnew$flag),c("date","id")]
My understanding from the question is that data shows the id's at each date and we want to iterate through the dates comparing the ids in that date to the ids in the immediately prior date.
First get the unique rows u
and convert the id
to numeric. Then split id
by date
giving s
and define a function diffs
which produces a numeric vector of added id's using negative numbers for removed id's. lapply
that to seq_along(s) except for the first component as it has no prior component. No packages are used.
u <- unique(data)
u$id <- as.numeric(as.character(u$id))
s <- split(u$id, u$date)
diffs <- function(i) c(setdiff(s[[i]], s[[i-1]]), - setdiff(s[[i-1]], s[[i]]))
diffs_list <- setNames(lapply(seq_along(s)[-1], diffs), names(s)[-1])
giving:
> diffs_list
$`2015-02-02`
[1] 1010 -1011
$`2015-02-03`
[1] 1011 -1015 -1009 -1008 -1006
or if you want a data frame as output
setNames(stack(diffs_list), c("id", "date"))
giving:
id date
1 1010 2015-02-02
2 -1011 2015-02-02
3 1011 2015-02-03
4 -1015 2015-02-03
5 -1009 2015-02-03
6 -1008 2015-02-03
7 -1006 2015-02-03
magrittr
This could also be expressed using the magrittr package like this where diffs
is defined above.
library(magrittr)
data %>%
unique %>%
transform(id = as.numeric(as.character(id))) %>%
{ split(.$id, .$date) } %>%
{ setNames(lapply(seq_along(.)[-1], diffs), names(.)[-1]) }
Note: I have replaced -3 in data$date
with -03.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.