Summarising longitudinal data with dplyr

Question

I have a dataframe which looks something like this:

set.seed(100)

library(dplyr)

df <- tibble(ID = rep(1:4, each = 2),
             weight = rep(abs(rnorm(4, 5, 3)), each = 2),
             year = rep(2013:2014, 4),
             var1 = sample(1:5, 8, replace = TRUE),
             var2 = sample(1:5, 8, replace = TRUE))

Producing data which looks like this:

# A tibble: 8 x 5
     ID   weight  year  var1  var2
  <int>    <dbl> <int> <int> <int>
1     1 3.493423  2013     3     2
2     1 3.493423  2014     1     2
3     2 5.394593  2013     4     2
4     2 5.394593  2014     5     4
5     3 4.763249  2013     2     3
6     3 4.763249  2014     2     4
7     4 7.660354  2013     4     3
8     4 7.660354  2014     4     4

I want to make quick, simple inferences about how things change from one year to the next. The ID variable is a unique identifier for each person in my longitudinal sample.

My idea is to use group_by(ID) to group the data by person, and then perhaps use the summarise function in some way. I want the "collapse" effect that summarise gives: one row per person.

For example, say I want to see if var1 remains the same across the two years, by person. We see above this is true of persons 3 and 4. I would like to be able to obtain the following dataframe:

# A tibble: 4 x 3
     ID   weight indicator
  <int>    <dbl>     <lgl>
1     1 3.493423     FALSE
2     2 5.394593     FALSE
3     3 4.763249      TRUE
4     4 7.660354      TRUE

Or, say I want to see the difference in var2 from 2013 to 2014. Then I would want the following dataframe:

# A tibble: 4 x 3
     ID   weight diff_var2
  <int>    <dbl>     <dbl>
1     1 3.493423         0
2     2 5.394593         2
3     3 4.763249         1
4     4 7.660354         1

Does anyone have any ideas on how to go about this? I don't know how this would generalise to more years of data, but for the time being I am simply working with two years of longitudinal data.

Ultimately, for example, I would like to know the weighted proportion of people whose var1 does not change, or the weighted mean movement in var2 etc. These are just some examples of the sorts of queries I am looking into.

Answer 1

You've pretty much already laid out what you need to do. Group by both ID and weight if you want to keep the weight column in the result.

df %>% group_by(ID, weight) %>%
    # TRUE when var1 takes a single value within the person
    summarise(indicator = n_distinct(var1) == 1,
              diff_var2 = diff(var2))

## Source: local data frame [4 x 4]
## Groups: ID [?]
## 
##      ID   weight indicator diff_var2
##   <int>    <dbl>     <lgl>     <int>
## 1     1 3.493423     FALSE         0
## 2     2 5.394593     FALSE         2
## 3     3 4.763249      TRUE         1
## 4     4 7.660354      TRUE         1
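
The question also asks about weighted summaries across people. Here is a minimal sketch of one way to get them from the collapsed table, assuming weight is constant within each ID and the rows are already sorted by year. The data are typed in from the tibble printed in the question rather than regenerated from the seed, and the names per_person and overall are just illustrative.

```r
library(dplyr)

# Data copied from the tibble printed in the question
df <- tibble(
  ID     = rep(1:4, each = 2),
  weight = rep(c(3.493423, 5.394593, 4.763249, 7.660354), each = 2),
  year   = rep(2013:2014, 4),
  var1   = c(3L, 1L, 4L, 5L, 2L, 2L, 4L, 4L),
  var2   = c(2L, 2L, 2L, 4L, 3L, 4L, 3L, 4L)
)

# One row per person, as in the answer above
per_person <- df %>%
  group_by(ID, weight) %>%
  summarise(indicator = n_distinct(var1) == 1,
            diff_var2 = diff(var2),
            .groups = "drop")

# Collapse again across people, weighting each person by `weight`
overall <- per_person %>%
  summarise(prop_unchanged = weighted.mean(as.numeric(indicator), weight),
            mean_diff_var2 = weighted.mean(diff_var2, weight))

overall
```

Here prop_unchanged is the weighted share of people whose var1 did not change, and mean_diff_var2 is the weighted mean change in var2.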

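This relies on each ID having exactly two rows, sorted by year: diff(var2) yields a single value per ID only in that case. If you have more than two years or missing data, you will need a more robust approach.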
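
As a rough sketch of what a more robust version might look like, the code below sorts by year explicitly, compares the last observation with the first instead of calling diff(), and ignores missing var1 values when checking for change. The three-year data frame df_long is made up purely for illustration, and "last minus first" is only one possible definition of change over several years.

```r
library(dplyr)

# Hypothetical three-year data with one missing var1 value (illustration only)
df_long <- tibble(
  ID     = rep(1:3, each = 3),
  weight = rep(c(2, 3, 5), each = 3),
  year   = rep(2013:2015, 3),
  var1   = c(3L, 3L, 3L, 4L, 5L, 4L, 2L, NA, 2L),
  var2   = c(2L, 2L, 5L, 2L, 4L, 1L, 3L, 4L, 4L)
)

changes <- df_long %>%
  arrange(ID, year) %>%                 # make the within-person order explicit
  group_by(ID, weight) %>%
  summarise(
    # TRUE when all non-missing var1 values are identical
    indicator = n_distinct(var1, na.rm = TRUE) == 1,
    # change from the earliest to the latest observed year
    diff_var2 = last(var2) - first(var2),
    .groups = "drop"
  )

changes
```

Whether a missing value should count as "no change" is a judgment call; dropping na.rm = TRUE would instead treat NA as a distinct value.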

Answer 2

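We can use data.table, grouping by ID. Note that this drops the weight column; add weight to the grouping if you want to keep it.

 library(data.table)
 setDT(df)[, .(indicator = uniqueN(var1) == 1, diff_var2 = diff(var2)), by = ID]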
 #   ID indicator diff_var2
 #1:  1     FALSE         0
 #2:  2     FALSE         2
 #3:  3      TRUE         1
 #4:  4      TRUE         1
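
To get the weight column the question asks for, one option is to group by both columns. This is a sketch along the same lines as the call above, with the data typed in from the tibble printed in the question; the names dt and res are illustrative.

```r
library(data.table)

# Data copied from the tibble printed in the question
dt <- data.table(
  ID     = rep(1:4, each = 2),
  weight = rep(c(3.493423, 5.394593, 4.763249, 7.660354), each = 2),
  year   = rep(2013:2014, 4),
  var1   = c(3L, 1L, 4L, 5L, 2L, 2L, 4L, 4L),
  var2   = c(2L, 2L, 2L, 4L, 3L, 4L, 3L, 4L)
)

# Grouping by ID and weight keeps weight in the result
res <- dt[, .(indicator = uniqueN(var1) == 1,
              diff_var2 = diff(var2)),
          by = .(ID, weight)]

res
```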
