
Summarising longitudinal data with dplyr

I have a dataframe which looks something like this:

set.seed(100)

library(dplyr)

df <- tibble(ID = rep(1:4, each = 2),
              weight = rep(abs(rnorm(4, 5, 3)), each = 2),
              year = rep(2013:2014, 4),
              var1 = sample(1:5, 8, rep = TRUE),
              var2 = sample(1:5, 8, rep = TRUE))

Producing data which looks like this:

# A tibble: 8 x 5
     ID   weight  year  var1  var2
  <int>    <dbl> <int> <int> <int>
1     1 3.493423  2013     3     2
2     1 3.493423  2014     1     2
3     2 5.394593  2013     4     2
4     2 5.394593  2014     5     4
5     3 4.763249  2013     2     3
6     3 4.763249  2014     2     4
7     4 7.660354  2013     4     3
8     4 7.660354  2014     4     4

I wish to make quick, simple inference on how things are changing from one year to the next. The ID variable is a unique identifier for each person in my longitudinal sample.

My idea is to use group_by(ID) to group the data by person, and then make use of the summarise function in some way. I want the "collapse" effect that summarise gives, with one row per person.

For example, say I want to see if var1 remains the same across the two years, by person. We see above this is true of persons 3 and 4. I would like to be able to obtain the following dataframe:

# A tibble: 4 x 3
     ID   weight indicator
  <int>    <dbl>     <lgl>
1     1 3.493423     FALSE
2     2 5.394593     FALSE
3     3 4.763249      TRUE
4     4 7.660354      TRUE

Or, if I wanted to see the difference in var2 from 2013 to 2014, I would want the following dataframe:

# A tibble: 4 x 3
     ID   weight diff_var2
  <int>    <dbl>     <dbl>
1     1 3.493423         0
2     2 5.394593         2
3     3 4.763249         1
4     4 7.660354         1

Does anyone have any ideas on how to go about this? I don't know how this would generalise to more years of data, but for the time being I am simply working with two years of longitudinal data.

Ultimately, for example, I would like to know the weighted proportion of people whose var1 does not change, or the weighted mean movement in var2 etc. These are just some examples of the sorts of queries I am looking into.

You've pretty much already laid out what you need to do. Group by both ID and weight if you want to keep the weight column in the result:

df %>% group_by(ID, weight) %>% 
    summarise(indicator = n_distinct(var1) == 1,  # TRUE when var1 never changes for this person
              diff_var2 = diff(var2))             # second row minus first, i.e. 2014 minus 2013 here

## Source: local data frame [4 x 4]
## Groups: ID [?]
## 
##      ID   weight indicator diff_var2
##   <int>    <dbl>     <lgl>     <int>
## 1     1 3.493423     FALSE         0
## 2     2 5.394593     FALSE         2
## 3     3 4.763249      TRUE         1
## 4     4 7.660354      TRUE         1
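
The question also asks about the weighted proportion of people whose var1 does not change and the weighted mean movement in var2. A minimal sketch of one way to get both from the per-person table, using base R's weighted.mean. The data frame is rebuilt from the printed values in the question, and the names per_person and overall are only illustrative:

```r
library(dplyr)

# Data copied from the question's printed output, so no random seed is needed
df <- tibble(ID     = rep(1:4, each = 2),
             weight = rep(c(3.493423, 5.394593, 4.763249, 7.660354), each = 2),
             year   = rep(2013:2014, 4),
             var1   = c(3L, 1L, 4L, 5L, 2L, 2L, 4L, 4L),
             var2   = c(2L, 2L, 2L, 4L, 3L, 4L, 3L, 4L))

# One row per person: did var1 stay the same, and how far did var2 move?
per_person <- df %>%
  group_by(ID, weight) %>%
  summarise(indicator = n_distinct(var1) == 1,
            diff_var2 = diff(var2)) %>%
  ungroup()

# Collapse across people, weighting each person by their survey weight
overall <- per_person %>%
  summarise(prop_unchanged = weighted.mean(indicator, weight),
            mean_diff_var2 = weighted.mean(diff_var2, weight))

overall
```

weighted.mean treats the logical indicator as 0/1, so prop_unchanged is the share of total weight held by people whose var1 stayed the same.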

If you have more than two years or missing data, you may need a more robust approach.
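
As a rough sketch of what a more robust version might look like: with three or more years, diff() returns one value per consecutive pair rather than a single number per person, so the example below sorts by year and compares the earliest and latest observed values instead. The data frame df_long is hypothetical (three years, with one person missing 2014), and "earliest to latest" is only one possible definition of change:

```r
library(dplyr)

# Hypothetical data: three years, and person 2 has no 2014 record
df_long <- tibble(ID     = c(1, 1, 1, 2, 2),
                  weight = c(2, 2, 2, 3, 3),
                  year   = c(2013, 2014, 2015, 2013, 2015),
                  var1   = c(3, 3, 3, 4, 5),
                  var2   = c(2, 2, 5, 1, 2))

robust <- df_long %>%
  arrange(ID, year) %>%                       # so first()/last() mean earliest/latest year
  group_by(ID, weight) %>%
  summarise(n_years   = n_distinct(year),     # how many years were actually observed
            indicator = n_distinct(var1) == 1,     # TRUE only if var1 never changes
            diff_var2 = last(var2) - first(var2)) %>%  # latest minus earliest observed value
  ungroup()

robust
```

Keeping n_years in the result makes it easy to filter out people with incomplete records before computing any weighted summaries.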

We can also use data.table. This version groups by ID only, so the weight column is not carried through:

 library(data.table)
 setDT(df)[, .(indicator=uniqueN(var1)==1, diff_var2= diff(var2)), ID]
 #   ID indicator diff_var2
 #1:  1     FALSE         0
 #2:  2     FALSE         2
 #3:  3      TRUE         1
 #4:  4      TRUE         1
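
If the weight column is needed in the data.table result as well, one option is to group by both columns, mirroring the dplyr answer. A small sketch, with the data rebuilt from the printed values in the question and the names dt and res chosen only for illustration:

```r
library(data.table)

# Data copied from the question's printed output
dt <- data.table(ID     = rep(1:4, each = 2),
                 weight = rep(c(3.493423, 5.394593, 4.763249, 7.660354), each = 2),
                 year   = rep(2013:2014, 4),
                 var1   = c(3L, 1L, 4L, 5L, 2L, 2L, 4L, 4L),
                 var2   = c(2L, 2L, 2L, 4L, 3L, 4L, 3L, 4L))

# Grouping by ID and weight keeps weight as a column in the result
res <- dt[, .(indicator = uniqueN(var1) == 1,
              diff_var2 = diff(var2)),
          by = .(ID, weight)]

res
```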

