
Imputation for longitudinal data using observation before and after missing data

I'm in the process of cleaning some longitudinal data and I have several missing cases. I am trying to use an imputation that incorporates observations before and after the missing case. I'm wondering how I can go about addressing the issues detailed below.

I've been trying to break the problem into smaller, more manageable operations and objects. However, every solution I come up with requires conditional logic based on the rows immediately above and below a missing value, and, quite frankly, I'm at a loss as to how to do this. I would appreciate any guidance on a technique I could use or experiment with, or any search terms that would help me find a solution.

The details are below:

#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

In the target output below, the imputed values are the only changes from the dataset above.

The goal is to replace the NA value for ID #1 (variable ss) with the mean of the value before it (3) and the value after it (0), so that the data look like this:
1,3,2,3,1.5,0,0

ID# 2 (variable ss) should look like this:
2,4,0,0,0,0,0

ID #3 (variable ss) should use a last observation carried forward approach, so it would need to look like this:
4,1,2,4,2,3,3

ID #4 (variable ss) has two consecutive NA values and should not be changed. It will be flagged for a different analysis later in my project. So, it should look like this:
2,1,0,NA,NA,0,0 (no change).
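
The four rules above can be written down directly in base R. What follows is only a minimal sketch, not a definitive solution: the helper name fill_single_gaps and the column name ss_filled are made up for illustration, and it assumes the rows are already sorted by time within each id. A single NA at the start of a series is not covered by the rules above, so the sketch leaves it unchanged.

```r
id   <- rep(1:4, each = 7)
time <- rep(0:6, times = 4)
ss   <- c(1,3,2,3,NA,0,0, 2,4,0,NA,0,0,0, 4,1,2,4,2,3,NA, 2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

# Hypothetical helper: fill isolated NAs only, leave longer runs alone.
fill_single_gaps <- function(x) {
  n   <- length(x)
  na  <- is.na(x)
  r   <- rle(na)
  # length of the run (of NAs or of non-NAs) that each element belongs to
  run_len <- rep(r$lengths, r$lengths)
  single  <- which(na & run_len == 1)
  out <- x
  for (i in single) {
    if (i > 1 && i < n) {
      out[i] <- (x[i - 1] + x[i + 1]) / 2  # interior gap: mean of neighbours
    } else if (i == n && n > 1) {
      out[i] <- x[i - 1]                   # trailing gap: carry last value forward
    }
    # a single leading NA is left as NA
  }
  out
}

# ave() applies the helper within each id and keeps the original row order
mydat$ss_filled <- ave(mydat$ss, mydat$id, FUN = fill_single_gaps)
```

Because the helper only touches NA runs of length one, the two consecutive NA values for ID #4 stay missing and can be flagged later.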

I use the smwrBase package. The call below fills a gap only when it is a single missing value, but it does not account for id.

smwrBase::fillMissing(ss, max.fill=1)

The zoo package may be more standard, but it has the same limitation: it does not account for id.

zoo::na.approx(ss, maxgap=1)
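
To see why ignoring id matters, here is a small illustration using the same ss vector. It assumes the zoo package is installed; the variable name ungrouped is made up for this example.

```r
library(zoo)  # assumption: zoo is installed

ss <- c(1,3,2,3,NA,0,0, 2,4,0,NA,0,0,0, 4,1,2,4,2,3,NA, 2,1,0,NA,NA,0,0)

# Interpolating the whole column at once treats it as one long series.
ungrouped <- na.approx(ss, maxgap = 1, na.rm = FALSE)

# Position 21 is the last observation for id 3. Here it is interpolated
# between the last observed value of id 3 (position 20) and the first
# value of id 4 (position 22), so it mixes two different subjects.
ungrouped[20:22]
```

The filled value at position 21 is the average of 3 and 2 rather than the carried-forward 3 that was wanted for ID #3, which is why the interpolation has to be run separately for each id.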

Below is an approach that accounts for the variable id. Interpolation functions such as na.approx do not fill a trailing missing value, so I added a manual if statement for that case. It is a bit brute force; there may be a tapply-style approach that does the same thing.

> id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
> time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
> ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
> mydat <- data.frame(id, time, ss, ss2=NA_real_)
> for (i in unique(mydat$id)) {
+   idx  <- which(mydat$id == i)      # rows belonging to this id
+   last <- idx[length(idx)]          # row of the final observation
+   prev <- idx[length(idx) - 1]      # row of the observation before it
+   # interpolate interior gaps of at most one missing value
+   mydat$ss2[idx] <- zoo::na.approx(mydat$ss[idx], maxgap=1, na.rm=FALSE)
+   # a trailing NA has no following value: carry the previous value forward
+   if (length(idx) > 1 && is.na(mydat$ss2[last])) {
+     mydat$ss2[last] <- mydat$ss2[prev]
+   }
+ }
> mydat
   id time ss ss2
1   1    0  1 1.0
2   1    1  3 3.0
3   1    2  2 2.0
4   1    3  3 3.0
5   1    4 NA 1.5
6   1    5  0 0.0
7   1    6  0 0.0
8   2    0  2 2.0
9   2    1  4 4.0
10  2    2  0 0.0
11  2    3 NA 0.0
12  2    4  0 0.0
13  2    5  0 0.0
14  2    6  0 0.0
15  3    0  4 4.0
16  3    1  1 1.0
17  3    2  2 2.0
18  3    3  4 4.0
19  3    4  2 2.0
20  3    5  3 3.0
21  3    6 NA 3.0
22  4    0  2 2.0
23  4    1  1 1.0
24  4    2  0 0.0
25  4    3 NA  NA
26  4    4 NA  NA
27  4    5  0 0.0
28  4    6  0 0.0

The interpolated value for id=1 is 1.5 (the average of 3 and 0), for id=2 it is 0 (the average of 0 and 0), and for id=3 it is 3 (the preceding value, since there is no following value). The two consecutive NA values for id=4 are left unchanged.
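
The per-id loop can probably be shortened with base R's ave(), which applies a function within each group and returns the results in the original row order. This is only a sketch under the same assumptions as the loop (zoo installed, rows sorted by time within id); the helper name fill_one and the column name ss3 are made up for illustration.

```r
library(zoo)  # assumption: zoo is installed

id   <- rep(1:4, each = 7)
time <- rep(0:6, times = 4)
ss   <- c(1,3,2,3,NA,0,0, 2,4,0,NA,0,0,0, 4,1,2,4,2,3,NA, 2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

# Hypothetical helper: same two steps as the loop, for one id at a time.
fill_one <- function(x) {
  # interpolate interior gaps of at most one missing value
  out <- na.approx(x, maxgap = 1, na.rm = FALSE)
  n <- length(out)
  # trailing NA: carry the previous value forward (stays NA if that is NA too)
  if (n > 1 && is.na(out[n])) out[n] <- out[n - 1]
  out
}

mydat$ss3 <- ave(mydat$ss, mydat$id, FUN = fill_one)
```

On the example data this should give the same column as ss2 in the loop output above, with the grouping handled by ave() instead of explicit subsetting.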
