
How to fill values between two factors in R?

How can I fill the 'duration' column with 1 between the 'start' and 'end' indicators, as in the example below?

In Stata it would be:

by id (year), sort: gen duration=1 if start==1
by id (year), sort: replace duration=1 if duration[_n-1]==1 & end!=1

How could I do this in R, possibly using dplyr?

id  year    start   end 
1   2000    0       0   
1   2001    1       0   
1   2002    0       0   
1   2003    0       1   
1   2004    0       0   
2   2000    0       0   
2   2001    0       0   
2   2002    1       0   
2   2003    0       0   
2   2004    0       1   

The desired output would be:

id  year    start   end duration
1   2000    0       0   0
1   2001    1       0   1
1   2002    0       0   1
1   2003    0       1   0
1   2004    0       0   0
2   2000    0       0   0
2   2001    0       0   0
2   2002    1       0   1
2   2003    0       0   1
2   2004    0       1   0

Using dplyr, this seems to do the trick. First, the sample data:

dd<-read.table(text="id  year    start   end 
1   2000    0       0   
1   2001    1       0   
1   2002    0       0   
1   2003    0       1   
1   2004    0       0   
2   2000    0       0   
2   2001    0       0   
2   2002    1       0   
2   2003    0       0   
2   2004    0       1", header=TRUE)

Now we just group by id, then use cumsum() on start - end: the running total rises to 1 at a start and drops back to 0 at an end.

library(dplyr)
dd %>% group_by(id) %>% mutate(duration = cumsum(start-end))

#       id  year start   end duration
#    (int) (int) (int) (int)    (int)
# 1      1  2000     0     0        0
# 2      1  2001     1     0        1
# 3      1  2002     0     0        1
# 4      1  2003     0     1        0
# 5      1  2004     0     0        0
# 6      2  2000     0     0        0
# 7      2  2001     0     0        0
# 8      2  2002     1     0        1
# 9      2  2003     0     0        1
# 10     2  2004     0     1        0
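The Stata code sorts by year within id before filling, and cumsum() likewise depends on row order. A minimal sketch of the same approach with an explicit sort, assuming the sample data from the question; the shuffled data frame is a hypothetical input used only to show that the sort matters:

```r
library(dplyr)

# Sample data from the question
dd <- data.frame(
  id    = rep(1:2, each = 5),
  year  = rep(2000:2004, times = 2),
  start = c(0, 1, 0, 0, 0, 0, 0, 1, 0, 0),
  end   = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1)
)

# Hypothetical unsorted input: the same rows in a different order
shuffled <- dd[c(10, 3, 7, 1, 5, 9, 2, 8, 4, 6), ]

result <- shuffled %>%
  arrange(id, year) %>%                      # same role as Stata's "by id (year), sort"
  group_by(id) %>%
  mutate(duration = cumsum(start - end)) %>% # running total within each id
  ungroup()

result$duration
# 0 1 1 0 0 0 0 1 1 0
```

Without the arrange() step, the running total would follow whatever order the rows happen to be in.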

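Using logic similar to the Stata code you provided. Stata's replace runs row by row, so the value is carried forward with a loop here rather than with a single lag() call, which would fill only the row right after each start:

#Make data
df <- data.frame("id" = c(1,1,1,1,1,2,2,2,2,2),
             "year" = c(2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004),
             "start" = c(0,1,0,0,0,0,0,1,0,0),
             "end" = c(0,0,0,1,0,0,0,0,0,1))

#Order by ID and year
df <- df[order(df$id, df$year), ]

#Make new variable: 1 on each start row
df$duration <- 0
df$duration[df$start == 1 & df$end != 1] <- 1

#Carry the 1 forward within each id until an end row is reached
for (i in seq_len(nrow(df))[-1]) {
  if (df$id[i] == df$id[i - 1] && df$duration[i - 1] == 1 && df$end[i] != 1) {
    df$duration[i] <- 1
  }
}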
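The sample data only has spells that last two rows, which hides a difference between one vectorized lag step and Stata's row-by-row replace. A small sketch on hypothetical data (the data frame long, with a spell running from 2001 to 2005, is an assumption and not part of the question) compares the two:

```r
# Hypothetical data: one id, spell starts in 2001 and ends in 2005
long <- data.frame(
  id    = rep(1, 6),
  year  = 2000:2005,
  start = c(0, 1, 0, 0, 0, 0),
  end   = c(0, 0, 0, 0, 0, 1)
)

# One vectorized pass: shift the start flags down one row (what a lag does)
one_pass <- as.numeric(long$start == 1 & long$end != 1)
prev <- c(0, head(one_pass, -1))
one_pass[prev == 1 & long$end == 0] <- 1

# Row-by-row carry forward: each row sees the already updated previous row
carried <- as.numeric(long$start == 1 & long$end != 1)
for (i in seq_along(carried)[-1]) {
  if (carried[i - 1] == 1 && long$end[i] != 1) carried[i] <- 1
}

one_pass
# 0 1 1 0 0 0
carried
# 0 1 1 1 1 0
```

Only the carried version fills every year between the start and the end, so the single-pass form is safe only when no spell is longer than two rows.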

We can also use base R: ave() applies cumsum within each id (df1 here is the sample data frame from the question):

df1$duration <- with(df1, ave(start-end, id, FUN = cumsum))
df1
#   id year start end duration
#1   1 2000     0   0        0
#2   1 2001     1   0        1
#3   1 2002     0   0        1
#4   1 2003     0   1        0
#5   1 2004     0   0        0
#6   2 2000     0   0        0
#7   2 2001     0   0        0
#8   2 2002     1   0        1
#9   2 2003     0   0        1
#10  2 2004     0   1        0
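The cumsum approach relies on starts and ends alternating within each id. A quick sanity check, sketched under the assumption that df1 holds the sample data and is already sorted by id and year, is to confirm that the running total never leaves 0 or 1:

```r
# Sample data from the question, already sorted by id and year
df1 <- data.frame(
  id    = rep(1:2, each = 5),
  year  = rep(2000:2004, times = 2),
  start = c(0, 1, 0, 0, 0, 0, 0, 1, 0, 0),
  end   = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1)
)

# Running total of start - end within each id
df1$duration <- with(df1, ave(start - end, id, FUN = cumsum))

# Any value other than 0 or 1 points to a repeated start,
# or to an end that has no preceding start
valid <- all(df1$duration %in% c(0, 1))
valid
# TRUE
```

If valid comes back FALSE, the indicators need cleaning before duration can be read as a 0/1 flag.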
