Transform data to start-stop / long format
I want to do a survival analysis with a Cox regression. To do that, I need to transform my data so that each row covers a start-stop interval instead of a single observation time.
Example dataset:
userid is the user identifier
day is the number of days since the first event
status indicates whether the event of interest happened on that day (1 = yes, 0 = no)
da1 <- data.frame(userid = c(1,1,1,2,2,2,3,3,3), day= c(1,2,3,1,2,3,1,2,3), status = c(0,0,1,1,0,0,0,1,1))
da1
  userid day status
1      1   1      0
2      1   2      0
3      1   3      1
4      2   1      1
5      2   2      0
6      2   3      0
7      3   1      0
8      3   2      1
9      3   3      1
I want to have my data in this format:
da2 <- data.frame(userid = c(1,1,1,2,2,2,3,3,3), startday= c(0,1,2,0,1,2,0,1,2), endday = c(1,2,3,1,2,3,1,2,3), status = c(0,0,1,1,0,0,0,1,1))
da2
  userid startday endday status
1      1        0      1      0
2      1        1      2      0
3      1        2      3      1
4      2        0      1      1
5      2        1      2      0
6      2        2      3      0
7      3        0      1      0
8      3        1      2      1
9      3        2      3      1
It would also be great to have some code that aggregates the observations when several consecutive days pass without an event.
da3 <- data.frame(userid = c(1,1,2,2,3,3,3), startday= c(0,2,0,1,0,1,2), endday = c(2,3,1,3,1,2,3), status = c(0,1,1,0,0,1,1))
da3
  userid startday endday status
1      1        0      2      0
2      1        2      3      1
3      2        0      1      1
4      2        1      3      0
5      3        0      1      0
6      3        1      2      1
7      3        2      3      1
I have tried the following code, but it gives wrong results:
for (i in 1:max(da$userid)) {
  obst <- sort(unique(da$day))
  stpT <- obst[1:which(obst == da$day[i])]
  id <- rep(i, length(stpT))
  stat <- c(rep(0, length(stpT) - 1), da$status[i])
  strT <- lag(stpT, 1); strT[1] = 0
  iln <- stpT - strT
  df <- data.frame(userid = id, Start = strT, Stop = stpT, Status = stat, ILen = iln)
  if (i == 1) {data_obs = df} else {data_obs = rbind(data_obs, df)}
}
data_obs <- merge(data_obs, data[, c('userid', 'X')], by = 'userid')
dim(data_obs)
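One thing that stands out in the loop above is that `i` runs over user ids but is also used as a row index in `da$day[i]` and `da$status[i]`, so it picks up a single row rather than all rows of user `i`. A further possible issue, only if dplyr was not attached when the loop ran, is that `lag()` would then be `stats::lag()`, which changes a time-series attribute and leaves the values unshifted. The sketch below illustrates the indexing point on the example data; it assumes `da` in the loop is meant to be `da1`, and the names `rows_i`, `stpT`, and `strT` are reused here purely for illustration.

```r
da1 <- data.frame(
  userid = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  day    = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  status = c(0, 0, 1, 1, 0, 0, 0, 1, 1)
)

i <- 2

# Indexing by position returns row 2, which belongs to user 1, not user 2
da1$userid[i]
da1$day[i]

# Selecting by user id returns all rows of user 2
rows_i <- da1[da1$userid == i, ]

# Stop times are the observed days; start times are the stop times shifted
# by one position, with 0 as the first start (no lag() needed)
stpT <- rows_i$day
strT <- c(0, head(stpT, -1))
data.frame(userid = i, Start = strT, Stop = stpT, Status = rows_i$status)
```

Shifting with `c(0, head(x, -1))` avoids depending on which `lag()` happens to be on the search path.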
We can group_by userid and create a sequence for startday from 0 to max(day) - 1 within each group, and a sequence for endday from 1 to max(day).
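library(dplyr)
da1 %>%
  group_by(userid) %>%
  # assumes each user's days run 1, 2, ..., max(day) with no gaps,
  # so both sequences have exactly one value per row of the group
  mutate(startday = seq(0, max(day) - 1),
         endday = seq(max(day))) %>%
  select(-day)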
#   userid status startday endday
#    <dbl>  <dbl>    <int>  <int>
# 1      1      0        0      1
# 2      1      0        1      2
# 3      1      1        2      3
# 4      2      1        0      1
# 5      2      0        1      2
# 6      2      0        2      3
# 7      3      0        0      1
# 8      3      1        1      2
# 9      3      1        2      3
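The aggregation asked for in the question can be sketched along the same lines. This is only one possible approach and not part of the original answer: the names `long`, `agg`, and `run` are made up for illustration, `status` is assumed to be coded 0/1, and each interval's start is taken from the previous observed day via `dplyr::lag()` rather than from a fixed sequence, so gaps in `day` are tolerated.

```r
library(dplyr)

da1 <- data.frame(
  userid = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  day    = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  status = c(0, 0, 1, 1, 0, 0, 0, 1, 1)
)

# Step 1: one start-stop row per observation; each interval starts at the
# previous observed day of the same user (0 for the first row)
long <- da1 %>%
  arrange(userid, day) %>%
  group_by(userid) %>%
  mutate(startday = lag(day, default = 0),
         endday   = day) %>%
  ungroup() %>%
  select(userid, startday, endday, status)

# Step 2: collapse runs of consecutive event-free rows within a user.
# A new run begins at the first row, at every event row, and at the row
# directly after an event, so event rows are never merged with anything.
agg <- long %>%
  group_by(userid) %>%
  mutate(run = cumsum(status == 1 | lag(status, default = 1) == 1)) %>%
  group_by(userid, run) %>%
  summarise(startday = first(startday),
            endday   = last(endday),
            status   = max(status),
            .groups  = "drop") %>%
  select(-run)

agg
```

With the example data, `agg` has seven rows; for user 2 the event-free days 2 and 3 collapse into a single interval from 1 to 3, while every event row keeps its own one-day interval.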