
Transform data to start-stop / long format

I want to do a survival analysis with a Cox regression. For that, I need to transform my data from one row per observation day into a start-stop format, with a start time and an end time on each row.

Example dataset:

  • userid is the user identifier

  • day is the number of days since the first event

  • status indicates whether the event of interest happened on that day (1 = yes, 0 = no)

da1 <- data.frame(userid = c(1,1,1,2,2,2,3,3,3), day= c(1,2,3,1,2,3,1,2,3), status = c(0,0,1,1,0,0,0,1,1))

da1
  userid day status
1      1   1      0
2      1   2      0
3      1   3      1
4      2   1      1
5      2   2      0
6      2   3      0
7      3   1      0
8      3   2      1
9      3   3      1

I want my data in this format:

da2 <- data.frame(userid = c(1,1,1,2,2,2,3,3,3), startday= c(0,1,2,0,1,2,0,1,2), endday = c(1,2,3,1,2,3,1,2,3), status = c(0,0,1,1,0,0,0,1,1))

da2
  userid startday endday status
1      1        0      1      0
2      1        1      2      0
3      1        2      3      1
4      2        0      1      1
5      2        1      2      0
6      2        2      3      0
7      3        0      1      0
8      3        1      2      1
9      3        2      3      1

It would also be great to have some code that aggregates the observations when several consecutive days pass without an event, so that each such run becomes a single interval.

da3 <- data.frame(userid = c(1,1,2,2,3,3,3), startday= c(0,2,0,2,0,1,2), endday = c(2,3,1,3,1,2,3), status = c(0,1,1,0,0,1,1))

da3
  userid startday endday status
1      1        0      2      0
2      1        2      3      1
3      2        0      1      1
4      2        2      3      0
5      3        0      1      0
6      3        1      2      1
7      3        2      3      1

I have tried the following code, but it gives wrong results:

for (i in 1:max(da$userid)){
  obst<-sort(unique(da$day))
  stpT<-obst[1:which(obst==da$day[i])]
  id<-rep(i,length(stpT))
  stat<-c(rep(0,length(stpT)-1),da$status[i])                            
  strT<-lag(stpT,1);strT[1]=0 
  iln<-stpT-strT

  df<-data.frame(userid=id,Start=strT,Stop=stpT,Status=stat,ILen=iln)
  if(i==1){data_obs=df}
  else{data_obs=rbind(data_obs,df)}
}

data_obs<-merge(data_obs,data[,c('userid','X')],by='userid')
dim(data_obs)

We can group_by userid and, within each group, create startday as a sequence from 0 to max(day) - 1 and endday as a sequence from 1 to max(day). This assumes that each user has one row for every day from 1 to max(day), with no gaps, as in the example data.

library(dplyr)

da1 %>%
  group_by(userid) %>%
  mutate(startday = seq(0, max(day) - 1), 
         endday = seq(max(day))) %>%
  select(-day)

#  userid status startday endday
#   <dbl>  <dbl>    <int>  <int>
#1      1      0        0      1
#2      1      0        1      2
#3      1      1        2      3
#4      2      1        0      1
#5      2      0        1      2
#6      2      0        2      3
#7      3      0        0      1
#8      3      1        1      2
#9      3      1        2      3
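
The question also asks for a way to collapse consecutive event-free days into one interval. The following is only a sketch of one possible approach, not part of the original answer. It assumes that a run of status 0 rows within a user should become a single row, and that every event row keeps its own interval. The names da3_sketch and run are made up for illustration. It builds startday with dplyr's lag(day, default = 0) rather than seq(), so it should also tolerate missing days.

```r
library(dplyr)

# Example data from the question
da1 <- data.frame(userid = c(1,1,1,2,2,2,3,3,3),
                  day    = c(1,2,3,1,2,3,1,2,3),
                  status = c(0,0,1,1,0,0,0,1,1))

da3_sketch <- da1 %>%
  arrange(userid, day) %>%
  group_by(userid) %>%
  mutate(startday = lag(day, default = 0),  # previous observation day, 0 for the first row
         endday   = day,
         # Hypothetical helper column: a new run starts at every event row
         # and at the first event-free row of a user or after an event
         run = cumsum(status == 1 | lag(status, default = 1) == 1)) %>%
  group_by(userid, run) %>%
  summarise(startday = min(startday),
            endday   = max(endday),
            status   = max(status),
            .groups  = "drop") %>%
  select(-run)

da3_sketch
```

For users 1 and 3 this reproduces the rows of da3 from the question. For user 2 it returns the event-free interval from 1 to 3 instead of 2 to 3, on the assumption that the intervals should stay contiguous so that no follow-up time is dropped.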
