Generating data for survival analysis in R

Question

I have a data frame that records whether an individual took a certain drug each year:

df_og <- data.frame(
  id=c(1,1,1,2,2,2,3,3,3,3),
  year=c(2001,2002,2003,2001,2002,2003,2000,2001,2002,2003),
  med1=c(1,1,1,1,1,0,0,0,0,1),
  med2=c(0,0,0,0,0,1,0,0,1,0),
  med3=c(0,0,0,0,0,0,1,1,0,0)
)

It looks like this:

id  year   med1 med2 med3 
1   2001    1    0    0
1   2002    1    0    0
1   2003    1    0    0
2   2001    1    0    0
2   2002    1    0    0
2   2003    0    1    0
3   2000    0    0    1
3   2001    0    0    1
3   2002    0    1    0
3   2003    1    0    0

The id column holds the subject id, year is the year of observation, and med1, med2 and med3 are dummy variables equal to 1 if the drug was taken that year and 0 otherwise.

I'm trying to create a new data frame:

df <- data.frame(
  id = c(1,2,2,3,3,3),
  time = c(3,2,1,2,1,1),
  failure = c(0,1,0,1,1,0),
  med_group = c(1,1,2,3,2,1))

which looks like this:

  id  time failure med_group
   1   3      0        1
   2   2      1        1
   2   1      0        2
   3   2      1        3
   3   1      1        2
   3   1      0        1

where id is the subject id, time counts the number of consecutive years the subject has been taking a given drug, failure indicates whether the subject switched to another drug at the end of those years, and med_group is the drug the subject was taking.

Examples:

In the first row of df, subject id=1 has taken med1 for 3 consecutive years, so time=3, and never switched to another drug, so failure=0.
In the second row of df, id=2 took med1 for 2 consecutive years and then switched, so time=2, failure=1, med_group=1. The third row covers the single year on med2, the last drug observed for that subject, so time=1, failure=0, med_group=2.

And so on for the others. It's a tricky operation, so I hope the question is clear enough.

Any suggestions are welcome! Cheers

Answer 1

We can reshape the data to long format, drop the rows where value is 0, and replace the last value within each id with 0 to indicate no failure. We then also group by name to count the number of rows in each group and to flag whether a failure occurred.

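library(dplyr)

df_og %>%
  # one row per id, year and drug
  tidyr::pivot_longer(cols = starts_with('med')) %>%
  # keep only the drug actually taken in each year
  filter(value != 0) %>%
  group_by(id) %>%
  # set the last row of each subject to 0: no switch can follow it
  mutate(value = replace(value, n(), 0)) %>%
  # .add = TRUE keeps the id grouping (dplyr >= 1.0.0; older versions used add = TRUE)
  group_by(name, .add = TRUE) %>%
  summarise(time = n(),
            failure = +all(value == 1))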


#     id name   time failure
#  <dbl> <chr> <int>   <int>
#1     1 med1      3       0
#2     2 med1      2       1
#3     2 med2      1       0
#4     3 med1      1       0
#5     3 med2      1       1
#6     3 med3      2       1
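
The pipeline above groups by drug name, so a subject who stopped a drug and later returned to it would have both periods merged into one row. If that can happen in the real data, a run-length approach may be safer. The following is a minimal base R sketch, not part of the original answer. It assumes exactly one drug is recorded per subject-year and that each subject has one row per year with no gaps. The names med_cols, df_sorted, spells_for_id and spells are made up for illustration.

```r
# Example data copied from the question.
df_og <- data.frame(
  id   = c(1,1,1,2,2,2,3,3,3,3),
  year = c(2001,2002,2003,2001,2002,2003,2000,2001,2002,2003),
  med1 = c(1,1,1,1,1,0,0,0,0,1),
  med2 = c(0,0,0,0,0,1,0,0,1,0),
  med3 = c(0,0,0,0,0,0,1,1,0,0)
)

med_cols <- c("med1", "med2", "med3")

# Assumption: exactly one drug is recorded per subject-year.
stopifnot(all(rowSums(df_og[med_cols]) == 1))

# Sort by subject and year so that runs follow calendar order.
df_sorted <- df_og[order(df_og$id, df_og$year), ]

# Collapse the dummies into a single drug index (1, 2 or 3) per row.
df_sorted$med_group <- max.col(as.matrix(df_sorted[med_cols]),
                               ties.method = "first")

# Hypothetical helper: one output row per run of the same drug.
spells_for_id <- function(d) {
  runs <- rle(d$med_group)
  n_runs <- length(runs$lengths)
  data.frame(
    id        = d$id[1],
    time      = runs$lengths,                    # rows (years) in the run
    failure   = c(rep(1L, n_runs - 1L), 0L),     # every run but the last ends in a switch
    med_group = runs$values
  )
}

spells <- do.call(rbind, lapply(split(df_sorted, df_sorted$id), spells_for_id))
rownames(spells) <- NULL
spells
```

On the example data this reproduces the target table from the question, with rows in chronological order within each subject. Note that rle() counts rows rather than calendar years, so a missing year would not break a run unless it is handled separately.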


Answer 1 (accepted, 2020-03-31 09:32:25)
