Generating Data for Survival Analysis in R
I have a data frame that records whether an individual took a certain drug each year:
df_og <- data.frame(
id=c(1,1,1,2,2,2,3,3,3,3),
year=c(2001,2002,2003,2001,2002,2003,2000,2001,2002,2003),
med1=c(1,1,1,1,1,0,0,0,0,1),
med2=c(0,0,0,0,0,1,0,0,1,0),
med3=c(0,0,0,0,0,0,1,1,0,0)
)
which looks like this:
id year med1 med2 med3
1 2001 1 0 0
1 2002 1 0 0
1 2003 1 0 0
2 2001 1 0 0
2 2002 1 0 0
2 2003 0 1 0
3 2000 0 0 1
3 2001 0 0 1
3 2002 0 1 0
3 2003 1 0 0
So the id column shows the subject's ID, year is the year of observation, and the med1, med2 and med3 variables are dummies equal to 1 if the drug was taken and 0 if not.
I'm trying to create a new data frame:
df <- data.frame(
id = c(1,2,2,3,3,3),
time = c(3,2,1,2,1,1),
failure = c(0,1,0,1,1,0),
med_group = c(1,1,2,3,2,1))
which looks like:
id time failure med_group
1 3 0 1
2 2 1 1
2 1 0 2
3 2 1 3
3 1 1 2
3 1 0 1
where id shows the subject ID, time counts the number of consecutive years a subject has been taking a certain drug, failure indicates whether the subject switched to another drug after those years (1) or not (0), and med_group is the drug the subject has been taking.
Examples:
In df, subject id=1 has taken med1 for 3 consecutive years, so time=3, and hasn't switched to another drug, so failure=0.
In df, id=2 took med1 for 2 consecutive years and then switched, so time=2, failure=1, med_group=1. The subject then took med2 for one year without switching again, so time=1, failure=0, and med_group=2. And so on for the others. It's a tricky operation, so I hope the question is clear enough.
Any suggestion will be welcome! Cheers
We can reshape the data to long format, remove the rows where value = 0, and replace the last value in each id with 0 to indicate no failure. We then group_by name as well to count the number of rows in each group and to record whether a failure occurred.
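library(dplyr)

df_og %>%
  # one row per id, year and drug
  tidyr::pivot_longer(cols = starts_with('med')) %>%
  # keep only the drug actually taken each year
  filter(value != 0) %>%
  group_by(id) %>%
  # mark each subject's last observation as 0 (no failure)
  mutate(value = replace(value, n(), 0)) %>%
  # `.add = TRUE` replaces the deprecated `add = TRUE` (dplyr >= 1.0.0)
  group_by(name, .add = TRUE) %>%
  summarise(time = n(),
            failure = +all(value == 1))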
# id name time failure
# <dbl> <chr> <int> <int>
#1 1 med1 3 0
#2 2 med1 2 1
#3 2 med2 1 0
#4 3 med1 1 0
#5 3 med2 1 1
#6 3 med3 2 1
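One caveat with grouping by id and name: all years on the same drug are pooled, so a subject who left a drug and later returned to it would get a single row, and the rows come out in name order rather than chronological order (compare id 3 above with the desired output). A minimal base R sketch that counts consecutive runs with rle() instead might look like the following. It assumes exactly one drug is flagged per subject-year and that the years have no gaps; the names med_cols, long and spells are illustrative and not part of the original answer.

```r
df_og <- data.frame(
  id   = c(1,1,1,2,2,2,3,3,3,3),
  year = c(2001,2002,2003,2001,2002,2003,2000,2001,2002,2003),
  med1 = c(1,1,1,1,1,0,0,0,0,1),
  med2 = c(0,0,0,0,0,1,0,0,1,0),
  med3 = c(0,0,0,0,0,0,1,1,0,0)
)

med_cols <- c("med1", "med2", "med3")

# Assumption: exactly one drug is flagged per row, so the position of the
# single 1 identifies the drug (1, 2 or 3)
long <- df_og[order(df_og$id, df_og$year), ]
long$med_group <- max.col(as.matrix(long[med_cols]), ties.method = "first")

# For each subject, collapse consecutive years on the same drug into one row
spells <- do.call(rbind, lapply(split(long, long$id), function(d) {
  runs <- rle(d$med_group)
  n_runs <- length(runs$lengths)
  data.frame(
    id        = d$id[1],
    time      = runs$lengths,
    # every spell except the subject's last one ended in a switch
    failure   = as.integer(seq_len(n_runs) < n_runs),
    med_group = runs$values
  )
}))
rownames(spells) <- NULL
spells
```

On the example data this gives the six rows of the desired data frame, in chronological order within each subject.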