Estimating density from interval [start, stop] data in R

Question

Description

This question is motivated by clinical and epidemiological research, where studies often enroll patients and then follow them for variable lengths of time.

The distribution of age at study entry is often of interest and is easily assessed. However, there is occasionally interest in the distribution of age at any time during the study.

My question: is there a method for estimating such a density from interval data such as [age_start, age_stop] without expanding the data as below? The long-format method seems inelegant, to say nothing of its memory usage!

Reproducible example using data from the survival package

#### Prep Data ###
library(survival)
library(ggplot2)
library(dplyr)

data(colon, package = 'survival')
# example using the colon dataset from the survival package
ccdeath <- colon %>%
  # use data on time to death (not recurrence)
  filter(etype == 2) %>%
  # age at end of follow-up (death or censoring)
  mutate(age_last = age + (time / 365.25))

#### Distribution Using Single Value ####
# age at study entry
ggplot(ccdeath, aes(x = age)) +
  geom_density() +
  labs(title = "Figure 1: Density of age at study entry",
       x = "Age at Entry (years)",
       y = "Density")

#### Using Person-Month Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age, age_last, status) ~ ., 
                        data = ccdeath,
                        cut = seq(from = floor(min(ccdeath$age)),
                                  to = ceiling(max(ccdeath$age_last)),
                                  by = 1/12))

nrow(ccdeath_cp) # over 50,000 rows

# distribution of age at person-month level
ggplot(ccdeath_cp, aes(x = age)) +
  geom_density() +
  labs(title = "Figure 2: Density based on approximate person-months",
       x = "Age (years)",
       y = "Density")

#### Using Person-Day Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age, age_last, status) ~ ., 
                        data = ccdeath,
                        cut = seq(from = floor(min(ccdeath$age)),
                                  to = ceiling(max(ccdeath$age_last)),
                                  by = 1/365.25))

nrow(ccdeath_cp) # over 1.5 million rows!

# distribution of age at person-day level
ggplot(ccdeath_cp, aes(x = age)) +
  geom_density() +
  labs(title = "Figure 3: Density based on person-days",
       x = "Age (years)",
       y = "Density")

Figure 3

Note: I tagged this question with "survival" because I thought it would attract people familiar with this area, but I am not interested in time-to-event here, just the overall age distribution across all time under study.

Answer 1

Rather than calculating for finer and finer time intervals, you can just keep a cumulative count of the number of patients at a particular age:

library(data.table)

# convert ccdeath to a data.table by reference
setDT(ccdeath)
x <- rbind(
  ccdeath[, .(age = age, num_patients = 1)],
  ccdeath[, .(age = age_last, num_patients = -1)]
)[, .(num_patients = sum(num_patients)), keyby = age]

cccdeath <- x[x[, .(age = unique(age))], on = 'age']
cccdeath[, num_patients := cumsum(num_patients)]
ggplot(cccdeath, aes(x = age, y = num_patients)) + geom_step()

The sawtooth pattern arises because every patient is assumed to start at an integer age. I had some thoughts about how you'd smooth this and came up with this idea: assign equal probabilities to a set of evenly spaced ages between the given age and age+1. You get something like this:

smooth_param <- 12
x <- rbindlist(lapply(
  (1:smooth_param-0.5)/smooth_param,
  function(s) {
    rbind(
      ccdeath[, .(age = age+s, num_patients = 1/smooth_param)],
      ccdeath[, .(age = age_last+s, num_patients = -1/smooth_param)]
    )
  }
))[, .(num_patients = sum(num_patients)), keyby = age]

cccdeath <- x[x[, .(age = sort(unique(age)))], on = 'age']
cccdeath[, num_patients := cumsum(num_patients)]
ggplot(cccdeath, aes(x = age, y = num_patients)) + geom_step()
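
The counts above can be rescaled into a density without expanding the data: the area under the count curve equals the total person-time, so dividing by sum(stop - start) makes it integrate to 1. Below is a minimal base-R sketch under that assumption, using made-up toy intervals rather than the colon data. The helper names `interval_density` and `smooth_interval_density` and the bandwidth `bw` are hypothetical, not from the answer. The smoothed version swaps in a Gaussian kernel (the step density convolved with a normal of standard deviation `bw`) in place of the evenly spaced offsets used above.

```r
# Toy [start, stop] age intervals in years (illustrative values, not the colon data)
start <- c(50, 52, 60)
stop  <- c(55, 53, 62)

# Hypothetical helper: unsmoothed person-time density evaluated at ages `a`.
# findInterval(a, v) counts how many values of the sorted vector v are <= a,
# so the difference is the number of patients under follow-up at age a.
interval_density <- function(a, start, stop) {
  at_risk <- findInterval(a, sort(start)) - findInterval(a, sort(stop))
  at_risk / sum(stop - start)  # divide by total person-time so the area is 1
}

# Hypothetical helper: the same density convolved with a Gaussian kernel of
# standard deviation `bw`. A uniform block on [start, stop] convolved with a
# normal density gives pnorm((a - start) / bw) - pnorm((a - stop) / bw).
smooth_interval_density <- function(a, start, stop, bw = 1) {
  contrib <- outer(a, start, function(x, s) pnorm((x - s) / bw)) -
             outer(a, stop,  function(x, e) pnorm((x - e) / bw))
  rowSums(contrib) / sum(stop - start)
}

ages <- seq(45, 67, by = 0.5)
print(interval_density(c(52.5, 54, 57), start, stop))  # 0.25 0.125 0
print(round(smooth_interval_density(ages, start, stop, bw = 1), 3))
```

With the question's data, the same helpers could presumably be called as interval_density(grid, ccdeath$age, ccdeath$age_last) for any grid of ages; the intermediate matrix in the smoothed version has one row per grid point and one column per patient, which is far smaller than the person-day expansion.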

Answer 2

I would proceed along these lines:

If you are interested in knowing the age distribution after t days in the study, the age will simply be the age at enrollment plus t days. The exceptions you need to handle are those who have died or have been right-censored. In your example, you seem to have kept people's age "frozen" at the time they left the study. Personally I think the age distribution of survivors who have not been censored is more useful in a survival analysis, but I will stick to your set-up for this example.

For each patient at time t there are then two possibilities: if t is less than the follow-up time, the age is the age at enrollment plus t; otherwise the age is the age at enrollment plus the follow-up time.
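
As a small illustration of that rule, here is a sketch on made-up values (the vectors and the name `t_days` are hypothetical, chosen only for this example). Taking the smaller of t and the follow-up time with pmin covers both cases at once:

```r
# Toy data (illustrative values): age at enrollment in years, follow-up in days
age    <- c(60, 45, 70)
time   <- c(200, 1000, 400)
t_days <- 365  # days since enrollment

# age + t while still under follow-up, otherwise frozen at age + follow-up time
age_at_t <- age + pmin(t_days, time) / 365.25
print(round(age_at_t, 2))
```

The first patient left the study after 200 days, so their age stops advancing there; the other two are still under follow-up at day 365 and have aged by just under one year.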

If you wrap this in a function, you can see how the age distribution changes throughout the study. For completeness we will always plot a faint density of age at enrollment, and a line indicating the current mean age:

age_distribution <- function(df, t = 0)
{
  df %>% 
    mutate(age_at_t = age + ifelse(time > t, t, time) / 365.25) %>%
    ggplot() +
    geom_density(aes(age), linetype = 2, colour = "gray50") +
    geom_density(aes(age_at_t)) +
    geom_vline(aes(xintercept = mean(age_at_t)), color = "red", linetype = 2) +
    labs(x = paste("Age at day", t, "of study"),
         y = "Density",
         title = paste("Age distribution after", t, "days in study"))
}

So, for example:

age_distribution(ccdeath, 0)

And after 1 year:

age_distribution(ccdeath, 365)

And after 5 years:

age_distribution(ccdeath, 5 * 365.25)

For completeness, the equivalent function with censored / dead patients removed would be like this:

age_distribution <- function(df, t = 0)
{
  df %>% 
    filter(time > t) %>%
    mutate(age_at_t = age + t / 365.25) %>%
    ggplot() +
    geom_density(data = df, aes(age), linetype = 2, colour = "gray50") +
    geom_density(aes(age_at_t)) +
    geom_vline(aes(xintercept = mean(age_at_t)), color = "red", linetype = 2) +
    labs(x = paste("Age at day", t, "of study"),
         y = "Density",
         title = paste("Age distribution after", t, "days in study"))
}

So we can see how the distribution's shape changes after 5 years of follow-up:

age_distribution(ccdeath, 5 * 365.25)

This shows more clearly that there is a disproportionate loss of older people from the initial cohort.


2 answers

Answer 1
0 2020-09-04 21:26:07

Answer 2
0 2020-09-04 22:00:25
