R generating new data.table slowdown

I have a data.table listing the intervals during which patients were exposed, along with the start and stop times of the period over which they were observed for exposures. Exposure happens over intervals within the observation period. I want to generate the complementary intervals, in which the patients were unexposed.

The data I have are formatted like the following:

library(data.table)
DT = fread("
id t0  t  s tn
1   1  2  4 15
1   1  6  7 15
1   1 10 12 15
2   4  5  7 44
2   4  9 15 44
2   4 17 35 44")

t0 is the start time of observation, t is the start of exposure, s is the end of exposure, and tn is the end of observation. An example might be exposure to extreme UV in workers. ID 1 begins work in the first month of 2000, is exposed to extreme UV under normal working conditions from the second month, and the exposure stops in the fourth month because cloud cover then persists for 2 months. The "s" ending time represents the arrival of the clouds in the fourth month. The worker stays in "safe" conditions for 2 months, until the 6th month, when UV rises again until the 7th month. And so on...

I want to generate the intervals in which the exposure did not happen based on these data, that is, the intervals when the participant is in the "safe" conditions. The example output would be:

id  t  s
1   1  2
1   4  6
1   7 10
1  12 15
2   4  5
2   7  9
2  15 17
2  35 44

My first step is to build a data.table in the following manner:

nonexp <- DT[, .(t=c(t0[1], s), s=c(t, tn[1])), by=id]

but on my dataset of over 140,000 events, this runs VERY slowly. I am working in a heavily moderated remote compute environment, so I cannot tell whether the system is slow or my code is bad.

Is this code obviously suboptimal in some important way? Is there a faster way to do this?
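One way to separate the two possibilities is to time the same expression on synthetic data of similar size, ideally on a second machine as well. The sketch below is only illustrative: the table `big`, the 20,000 ids, and the 7 exposures per id are assumptions chosen to give 140,000 rows, not properties of the real dataset. Because `by=id` evaluates the expression once per group, the number of distinct ids is likely to matter as much as the row count.

```r
library(data.table)

# Assumed sizes: 20,000 ids x 7 exposures each = 140,000 events
n_id <- 20000L
k    <- 7L

# Synthetic exposures: each id is observed from month 1 and exposed
# for 2 months out of every 4, starting in month 2
big <- data.table(
  id = rep(seq_len(n_id), each = k),
  t0 = 1L,
  t  = rep(seq(2L, by = 4L, length.out = k), times = n_id)
)
big[, s  := t + 2L]        # each exposure lasts 2 months
big[, tn := 4L * k + 1L]   # observation ends after the last exposure

# Time the expression from the question on the synthetic table
timing <- system.time(
  nonexp <- big[, .(t = c(t0[1], s), s = c(t, tn[1])), by = id]
)
print(timing)
```

If this finishes quickly elsewhere but slowly in the remote environment, the environment is the more likely culprit; if it is slow everywhere, the per-group work is worth rethinking.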

I'd store the data so that time variables are not split over multiple columns:

# bounds table: one row per id for the start (t0) and end (tn) of observation
bdDT = melt(unique(DT[, .(id, t0, tn)]), id.vars = "id", value.name = "t")
bdDT[variable == "t0", status := "safe"]
bdDT[variable == "tn", status := "end"]
bdDT[, variable := NULL ]

# core table: one row per exposure start (t) and exposure end (s)
treatDT = melt(DT, id.vars = "id", measure.vars = c("t", "s"), value.name = "t")
treatDT[variable == "t", status := "treated"]
treatDT[variable == "s", status := "safe"]
treatDT[, variable := NULL ]

# stack; for duplicated (id, t) pairs the treatDT row is kept because it comes first
res = unique(rbind(treatDT, bdDT), by=c("id", "t"))
setkey(res, id, t)

The data now looks like:

    id  t  status
 1:  1  1    safe
 2:  1  2 treated
 3:  1  4    safe
 4:  1  6 treated
 5:  1  7    safe
 6:  1 10 treated
 7:  1 12    safe
 8:  1 15     end
 9:  2  4    safe
10:  2  5 treated
11:  2  7    safe
12:  2  9 treated
13:  2 15    safe
14:  2 17 treated
15:  2 35    safe
16:  2 44     end

From here, if you want to browse safe spells, there's...

> res[status == "safe"][res[status != "safe"], on=.(id, t), roll=TRUE, 
  .(id, start = x.t, end = i.t)
]

   id start end
1:  1     1   2
2:  1     4   6
3:  1     7  10
4:  1    12  15
5:  2     4   5
6:  2     7   9
7:  2    15  17
8:  2    35  44

Note: if some treatment runs all the way to tn, no zero-length safe spell is shown at tn.
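The edge case in the note can be checked on a tiny table. This is a minimal sketch, assuming a single hypothetical patient (`DT2`) whose only exposure ends exactly at tn; the names `DT2`, `bd`, `tr`, `res2` and `safe2` are made up for illustration, and `fifelse` is used only to shorten the status assignment.

```r
library(data.table)

# Hypothetical patient: observed from 1 to 5, exposed from 2 until the end (s == tn)
DT2 <- data.table(id = 1L, t0 = 1L, t = 2L, s = 5L, tn = 5L)

# bounds table, built as in the steps above
bd <- melt(unique(DT2[, .(id, t0, tn)]), id.vars = "id", value.name = "t")
bd[, status := fifelse(variable == "t0", "safe", "end")][, variable := NULL]

# core table
tr <- melt(DT2, id.vars = "id", measure.vars = c("t", "s"), value.name = "t")
tr[, status := fifelse(variable == "t", "treated", "safe")][, variable := NULL]

# stack: the "safe" row at t = 5 comes first, so the "end" row at t = 5 is dropped
res2 <- unique(rbind(tr, bd), by = c("id", "t"))
setkey(res2, id, t)

# same rolling join as above
safe2 <- res2[status == "safe"][res2[status != "safe"], on = .(id, t), roll = TRUE,
  .(id, start = x.t, end = i.t)]
print(safe2)
```

Only the safe spell from 1 to 2 comes back; there is no row for a spell starting and ending at 5.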


Alternately, if you have enough RAM, a cleaner way is to expand the data...

idDT = unique(DT[, .(id, start = t0, end = tn)], by="id")
fullDT = idDT[, .(t = start:end), by=id]

fullDT[, status := "safe"]
fullDT[DT, on=.(id, t >= t, t < s), status := "treated"]
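Whether there is "enough RAM" depends mostly on how many rows the expansion creates, and that can be counted before building anything. A rough sketch, using the example data from the question (the name `n_rows` is just for illustration):

```r
library(data.table)

# Example data from the question
DT <- data.table(
  id = c(1L, 1L, 1L, 2L, 2L, 2L),
  t0 = c(1L, 1L, 1L, 4L, 4L, 4L),
  t  = c(2L, 6L, 10L, 5L, 9L, 17L),
  s  = c(4L, 7L, 12L, 7L, 15L, 35L),
  tn = c(15L, 15L, 15L, 44L, 44L, 44L)
)

# One row per id with its observation window
idDT <- unique(DT[, .(id, start = t0, end = tn)], by = "id")

# The expanded table has one row per id per time point in [start, end];
# as.numeric() guards against integer overflow on large inputs
n_rows <- idDT[, sum(as.numeric(end - start) + 1)]
print(n_rows)
```

For the example this gives 15 + 41 = 56 rows. On real data with long observation windows the count can be far larger than the number of events, so it is worth looking at before running the expansion.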

From there, you can collapse to spells for easier browsing:

fullDT[, 
  .(start = first(t), end = last(t))
, by=.(id, status, g = rleid(id, status))][, !"g"][,
  end := replace(end + 1L, .N, last(end))
, by=id][]

    id  status start end
 1:  1    safe     1   2
 2:  1 treated     2   4
 3:  1    safe     4   6
 4:  1 treated     6   7
 5:  1    safe     7  10
 6:  1 treated    10  12
 7:  1    safe    12  15
 8:  2    safe     4   5
 9:  2 treated     5   7
10:  2    safe     7   9
11:  2 treated     9  15
12:  2    safe    15  17
13:  2 treated    17  35
14:  2    safe    35  44

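The replace step is needed because the OP writes overlapping end and start dates: each spell ends at the time point where the next one starts. The collapsed ends are the last time point inside each spell, so they are shifted by one, except for the final spell of each id, which already ends at tn.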
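To see what the replace step does in isolation, here is a small sketch on a plain vector, using the inclusive spell ends that the collapse produces for id 1 (the names `end_incl` and `end_shared` are made up for illustration):

```r
# Last time point inside each collapsed spell for id 1
end_incl <- c(1L, 3L, 5L, 6L, 9L, 11L, 15L)

# Shift every end by one so it coincides with the next spell's start,
# then restore the final element, which is already tn
n <- length(end_incl)
end_shared <- replace(end_incl + 1L, n, end_incl[n])

print(end_shared)
# 2 4 6 7 10 12 15
```

Inside the data.table call, `.N` plays the role of `n` and `last(end)` the role of `end_incl[n]`, applied separately for each id.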

