
Looking for efficient way to query sub-group observations in R or Stata

I want to write a function that performs an operation on each record in a dataset, based on all other records within the subgroup sharing that record's unique [id] value. I'm very new to R, but I know that you can query a subset of records based on a condition using the following:

df$date[id == "1234"]

Is it possible to replace "1234" with a variable derived from the unique row that the function is operating on? Something like...

df$date[id == df$id]

, so that it pulls values of [date] where [id] matches the [id] of the index row? In practice I would use this in a loop, where for values of x, I can query a specific [date] value using:

df$date[id == df$id & order == x]

My dataset has multiple records for each unique [id]. Ultimately, I would like to compare the [date_1] value of each record to the [date_2] of all other records in that index record's [id] subgroup. The data looks something like this:

[id] | [order] | [date_1] | [date_2] |
-------------------------------------- 
  A  |    1    |    1/1   |    1/30  |
  A  |    2    |    1/5   |    1/5   |
  A  |    3    |    1/7   |    1/8   |
  A  |    4    |    1/9   |    1/9   |
--------------------------------------
  B  |    1    |    3/7   |    3/10  |
  B  |    2    |    4/1   |    4/9   |
--------------------------------------

Though this could be done by looping through each unique value of [id] and then cycling through each unique value of [order], with 5-10 million records that approach proves extremely slow and resource intensive. I'm wondering if there is a more efficient way to loop through the [order] values only and compute this operation for every record simultaneously.

As I said, I'm new to R, so I'm not sure of the exact syntax yet, but I'm picturing something like this:

for (x in 1:max(order)) {
    df$episode_start <- ifelse(df$date_1 - df$date_2[id == df$id & order == x] > 1, 1, 0)
}

I can provide more detail on the overall objective of this project, if it would be useful. In short, these data are hospital records, and the goal is to identify records that begin a new segment, defined as an encounter that has no prior discharge within 1 day of admission. The data become tricky in that there are overlapping records (e.g. if a patient was an inpatient in long-term care and had to go for an outpatient visit to the emergency department). In the example above, A2 and A3 look like new encounters based on the discharge date [date_2] of the prior record; however, A2, A3 and A4 all occurred during the span of A1, so the result should look like this:

[id] | [order] | [date_1] | [date_2] | [episode_start]
------------------------------------------------------ 
  A  |    1    |    1/1   |    1/30  |       1
  A  |    2    |    1/5   |    1/5   |       0
  A  |    3    |    1/7   |    1/8   |       0
  A  |    4    |    1/9   |    1/9   |       0
------------------------------------------------------
  B  |    1    |    3/7   |    3/10  |       1
  B  |    2    |    4/1   |    4/9   |       1
------------------------------------------------------

Thanks in advance. Any help or direction is much appreciated. Note: I primarily work in Stata and attempted to use the -bysort- command to do something similar, but to no avail. Thought maybe R was better suited for this. Open to suggestions using either.

The problem of overlapping hospital stays shows up from time to time on Statalist. See an example here. The solution is to convert the admission/discharge date dyad to long form and to order both events chronologically. A new hospital spell starts either at the first observation for a patient or when the patient was out of the hospital at the end of the day of the previous observation. Here's an example with data derived from Bulat's R solution (modified to add 2 additional stays):

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 id byte order str10(date_1 date_2)
"A" 1 "2016-01-01" "2016-01-30"
"A" 2 "2016-01-05" "2016-01-05"
"A" 3 "2016-01-07" "2016-01-08"
"A" 4 "2016-01-09" "2016-01-09"
"A" 5 "2016-02-09" "2016-02-09"
"B" 1 "2016-03-07" "2016-03-10"
"B" 2 "2016-03-08" "2016-03-08"
"B" 3 "2016-04-01" "2016-04-9"
end

gen ndate1 = date(date_1,"YMD")
gen ndate2 = date(date_2,"YMD")
format %td ndate1 ndate2

* confirm that each observation is uniquely identified by id and order
isid id order, sort

* reshape to long; event==1 => admission; event==2 => discharge
reshape long ndate, i(id order) j(event)

* push the discharge date a day later (to make consecutive stays overlap)
replace ndate = ndate + 1 if event == 2

* define an inout increment for admission and discharge events
bysort id order (event): gen inout = cond(_n==1,1,-1)

* for each patient, sort events by date; for multiple events on the same day,
* put admissions before discharge
gsort id ndate -event
by id: gen eventsum = sum(inout)

* if the previous eventsum is 0, a new hospitalization spell starts
by id: gen spell = sum(_n == 1 | eventsum[_n-1] == 0)

* return to the original wide form data
keep if inout == 1

* flag the first obs of each spell
bysort id spell (ndate order): gen newspell = _n == 1

list id order date_1 date_2 spell newspell, sepby(id spell)

and the results:

. list id order date_1 date_2 spell newspell, sepby(id spell)

     +---------------------------------------------------------+
     | id   order       date_1       date_2   spell   newspell |
     |---------------------------------------------------------|
  1. |  A       1   2016-01-01   2016-01-30       1          1 |
  2. |  A       2   2016-01-05   2016-01-05       1          0 |
  3. |  A       3   2016-01-07   2016-01-08       1          0 |
  4. |  A       4   2016-01-09   2016-01-09       1          0 |
     |---------------------------------------------------------|
  5. |  A       5   2016-02-09   2016-02-09       2          1 |
     |---------------------------------------------------------|
  6. |  B       1   2016-03-07   2016-03-10       1          1 |
  7. |  B       2   2016-03-08   2016-03-08       1          0 |
     |---------------------------------------------------------|
  8. |  B       3   2016-04-01    2016-04-9       2          1 |
     +---------------------------------------------------------+

Here is something to get you started using the data.table package in R:

data <- read.table(text = "id order date_1 date_2 
A 1 2016-01-01 2016-01-30 
A 2 2016-01-05 2016-01-05
A 3 2016-01-07 2016-01-08
A 4 2016-01-09 2016-01-09
B 1 2016-03-07 2016-03-10
B 2 2016-04-01 2016-04-9", header = T)
library(data.table)
data$date_1 <- as.Date(data$date_1)
data$date_2 <- as.Date(data$date_2)
dt <- data.table(data, key = c("date_1", "date_2"))

res <- foverlaps(dt, dt, by.x = c("date_1", "date_2"), by.y = c("date_1", "date_2"))

# Remove matches from irrelevant groups.
res <- res[id == i.id]

# Find the period start date.
res[, min.date := min(i.date_1), by = .(id, order)]
res[, period.start := (date_1 == min.date)]

# Order records according to the period start date.
res <- res[order(id, order, i.date_1)]
# Remove duplicate rows
res <- res[, .SD[1], by = .(id, order)]

# Print results.
res[, .(id, order, date_1, date_2, period.start)][]

#       id order     date_1     date_2 period.start
# 1:  A     1 2016-01-01 2016-01-30         TRUE
# 2:  A     2 2016-01-05 2016-01-05        FALSE
# 3:  A     3 2016-01-07 2016-01-08        FALSE
# 4:  A     4 2016-01-09 2016-01-09        FALSE
# 5:  B     1 2016-03-07 2016-03-10         TRUE
# 6:  B     2 2016-04-01 2016-04-09         TRUE

One convenient way to get at the subsets for processing is to use by(). That will automatically subset your data.frame (in this case by ID) and allow you to focus on handling the records for each ID.

result <- by(df, df$id, function(x) {
    ## identify start dates for sub-group
})
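The skeleton above might be filled in roughly as follows for the episode-flagging problem in the question. This is only a sketch, not the only way to do it: it assumes the data frame is already sorted by [id] and [order], and it flags a record as starting a new episode when its admission date is more than one day after the latest discharge date of any earlier record in the same [id] group (the helper name episode_flag is made up for illustration):

```r
# Sample data matching the question's example, with parsed dates.
df <- data.frame(
  id     = c("A", "A", "A", "A", "B", "B"),
  order  = c(1, 2, 3, 4, 1, 2),
  date_1 = as.Date(c("2016-01-01", "2016-01-05", "2016-01-07",
                     "2016-01-09", "2016-03-07", "2016-04-01")),
  date_2 = as.Date(c("2016-01-30", "2016-01-05", "2016-01-08",
                     "2016-01-09", "2016-03-10", "2016-04-09"))
)

episode_flag <- function(x) {
  # Running maximum of discharge dates over all *previous* records in the
  # group; NA for the first record, which always starts an episode.
  prev_max <- c(NA, cummax(as.numeric(x$date_2))[-nrow(x)])
  as.integer(is.na(prev_max) | as.numeric(x$date_1) - prev_max > 1)
}

# by() splits df on id and applies the function to each piece; the result
# lines up with df only because df is already sorted by id and order.
df$episode_start <- unname(unlist(by(df, df$id, episode_flag)))
df$episode_start
# 1 0 0 0 1 1
```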

However, I suspect you'll still find that to be slow. Using data.table as suggested in another answer should help with that.

You could further speed up processing by parallelising this over ID groups. Take a look at the foreach package to help with that. It allows you to write code like this (assuming df$id is a factor):

foreach(i = levels(df$id)) %dopar% {
    ## Identify start dates for group i
}
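One caveat, in case it saves you some head-scratching: %dopar% only runs in parallel once a backend has been registered; without one, foreach falls back to sequential execution and emits a warning. A minimal setup with the doParallel package might look like the following sketch (the two-worker cluster size and the min(date_1) computation are arbitrary choices for illustration):

```r
library(foreach)
library(doParallel)

df <- data.frame(id = factor(c("A", "A", "B")),
                 date_1 = as.Date(c("2016-01-01", "2016-01-05", "2016-03-07")))

cl <- makeCluster(2)    # two worker processes; size this to your machine
registerDoParallel(cl)

# earliest admission date per id, with each group handled by a worker;
# foreach automatically exports df to the workers
first_dates <- foreach(i = levels(df$id), .combine = c) %dopar% {
  min(df$date_1[df$id == i])
}

stopCluster(cl)
first_dates
```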

I'd solve this using the dplyr package, a fantastic data manipulation tool you can install by running install.packages('dplyr') and then load with library('dplyr').

The cheatsheet for this package explains how to manipulate data very eloquently: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

I'm not entirely sure what you want to calculate. Are you trying to create a new column with a calculation based on the values in each row? Or are you trying to calculate something for each unique value of id? In the former case, I would use dplyr::mutate(df, newcolumn = some_operation). In the latter case, I would use group_by(id) and then functions like filter() and summarise() to generate a new data frame with one row for each id.
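Applying the grouped-mutate route to the episode question might look like the following sketch. It assumes rows are sorted by admission date within each id, and it flags a row as a new episode start when its admission is more than one day after the running maximum of earlier discharge dates in the group (group_by, mutate and lag are dplyr functions and cummax is base R; the column names prev_discharge and episode_start are made up here):

```r
library(dplyr)

df <- data.frame(
  id     = c("A", "A", "A", "A", "B", "B"),
  date_1 = as.Date(c("2016-01-01", "2016-01-05", "2016-01-07",
                     "2016-01-09", "2016-03-07", "2016-04-01")),
  date_2 = as.Date(c("2016-01-30", "2016-01-05", "2016-01-08",
                     "2016-01-09", "2016-03-10", "2016-04-09"))
)

result <- df %>%
  group_by(id) %>%
  mutate(
    # latest discharge date among all earlier rows in the group (NA for row 1)
    prev_discharge = lag(cummax(as.numeric(date_2))),
    episode_start  = as.integer(is.na(prev_discharge) |
                                as.numeric(date_1) - prev_discharge > 1)
  ) %>%
  ungroup()

result$episode_start
# 1 0 0 0 1 1
```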
