Holes in ordered groups

I have a data frame ordered by id and year, in which individuals are observed n times over a number of years. The number of observations per individual per year is irregular. I define a "hole" in the data as an observation where x2=1 and the observation immediately above it, for the same id (not necessarily the same year), has x2=0. For example, individual A has a hole in 2002. When this happens, I need to create a variable that stores the value of x1 from the observation immediately above, the one for which x2=0. In the example of individual A, the new variable would then need to equal 5 in the row where x2=1.

x1 = c(5,3,2,2,5,7,7,3,4,8)
x2 = c(0,1,0,1,0,1,0,1,0,1)
id = c("A","A","A","B","B","C","C","C","D","D")
year = c(2001,2002,2003,2001,2002,2001,2001,2002,2001,2002)

df = data.frame(year,id,x1,x2)

Considering this sample data frame, I would need the new variable to look like this:

outcome = c(NA,5,NA,NA,NA,NA,NA,7,NA,4)

The dataset I'm working with has close to 10,000,000 observations for 3,000,000 individuals over 4 years, so I can't do this manually. Is there a general way to achieve this that works with any dataset, regardless of its dimensions?

I went through a few posts here that use for loops to iterate over groups (one example was "Iterating a for loop over groups in a dataset"), but I wasn't able to apply any of them. I've been trying to do it in R after being unsuccessful in Stata 14. I wasn't able to find any post that applies to ordered groups, which is what I'm looking for.

Here's a simple way to get your outcome with dplyr. It relies on the rows already being ordered by id and year, as in the question.

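library(dplyr)

df %>% 
  group_by(id) %>%   # lag() then only looks at earlier rows of the same id
  mutate(
    # keep the previous row's x1 when x2 goes from 0 to 1; NA otherwise
    outcome = ifelse(x2 == 1 & lag(x2) == 0, lag(x1), NA)
  )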

Result

# A tibble: 10 × 5
# Groups:   id [4]
    year id       x1    x2 outcome
   <dbl> <chr> <dbl> <dbl>   <dbl>
 1  2001 A         5     0      NA
 2  2002 A         3     1       5
 3  2003 A         2     0      NA
 4  2001 B         2     1      NA
 5  2002 B         5     0      NA
 6  2001 C         7     1      NA
 7  2001 C         7     0      NA
 8  2002 C         3     1       7
 9  2001 D         4     0      NA
10  2002 D         8     1       4
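
Since the question mentions roughly 10 million rows, the same idea can also be sketched with data.table, which is often used for data of that size. This is only an illustration, not part of the answer above: it assumes the data.table package is installed and that the rows are already ordered by id and year. The name dt is made up for this example.

```r
library(data.table)

# Sample data from the question, already ordered by id and year.
dt <- data.table(
  year = c(2001, 2002, 2003, 2001, 2002, 2001, 2001, 2002, 2001, 2002),
  id   = c("A", "A", "A", "B", "B", "C", "C", "C", "D", "D"),
  x1   = c(5, 3, 2, 2, 5, 7, 7, 3, 4, 8),
  x2   = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
)

# shift() gives the previous row's value within each id (NA for the first row).
# fifelse() returns NA wherever the test is FALSE or NA.
dt[, outcome := fifelse(x2 == 1 & shift(x2) == 0, shift(x1), NA_real_), by = id]

print(dt)
```

The column is added by reference with :=, so no copy of the table is made.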

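If adding a package is not an option, the hole definition from the question can also be written in base R without any grouping step, by comparing each row with the one directly above it. This is a sketch under the same assumption that the rows are ordered by id and year. The helper names prev_x1, prev_x2, same_id and is_hole are made up for this example.

```r
# Sample data from the question, already ordered by id and year.
df <- data.frame(
  year = c(2001, 2002, 2003, 2001, 2002, 2001, 2001, 2002, 2001, 2002),
  id   = c("A", "A", "A", "B", "B", "C", "C", "C", "D", "D"),
  x1   = c(5, 3, 2, 2, 5, 7, 7, 3, 4, 8),
  x2   = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
)

n <- nrow(df)

# Values from the row directly above (NA for the very first row).
prev_x1 <- c(NA, df$x1[-n])
prev_x2 <- c(NA, df$x2[-n])

# TRUE only when the row above belongs to the same id.
same_id <- c(FALSE, df$id[-1] == df$id[-n])

# A hole: same id as the row above, and x2 goes from 0 to 1.
is_hole <- same_id & df$x2 == 1 & prev_x2 == 0

df$outcome <- ifelse(is_hole, prev_x1, NA)

print(df)
```

Because every step is a whole-column operation, nothing is evaluated per id, which may help with millions of individuals, though that has not been measured here.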