
Retain observations that did not occur in the previous year in a long time-series dataset in R

I have a df that looks like this:

ID Year
5  2010
5  2011
5  2014
3  2013
3  2014
10 2013
1  2010
1  2012
1  2014
...

The df covers the years 2009-2019 and is filtered to individuals living in one particular town who are 18-64 years old in that year.

For every year, I need to keep only the individuals who moved into this town in that particular year. For example, for 2010 I need the population in 2010 minus the population in 2009. I need to do this for every year, because some people move out of town for a couple of years and then return (ID 5 is an example of this). In the end, I want one df for each year 2010-2019, so ten dfs, each containing only the individuals who moved into town that year.

I have played around with group_by() and left_join(), but haven't succeeded. There must be a simple solution, but I haven't been able to find one yet.

You can use the setdiff() function to perform the set(A) - set(B) operation. Split your data into data frames by year, then loop through them to find the new joiners.
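As a minimal sketch of the set-difference step on its own, the snippet below compares two made-up ID vectors for consecutive years. The names ids_2009, ids_2010 and moved_in_2010 are hypothetical and only used for illustration.

```r
# IDs present in each of two consecutive years (made-up values)
ids_2009 <- c(1, 2, 3, 4, 5)
ids_2010 <- c(1, 2, 3, 5, 6, 7)

# setdiff(x, y) returns the elements of x that are not in y,
# i.e. people present in 2010 who were not present in 2009
moved_in_2010 <- setdiff(ids_2010, ids_2009)
moved_in_2010
```

The argument order matters: swapping the two vectors would instead give the people who left between 2009 and 2010.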

Example code:

library(dplyr)
set.seed(123)
df <- tibble(
    id = c(1, 2, 3, 4, 5,     # first year
           1, 2, 3, 5, 6, 7,  # 4 moves out, 6,7 move in
           2, 3, 4, 6, 7, 8), # 1,5 moves out, 4,8 move in
    year = c(rep(2009, 5), 
             rep(2010, 6), 
             rep(2011, 6)), 
    age = sample(18:64, size = 17) # extra column
)

# split into list of dataframes by year
df_by_year <- split(df, df$year)

# one result per year after the first (number of years - 1)
df_list <- vector("list", length(df_by_year) - 1)
names(df_list) <- names(df_by_year)[-1]

for (i in seq_along(df_list)) {

    # IDs present this year but not in the previous year
    new_joinees <- setdiff(df_by_year[[i + 1]]$id, df_by_year[[i]]$id)

    # keep only the rows for those IDs
    df_list[[i]] <- dplyr::filter(df_by_year[[i + 1]], id %in% new_joinees)

}
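
The loop above compares neighbouring elements of the list, so it assumes every calendar year is present in the data. As an alternative sketch (not part of the original answer), the same "present this year, absent the year before" test can be done in one pass with base R by building an id/year key and checking it against the keys shifted forward by one year. The toy data repeats the example IDs; the names prior_keys, is_new, movers and movers_by_year are hypothetical.

```r
# same toy data as in the example (base data.frame, no packages needed)
df <- data.frame(
    id   = c(1, 2, 3, 4, 5,
             1, 2, 3, 5, 6, 7,
             2, 3, 4, 6, 7, 8),
    year = c(rep(2009, 5), rep(2010, 6), rep(2011, 6))
)

# key "id year + 1": an entry here means the id was present in the
# year before the year written in the key
prior_keys <- paste(df$id, df$year + 1)

# a row is a move-in if its own key has no match among the shifted keys;
# the first year is dropped because it has no previous year to compare to
is_new <- !(paste(df$id, df$year) %in% prior_keys) & df$year > min(df$year)
movers <- df[is_new, ]

# one data frame per year, named by year
movers_by_year <- split(movers, movers$year)
```

Because the comparison is made on the actual year value rather than on list position, someone who returns after a gap (like ID 5 in the question, present in 2011 and again in 2014) is counted as a move-in for the year they return.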
