How to create a new column with values from the previous row of an existing column, in a tibble, in R

I want to create a new column that holds the last value from the previous period for the same ID, placed in the same row as the first value of the next period. If there is no previous period, the value should be NA.

However, I can't find a function in any package that does this for me, so I expect I have to write a loop?

Does anyone have an idea how to solve this in a tidy manner (with or without a loop) that can be applied to a big tibble (4+ million observations)?

My data is ordered like the following df, and the goal is df1:

library(tibble)

df <- tibble(
  ID = rep(c(77,88,99),each=6),
  PERIOD = rep(c(1,2,3,1,2,3,1,2,3),each=2),
  DATE = seq(as.Date("2020-06-01"), as.Date("2020-06-18"), by= "days"),
  RESULT = seq(from = 10, to = 44, by = 2)
)
df
# A tibble: 18 x 4
      ID PERIOD DATE       RESULT
   <dbl>  <dbl> <date>      <dbl>
 1    77      1 2020-06-01     10
 2    77      1 2020-06-02     12
 3    77      2 2020-06-03     14
 4    77      2 2020-06-04     16
 5    77      3 2020-06-05     18
 6    77      3 2020-06-06     20
 7    88      1 2020-06-07     22
 8    88      1 2020-06-08     24
 9    88      2 2020-06-09     26
10    88      2 2020-06-10     28
11    88      3 2020-06-11     30
12    88      3 2020-06-12     32
13    99      1 2020-06-13     34
14    99      1 2020-06-14     36
15    99      2 2020-06-15     38
16    99      2 2020-06-16     40
17    99      3 2020-06-17     42
18    99      3 2020-06-18     44

df1 <- tibble(
  ID = rep(c(77,88,99),each=6),
  PERIOD = rep(c(1,2,3,1,2,3,1,2,3),each=2),
  DATE = seq(as.Date("2020-06-01"), as.Date("2020-06-18"), by= "days"),
  RESULT = seq(from = 10, to = 44, by = 2),
  RESULT_pre = c("NA","NA",12,"NA",16,"NA","NA","NA",24,"NA",28,
                  "NA","NA", "NA",36, "NA",40, "NA" )
)
df1

# A tibble: 18 x 5
      ID PERIOD DATE       RESULT RESULT_pre
   <dbl>  <dbl> <date>      <dbl> <chr>     
 1    77      1 2020-06-01     10 NA        
 2    77      1 2020-06-02     12 NA        
 3    77      2 2020-06-03     14 12        
 4    77      2 2020-06-04     16 NA        
 5    77      3 2020-06-05     18 16        
 6    77      3 2020-06-06     20 NA        
 7    88      1 2020-06-07     22 NA        
 8    88      1 2020-06-08     24 NA        
 9    88      2 2020-06-09     26 24        
10    88      2 2020-06-10     28 NA        
11    88      3 2020-06-11     30 28        
12    88      3 2020-06-12     32 NA        
13    99      1 2020-06-13     34 NA        
14    99      1 2020-06-14     36 NA        
15    99      2 2020-06-15     38 36        
16    99      2 2020-06-16     40 NA        
17    99      3 2020-06-17     42 40        
18    99      3 2020-06-18     44 NA

All input is appreciated.

Thx / Sophia

Here's a way with dplyr:

library(dplyr)

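df %>%
  group_by(ID, PERIOD) %>%
  # last RESULT of each ID/PERIOD; the result stays grouped by ID
  summarise(RESULT_pre = last(RESULT), .groups = "drop_last") %>%
  # shift within each ID so every period gets the previous period's value
  mutate(RESULT_pre = lag(RESULT_pre)) %>%
  left_join(df, by = c('ID', 'PERIOD')) %>%
  group_by(ID, PERIOD) %>%
  # keep the value on the first row of each period only
  mutate(RESULT_pre = replace(RESULT_pre, -1, NA)) %>%
  # move RESULT_pre to the last column
  select(-RESULT_pre, RESULT_pre)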

#      ID PERIOD DATE       RESULT RESULT_pre
#   <dbl>  <dbl> <date>      <dbl>      <dbl>
# 1    77      1 2020-06-01     10         NA
# 2    77      1 2020-06-02     12         NA
# 3    77      2 2020-06-03     14         12
# 4    77      2 2020-06-04     16         NA
# 5    77      3 2020-06-05     18         16
# 6    77      3 2020-06-06     20         NA
# 7    88      1 2020-06-07     22         NA
# 8    88      1 2020-06-08     24         NA
# 9    88      2 2020-06-09     26         24
#10    88      2 2020-06-10     28         NA
#11    88      3 2020-06-11     30         28
#12    88      3 2020-06-12     32         NA
#13    99      1 2020-06-13     34         NA
#14    99      1 2020-06-14     36         NA
#15    99      2 2020-06-15     38         36
#16    99      2 2020-06-16     40         NA
#17    99      3 2020-06-17     42         40
#18    99      3 2020-06-18     44         NA

The logic here is to summarise the last RESULT value for each ID and PERIOD and use lag to shift those values within each ID. We then join this result with the original dataset, keep only the first value in each ID/PERIOD group, and replace all other values with NA.
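The same idea can probably be expressed without the summarise/join round trip. The following is only a sketch, and it assumes the rows are already sorted by ID, PERIOD and DATE, as they are in the question. Within each ID, a row that starts a new period is one whose PERIOD differs from the row above it, and the row above it is then the last row of the previous period. The name res is made up for this illustration.

```r
library(dplyr)

# Rebuild the example data from the question (dplyr re-exports tibble()).
df <- tibble(
  ID = rep(c(77, 88, 99), each = 6),
  PERIOD = rep(c(1, 2, 3, 1, 2, 3, 1, 2, 3), each = 2),
  DATE = seq(as.Date("2020-06-01"), as.Date("2020-06-18"), by = "days"),
  RESULT = seq(from = 10, to = 44, by = 2)
)

# Assumption: rows are sorted by ID, PERIOD, DATE.
res <- df %>%
  group_by(ID) %>%
  mutate(
    # On the first row of an ID, lag(PERIOD) is NA, so the condition is NA
    # and if_else() returns NA there as well.
    RESULT_pre = if_else(PERIOD != lag(PERIOD), lag(RESULT), NA_real_)
  ) %>%
  ungroup()

res
```

If the data is not sorted, an arrange(ID, PERIOD, DATE) before the group_by() should restore the assumption.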

Alternatively, in base R, you can shift all values down by one row and then overwrite the ones that do not fit with NA:

n <- nrow(df)
# shift RESULT down by one row
df$RESULT_pre <- c(NA, df$RESULT[-n])
# blank out rows that start a new ID or continue the same PERIOD
df$RESULT_pre[c(FALSE, df$ID[-1] != df$ID[-n] |
   df$PERIOD[-1] == df$PERIOD[-n])] <- NA
df
#   ID PERIOD       DATE RESULT RESULT_pre
#1  77      1 2020-06-01     10         NA
#2  77      1 2020-06-02     12         NA
#3  77      2 2020-06-03     14         12
#4  77      2 2020-06-04     16         NA
#5  77      3 2020-06-05     18         16
#6  77      3 2020-06-06     20         NA
#7  88      1 2020-06-07     22         NA
#8  88      1 2020-06-08     24         NA
#9  88      2 2020-06-09     26         24
#10 88      2 2020-06-10     28         NA
#11 88      3 2020-06-11     30         28
#12 88      3 2020-06-12     32         NA
#13 99      1 2020-06-13     34         NA
#14 99      1 2020-06-14     36         NA
#15 99      2 2020-06-15     38         36
#16 99      2 2020-06-16     40         NA
#17 99      3 2020-06-17     42         40
#18 99      3 2020-06-18     44         NA
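The shift-and-mask approach relies on the rows being in ID, PERIOD, DATE order. As a sketch only, it could be wrapped in a small helper that sorts first. The function name add_prev_period_value and the scrambled example data are made up for this illustration, and the helper assumes at least one row and the column names from the question.

```r
# Hypothetical helper, not part of any package.
add_prev_period_value <- function(data) {
  stopifnot(nrow(data) > 0)
  # Sort so that each row's neighbour above is its true predecessor.
  data <- data[order(data$ID, data$PERIOD, data$DATE), ]
  n <- nrow(data)
  # Shift RESULT down by one row.
  data$RESULT_pre <- c(NA, data$RESULT[-n])
  # Drop values on rows that start a new ID or continue the same PERIOD.
  drop <- c(FALSE, data$ID[-1] != data$ID[-n] |
              data$PERIOD[-1] == data$PERIOD[-n])
  data$RESULT_pre[drop] <- NA
  data
}

# Small, deliberately unsorted example.
scrambled <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2),
  PERIOD = c(2, 1, 1, 2, 1, 2),
  DATE = as.Date("2020-06-01") + c(2, 0, 1, 3, 4, 5),
  RESULT = c(30, 10, 20, 40, 50, 60)
)

out <- add_prev_period_value(scrambled)
out$RESULT_pre
# NA NA 20 NA NA 50
```

Everything here is vectorised, so it should scale to a few million rows, although that has not been measured here.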

