
New column value based on column and previous row values in R

I'm having trouble figuring out the code for the following. We have a basic table, shown below.

Status is calculated as follows: if qty_ordered == qty_delivered then Status = "D", else Status = "N".

Flag is where it gets more complicated: Flag == "YES" if Status == "D", unless the 2 previous consecutive days were both "N", as you can see on May 6th for Apple.

Flag == "NO" if it is at least the 2nd consecutive day of "N", as shown for Apple on May 4th.

Item    Date   qty_ordered  qty_delivered  Status  Flag
Apple   1-May          100            100  D       YES
Apple   2-May          100            100  D       YES
Apple   3-May          100             70  N       YES
Apple   4-May          100              0  N       NO
Apple   5-May          100              0  N       NO
Apple   6-May          100            100  D       NO
Apple   7-May          100            100  D       YES
Banana  1-May           50             50  D       YES
Banana  2-May           50              0  N       YES
Banana  3-May           50             50  D       YES
Banana  4-May           50             50  D       YES
Banana  5-May           50             50  D       YES
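For anyone who wants to reproduce the problem, the table above can be rebuilt as a data frame. This is only a minimal sketch: the name df matches the one used later in the question, Date is kept as plain text (an assumption), and Status is derived from the stated rule rather than typed in.

```r
# Rebuild the example table; Date is stored as character for simplicity
df <- data.frame(
  Item          = c(rep("Apple", 7), rep("Banana", 5)),
  Date          = paste0(c(1:7, 1:5), "-May"),
  qty_ordered   = c(rep(100L, 7), rep(50L, 5)),
  qty_delivered = c(100L, 100L, 70L, 0L, 0L, 100L, 100L,
                    50L, 0L, 50L, 50L, 50L)
)

# Status is "D" when the full order was delivered, otherwise "N"
df$Status <- ifelse(df$qty_ordered == df$qty_delivered, "D", "N")

# Desired Flag values, copied from the table
df$Flag <- c("YES", "YES", "YES", "NO", "NO", "NO", "YES",
             "YES", "YES", "YES", "YES", "YES")
```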

I usually use mutate to calculate new fields, such as: df <- mutate(df, Flag = if_else(qty_ordered == qty_delivered, "YES", "NO"))

but this doesn't include the check on the previous days that the problem needs.

Any help would be appreciated.

You can use lag to refer to previous values. Try -

library(dplyr)

df %>%
  mutate(Flag = case_when(lag(Status) == 'N' & lag(Status, 2) == 'N' ~ 'NO',
                          Status == 'D' | lag(Status) == 'D' ~ 'YES',
                          TRUE ~ 'NO'))

#     Item  Date qty_ordered qty_delivered Status Flag
#1   Apple 1-May         100           100      D  YES
#2   Apple 2-May         100           100      D  YES
#3   Apple 3-May         100            70      N  YES
#4   Apple 4-May         100             0      N   NO
#5   Apple 5-May         100             0      N   NO
#6   Apple 6-May         100           100      D   NO
#7   Apple 7-May         100           100      D  YES
#8  Banana 1-May          50            50      D  YES
#9  Banana 2-May          50             0      N  YES
#10 Banana 3-May          50            50      D  YES
#11 Banana 4-May          50            50      D  YES
#12 Banana 5-May          50            50      D  YES 

You may want to add group_by(Item) to do this separately for each Item.
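A sketch of what the grouped version might look like, using only the Item and Status columns from the question's table (the result name res is introduced here for illustration):

```r
library(dplyr)

# Only the columns the flag logic needs, taken from the question's table
df <- data.frame(
  Item   = c(rep("Apple", 7), rep("Banana", 5)),
  Status = c("D", "D", "N", "N", "N", "D", "D",
             "D", "N", "D", "D", "D")
)

# group_by() makes lag() restart at the first row of each Item,
# so Banana's first days never look back at Apple's last days
res <- df %>%
  group_by(Item) %>%
  mutate(Flag = case_when(lag(Status) == 'N' & lag(Status, 2) == 'N' ~ 'NO',
                          Status == 'D' | lag(Status) == 'D' ~ 'YES',
                          TRUE ~ 'NO')) %>%
  ungroup()
```

On this particular data the grouped and ungrouped results are the same, because Apple's last two rows are both D. With other data, an item that follows one ending in N rows could otherwise inherit a NO from its neighbour.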

Does this work:

library(dplyr)
df %>%
  group_by(Item) %>%
  mutate(Flag = case_when(Status == 'N' & lag(Status == 'N') ~ 'NO',
                          TRUE ~ 'YES'))
# A tibble: 12 x 6
# Groups:   Item [2]
   Item   Date  qty_ordered qty_delivered Status Flag 
   <chr>  <chr>       <int>         <int> <chr>  <chr>
 1 Apple  1-May         100           100 D      YES  
 2 Apple  2-May         100           100 D      YES  
 3 Apple  3-May         100            70 N      YES  
 4 Apple  4-May         100             0 N      NO   
 5 Apple  5-May         100             0 N      NO   
 6 Apple  6-May         100           100 D      YES  
 7 Apple  7-May         100           100 D      YES  
 8 Banana 1-May          50            50 D      YES  
 9 Banana 2-May          50             0 N      YES  
10 Banana 3-May          50            50 D      YES  
11 Banana 4-May          50            50 D      YES  
12 Banana 5-May          50            50 D      YES  
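In the output above, Apple on 6-May comes out as YES, whereas the table in the question expects NO (a D that follows two consecutive N days). One possible way to cover that case is to look two rows back as well. This is a sketch, not a definitive fix: the helper columns prev1 and prev2 are names introduced here, and treating days before the first row as "D" through the default argument of lag is an assumption.

```r
library(dplyr)

# Item and Status columns from the question's table
df <- data.frame(
  Item   = c(rep("Apple", 7), rep("Banana", 5)),
  Status = c("D", "D", "N", "N", "N", "D", "D",
             "D", "N", "D", "D", "D")
)

res <- df %>%
  group_by(Item) %>%
  mutate(
    # Was the previous day / the day before that an "N"?
    # Missing earlier days are filled with "D" (assumption)
    prev1 = lag(Status, 1, default = "D") == "N",
    prev2 = lag(Status, 2, default = "D") == "N",
    # "NO" when yesterday was N and either today or two days ago was N too
    Flag  = if_else(prev1 & (Status == "N" | prev2), "NO", "YES")
  ) %>%
  ungroup() %>%
  select(-prev1, -prev2)
```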

You can do this in base R with:

# create a data.frame with only the relevant columns
dat <- data.frame(Item = c(rep("Apple", 7), rep("Banana", 5)), 
                  Status = c("D", "D", "N", "N", "N", "D", "D", 
                             "D", "N", "D", "D", "D"))

# create the flag column
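# within each Item: "NO" if the previous day was N and either
# the current day or the day before the previous one was N as well
transform(dat, Flag = ave(Status == "N", Item, FUN = function(is_N)
  ifelse(c(FALSE, head(is_N, -1)) & (c(FALSE, FALSE, head(is_N, -2)) | is_N), "NO", "YES")))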
#R>      Item Status Flag
#R> 1   Apple      D  YES
#R> 2   Apple      D  YES
#R> 3   Apple      N  YES
#R> 4   Apple      N   NO
#R> 5   Apple      N   NO
#R> 6   Apple      D   NO
#R> 7   Apple      D  YES
#R> 8  Banana      D  YES
#R> 9  Banana      N  YES
#R> 10 Banana      D  YES
#R> 11 Banana      D  YES
#R> 12 Banana      D  YES
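To see how the shifted vectors in the base R code work, here is a small worked sketch on Apple's Status pattern alone. The names prev1, prev2 and flag are introduced here for illustration and do not appear in the answer.

```r
# Status pattern for Apple from the question: D D N N N D D
is_N <- c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)

# Shift by one and two positions; days before the first row count as "not N"
prev1 <- c(FALSE, head(is_N, -1))         # was the previous day an N?
prev2 <- c(FALSE, FALSE, head(is_N, -2))  # was the day before that an N?

# "NO" when the previous day was N and either today or two days ago was N too
flag <- ifelse(prev1 & (prev2 | is_N), "NO", "YES")
print(flag)
#> [1] "YES" "YES" "YES" "NO"  "NO"  "NO"  "YES"
```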

A faster variant is

# same logic, but index into c("YES", "NO") instead of calling ifelse()
transform(dat, Flag = ave(Status == "N", Item, FUN = function(is_N)
  c("YES", "NO")[
    1L + (c(FALSE, head(is_N, -1)) & (c(FALSE, FALSE, head(is_N, -2)) | is_N))]))
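The variant replaces ifelse with indexing into a two-element vector. A minimal sketch of that trick in isolation, with a made-up condition vector cond (hypothetical, not from the answer):

```r
cond <- c(FALSE, TRUE, TRUE, FALSE)

# FALSE/TRUE are coerced to 0/1, so 1L + cond gives index 1 or 2
idx <- 1L + cond
labels <- c("YES", "NO")[idx]

# Same result as the ifelse() form
stopifnot(identical(labels, ifelse(cond, "NO", "YES")))
```

Note that this relies on cond containing no NA values; an NA index would yield NA rather than a label.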

and here is a small simulation study

# perform a benchmark study with simulated data
library(dplyr)
set.seed(1)
n_lvls <- 1000L
n_per_lvl <- 6L
dat <- data.frame(
  Item = as.character(gl(n = n_lvls, n_per_lvl)), 
  Status = sample(c("D", "N"), replace = TRUE, n_per_lvl * n_lvls))

bench::mark(
  first = transform(dat, Flag = ave(Status == "N", Item, FUN = function(is_N)
    ifelse(c(FALSE, head(is_N, -1)) & (c(FALSE, FALSE, head(is_N, -2)) | is_N), "NO", "YES"))),
  faster = transform(dat, Flag = ave(Status == "N", Item, FUN = function(is_N)
    c("YES", "NO")[
      1L + (c(FALSE, head(is_N, -1)) & (c(FALSE, FALSE, head(is_N, -2)) | is_N))])),
  dplyr = dat %>% group_by(Item) %>%
    mutate(flag = case_when(lag(Status) == 'N' & lag(Status, 2) == 'N' ~ 'NO',
                            Status == 'D' | lag(Status) == 'D' ~ 'YES', 
                            TRUE ~ 'NO')), check = FALSE)
#R> # A tibble: 3 x 13
#R>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory              time           gc              
#R>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>              <list>         <list>          
#R> 1 first        16.4ms   17.1ms     53.6      511KB     15.9    27     8      504ms <NULL> <Rprofmem [34 × 3]> <bench_tm [27… <tibble [27 × 3…
#R> 2 faster       12.1ms     14ms     58.4      511KB     15.6    30     8      514ms <NULL> <Rprofmem [34 × 3]> <bench_tm [30… <tibble [30 × 3…
#R> 3 dplyr         196ms  196.7ms      5.03     333KB     15.1     3     9      596ms <NULL> <Rprofmem [45 × 3]> <bench_tm [3]> <tibble [3 × 3]>

where dplyr is 196 / 12.1 ~ 16 times slower than the faster base R variant (comparing minimum times).
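Because the benchmark passes check = FALSE, bench::mark does not verify that the expressions return the same flags. The sketch below compares the base R and dplyr logic on a single made-up item whose first day is already an N. The item "Cherry" and its Status values are hypothetical; the question's table contains no such row.

```r
library(dplyr)

# Hypothetical item whose history starts with an undelivered day
dat <- data.frame(Item = "Cherry", Status = c("N", "D", "D"))

# Base R logic from the faster variant
base_flag <- ave(dat$Status == "N", dat$Item, FUN = function(is_N)
  c("YES", "NO")[
    1L + (c(FALSE, head(is_N, -1)) & (c(FALSE, FALSE, head(is_N, -2)) | is_N))])

# dplyr logic from the benchmark
dplyr_flag <- dat %>%
  group_by(Item) %>%
  mutate(flag = case_when(lag(Status) == 'N' & lag(Status, 2) == 'N' ~ 'NO',
                          Status == 'D' | lag(Status) == 'D' ~ 'YES',
                          TRUE ~ 'NO')) %>%
  pull(flag)

# Side-by-side view of the two results
data.frame(Status = dat$Status, base_flag, dplyr_flag)
```

Here the two disagree on the first row. The base R code treats missing earlier days as "not N" and returns YES, while in case_when the comparisons against NA are not TRUE, so the row falls through to the final TRUE ~ 'NO'. On the question's own data both give the same flags.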
