
R data.table: calculate new columns from existing columns based on certain conditions

Let's say I have the following data table:

dta <- data.table(
  criteria = c('A', 'A', 'B', 'A', 'A', 'B'),
  phase = list('block3', c('block1', 'block2'), 'block2', 'block2', 'block3', 'block1'),
  start_val = c(12.0, 1.0, 7.0, 7.0, 12.0, 1.0),
  end_val = c(15.0, 11.0, 11.0, 11.0, 15.0, 6.0),
  max_val = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0)
)

from which I need a resulting table with two additional columns, cor_start and cor_end:

dtb <- data.table(
  criteria = c('A', 'A', 'B', 'A', 'A', 'B'),
  phase = list('block3', c('block1', 'block2'), 'block2', 'block2', 'block3', 'block1'),
  start_val = c(12.0, 1.0, 7.0, 7.0, 12.0, 1.0),
  end_val = c(15.0, 11.0, 11.0, 11.0, 15.0, 6.0),
  max_val = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0),
  cor_start = c(12.0, 1.0, 8.0, 9.5, 13.0, 6.0),
  cor_end = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0)
)

The new columns need to be calculated with reference to the phase column, by checking whether any previous row has a phase value matching the current row.

For better understanding, in this example:

  • row 3 has a matching phase of block2 in row 2
  • row 4 has a matching phase of block2 in row 3
  • row 5 has a matching phase of block3 in row 1
  • row 6 has a matching phase of block1 in row 2

However, row 1 and row 2 have no previous matching phase rows. Note that the phase column is of type list.
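The matching described above can be sketched in base R. This is only an illustration: the name prev_match is hypothetical, and picking the closest earlier row when several earlier rows match is an assumption that happens to agree with the four matches listed in the question.

```r
phase <- list("block3", c("block1", "block2"), "block2", "block2", "block3", "block1")

# prev_match[i]: closest earlier row sharing at least one phase with row i, else NA
prev_match <- vapply(seq_along(phase), function(i) {
  earlier <- seq_len(i - 1)
  shares_phase <- vapply(phase[earlier], function(p) any(p %in% phase[[i]]), logical(1))
  hits <- earlier[shares_phase]
  if (length(hits) > 0) max(hits) else NA_integer_
}, integer(1))

print(prev_match)
```

Because phase is a list column, each element can hold several values, which is why the comparison uses %in% rather than ==.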

So, when there is a previous matching row, these are the conditions:

if (max_val in previous matching row is < end_val in current row)
  cor_start = previous matching row max_val
  cor_end = current row end_val

if (max_val in previous matching row is > end_val in current row)
  cor_start = current row end_val
  cor_end = current row end_val

And when there is no previous matching row, these are the conditions:

  cor_start = current row start_val
  cor_end = current row max_val
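As a rough sketch, the rules above can be written as a function for a single row. The name cor_bounds and its prev_max argument (the max_val of the previous matching row, NA if there is none) are hypothetical. The question does not say what happens when the previous max_val equals the current end_val; the sketch treats that tie like the "greater than" case, which is an assumption.

```r
# Hypothetical helper: prev_max is max_val of the previous matching row, NA if none
cor_bounds <- function(start_val, end_val, max_val, prev_max = NA_real_) {
  if (is.na(prev_max)) {
    # No previous matching row
    return(c(cor_start = start_val, cor_end = max_val))
  }
  if (prev_max < end_val) {
    return(c(cor_start = prev_max, cor_end = end_val))
  }
  # prev_max > end_val; the tie prev_max == end_val is an assumption (not in the question)
  c(cor_start = end_val, cor_end = end_val)
}

row1 <- cor_bounds(12, 15, 13)                # no previous match
row3 <- cor_bounds(7, 11, 9.5, prev_max = 8)  # previous match is row 2
row6 <- cor_bounds(1, 6, 6, prev_max = 8)     # previous match is row 2
```

Applied literally, these rules give cor_end = 11 for row 3, whereas the desired table lists 9.5 (that row's max_val), so the written conditions and the desired output do not fully agree.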

I looked into shift(), but could not figure out how to apply the above conditions. Thanks!

Something like:

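library(data.table)
# Unnest phase to one row per (row, phase) pair; rn keeps the original row number
dta_transformed <- dta[,.(rn = .I, phase = unlist(phase)), by = setdiff(names(dta), 'phase')][
  # max_val of the previous row with the same phase (NA if there is none)
  , shifted_max := shift(max_val), by = phase][
    shifted_max < end_val, `:=` (cor_start = shifted_max, cor_end = end_val), by = phase][
      shifted_max > end_val, `:=` (cor_start = end_val, cor_end = end_val), by = phase][
        # No previous match: fall back to the row's own start_val and max_val
        is.na(cor_start), `:=` (cor_start = start_val, cor_end = max_val), by = phase][
          # Collapse back to one row per original row
          , phase := paste(phase, collapse = ","), by = rn][!duplicated(rn),][
            , c("rn", "shifted_max") := NULL]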

However, the output I get is:

   criteria         phase start_val end_val max_val cor_start cor_end
1:        A        block3        12      15    13.0      12.0      13
2:        A block1,block2         1      11     8.0       1.0       8
3:        B        block2         7      11     9.5       8.0      11
4:        A        block2         7      11    11.0       9.5      11
5:        A        block3        12      15    15.0      13.0      15
6:        B        block1         1       6     6.0       6.0       6

Could it be that cor_end in row 3 of your desired output should be 11? The previous matching row (2) has a lower max_val, so the current end_val (11) should be taken.

Here is also a tidyverse approach, which is slightly more readable:

library(tidyverse)

dta %>% mutate(rn = row_number()) %>%
  unnest(phase) %>%
  group_by(phase) %>%
  mutate(
    cor_start = case_when(
      lag(max_val) < end_val ~ lag(max_val),
      lag(max_val) > end_val ~ end_val,
      TRUE ~ start_val
    ),
    cor_end = if_else(!is.na(lag(max_val)), end_val, max_val)
  ) %>% group_by(rn) %>%
  mutate(
    phase = paste(phase, collapse = ",")
  ) %>% ungroup() %>% select(-rn) %>% distinct()

Here is a different approach which uses pmin() instead of ifelse() and utilises the fill parameter of the shift() function. Furthermore, it reduces the number of grouping operations:

library(data.table)
dta[, rn := .I]
dta[dta[, .(phase2 = unlist(phase)), by = rn], on = "rn"][
  , `:=`(cor_start = pmin(shift(max_val, fill = start_val[1]), end_val), 
         cor_end = max_val), by = phase2][
    , .SD[1], by = rn][
      , c("rn", "phase2") := NULL][]
   criteria         phase start_val end_val max_val cor_start cor_end
1:        A        block3        12      15    13.0      12.0    13.0
2:        A block1,block2         1      11     8.0       1.0     8.0
3:        B        block2         7      11     9.5       8.0     9.5
4:        A        block2         7      11    11.0       9.5    11.0
5:        A        block3        12      15    15.0      13.0    15.0
6:        B        block1         1       6     6.0       6.0     6.0
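To see why pmin() can stand in for ifelse() here, the block2 group can be worked through in base R. This is only a sketch: the lag is built by hand with c() and head() as a stand-in for shift(max_val, fill = start_val[1]), and the vectors are the block2 rows (2, 3 and 4) copied from the example data.

```r
# Values of the rows that contain block2 (rows 2, 3 and 4 of the example data)
start_val <- c(1, 7, 7)
end_val   <- c(11, 11, 11)
max_val   <- c(8, 9.5, 11)

# Manual lag of max_val, filling the first position with the group's first start_val
prev_max <- c(start_val[1], head(max_val, -1))

# Both expressions pick the smaller of prev_max and end_val, element by element
via_pmin   <- pmin(prev_max, end_val)
via_ifelse <- ifelse(prev_max < end_val, prev_max, end_val)

print(via_pmin)
```

Filling the first lagged value with start_val[1] also covers the "no previous matching row" case for cor_start without a separate branch, as long as start_val does not exceed end_val in that row.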
