
R data.table to calculate new columns from existing columns based on certain conditions

Let's say I have the following data table:

library(data.table)

dta <- data.table(
  criteria = c('A', 'A', 'B', 'A', 'A', 'B'),
  phase = list('block3', c('block1', 'block2'), 'block2', 'block2', 'block3', 'block1'),
  start_val = c(12.0, 1.0, 7.0, 7.0, 12.0, 1.0),
  end_val = c(15.0, 11.0, 11.0, 11.0, 15.0, 6.0),
  max_val = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0)
)

From this I need the resulting table with two additional columns, cor_start and cor_end:

dtb <- data.table(
  criteria = c('A', 'A', 'B', 'A', 'A', 'B'),
  phase = list('block3', c('block1', 'block2'), 'block2', 'block2', 'block3', 'block1'),
  start_val = c(12.0, 1.0, 7.0, 7.0, 12.0, 1.0),
  end_val = c(15.0, 11.0, 11.0, 11.0, 15.0, 6.0),
  max_val = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0),
  cor_start = c(12.0, 1.0, 8.0, 9.5, 13.0, 6.0),
  cor_end = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0)
)

The new columns need to be calculated with reference to the phase column, by checking whether any previous row has a phase value matching the current row.

For better understanding, in this example:

  • row 3 has a matching phase of block2 in row 2
  • row 4 has a matching phase of block2 in row 3
  • row 5 has a matching phase of block3 in row 1
  • row 6 has a matching phase of block1 in row 2

However, rows 1 and 2 have no previous matching row. Note that the phase column is of type list.
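For illustration, the matching rule can be sketched in base R as below. The helper name prev_match is hypothetical, and "matching" is assumed to mean that the two rows share at least one phase value, with the nearest earlier row winning.

```r
# phase column as a plain list, copied from the example table
phase <- list('block3', c('block1', 'block2'), 'block2', 'block2', 'block3', 'block1')

# For row i, return the index of the nearest earlier row that shares a phase
# value, or NA when there is none. prev_match is a hypothetical helper name.
prev_match <- function(i, phase) {
  earlier <- rev(seq_len(i - 1))  # i-1, ..., 1 (empty for i = 1)
  for (j in earlier) {
    if (length(intersect(phase[[i]], phase[[j]])) > 0) return(j)
  }
  NA_integer_
}

prev <- vapply(seq_along(phase), prev_match, integer(1), phase = phase)
print(prev)
```

For the example data this gives NA, NA, 2, 3, 1, 2, which agrees with the matches listed above.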

So, when there is a previous matching row, below are the conditions:

if (max_val in previous matching row is < end_val in current row)
  cor_start = previous matching row max_val
  cor_end = current row end_val

if (max_val in previous matching row is > end_val in current row)
  cor_start = current row end_val
  cor_end = current row end_val

and when there is no previous matching row, below are the conditions:

  cor_start = current row start_val
  cor_end = current row max_val

I looked into shift(), but could not figure out how to set the above conditions. Thanks!
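For reference, the conditions above can be written as a vectorised base R sketch. The prev vector is typed in by hand from the bullet list (it is not computed here), and the case where the previous max_val equals end_val, which the conditions leave open, is assumed to fall into the second branch.

```r
# Columns copied from the example table
start_val <- c(12, 1, 7, 7, 12, 1)
end_val   <- c(15, 11, 11, 11, 15, 6)
max_val   <- c(13, 8, 9.5, 11, 15, 6)

# Previous matching row for each row, taken by hand from the bullet list
# (NA = no previous matching row)
prev <- c(NA, NA, 2, 3, 1, 2)

# max_val of the previous matching row; indexing with NA yields NA
prev_max <- max_val[prev]

# Assumption: prev_max == end_val is treated like prev_max > end_val
cor_start <- ifelse(is.na(prev_max), start_val,
                    ifelse(prev_max < end_val, prev_max, end_val))
cor_end   <- ifelse(is.na(prev_max), max_val, end_val)

print(data.frame(cor_start, cor_end))
```

Applied literally, the conditions give cor_end = 11 for row 3 (previous max_val 8 < end_val 11), while the desired table lists 9.5. Every other value agrees with the desired table.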

Something like:

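library(data.table)

# Explode the list column: one row per (original row, phase) pair; rn keeps the original row number
dta_transformed <- dta[,.(rn = .I, phase = unlist(phase)), by = setdiff(names(dta), 'phase')][
  # max_val of the previous row with the same phase (NA when there is none)
  , shifted_max := shift(max_val), by = phase][
    shifted_max < end_val, `:=` (cor_start = shifted_max, cor_end = end_val), by = phase][
      shifted_max > end_val, `:=` (cor_start = end_val, cor_end = end_val), by = phase][
        # remaining rows (no previous matching row, or shifted_max == end_val)
        is.na(cor_start), `:=` (cor_start = start_val, cor_end = max_val), by = phase][
          # collapse back to one row per original row and drop the helper columns
          , phase := paste(phase, collapse = ","), by = rn][!duplicated(rn),][
            , c("rn", "shifted_max") := NULL]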

However, the output I get is:

   criteria         phase start_val end_val max_val cor_start cor_end
1:        A        block3        12      15    13.0      12.0      13
2:        A block1,block2         1      11     8.0       1.0       8
3:        B        block2         7      11     9.5       8.0      11
4:        A        block2         7      11    11.0       9.5      11
5:        A        block3        12      15    15.0      13.0      15
6:        B        block1         1       6     6.0       6.0       6

Could it be that in row 3 the cor_end should be 11 in your desired output? The previous matching row (row 2) has a lower max_val (8) than the current end_val (11), so by your conditions the current end_val should be taken.
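The chain above hinges on two steps: unlist(phase) turns the list column into one row per phase, and shift(max_val) by phase then fetches the max_val of the previous row with that phase. A minimal sketch on rows 2 and 3 of the example, where id is a hypothetical stand-in for the rn column:

```r
library(data.table)

# Rows 2 and 3 of the example; id is a hypothetical stand-in for rn
dt <- data.table(
  id      = 2:3,
  phase   = list(c('block1', 'block2'), 'block2'),
  max_val = c(8.0, 9.5)
)

# One output row per (id, phase) pair: unlist() flattens each row's list cell,
# and the length-1 max_val is recycled to match
long <- dt[, .(phase = unlist(phase), max_val = max_val), by = id]

# Previous max_val within each phase; NA where the phase appears for the first time
long[, prev_max := shift(max_val), by = phase]

print(long)
```

Row 3 (block2) picks up prev_max = 8 from row 2, while both exploded rows of row 2 get NA because neither block1 nor block2 has appeared earlier.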

Here is also a tidyverse approach, which is slightly more readable:

library(tidyverse)

dta %>% mutate(rn = row_number()) %>%
  unnest(phase) %>%
  group_by(phase) %>%
  mutate(
    cor_start = case_when(
      lag(max_val) < end_val ~ lag(max_val),
      lag(max_val) > end_val ~ end_val,
      TRUE ~ start_val
    ),
    cor_end = if_else(!is.na(lag(max_val)), end_val, max_val)
  ) %>% group_by(rn) %>%
  mutate(
    phase = paste(phase, collapse = ",")
  ) %>% ungroup() %>% select(-rn) %>% distinct()

Here is a different approach which uses pmin() instead of ifelse() and utilises the fill parameter of the shift() function. Furthermore, it reduces the number of grouping operations:

library(data.table)
dta[, rn := .I]
dta[dta[, .(phase2 = unlist(phase)), by = rn], on = "rn"][
  , `:=`(cor_start = pmin(shift(max_val, fill = start_val[1]), end_val), 
         cor_end = max_val), by = phase2][
    , .SD[1], by = rn][
      , c("rn", "phase2") := NULL][]
   criteria         phase start_val end_val max_val cor_start cor_end
1:        A        block3        12      15    13.0      12.0    13.0
2:        A block1,block2         1      11     8.0       1.0     8.0
3:        B        block2         7      11     9.5       8.0     9.5
4:        A        block2         7      11    11.0       9.5    11.0
5:        A        block3        12      15    15.0      13.0    15.0
6:        B        block1         1       6     6.0       6.0     6.0
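To see why pmin() with a fill value can stand in for the nested conditions on cor_start, the sketch below takes the block2 group of the exploded example table and compares both formulations. It assumes start_val <= end_val in the first row of each group. Equality between the previous max_val and end_val needs no special handling, since both branches then give the same number.

```r
library(data.table)

# block2 group after exploding phase: original rows 2, 3 and 4
start_val <- c(1, 7, 7)
end_val   <- c(11, 11, 11)
max_val   <- c(8, 9.5, 11)

# Nested conditions, with NA marking "no previous matching row"
prev_na    <- shift(max_val)
via_ifelse <- ifelse(is.na(prev_na), start_val,
                     ifelse(prev_na < end_val, prev_na, end_val))

# Same result: the fill value makes the first row fall back to start_val
prev_fill <- shift(max_val, fill = start_val[1])
via_pmin  <- pmin(prev_fill, end_val)

print(via_ifelse)
print(via_pmin)
```

Note that the chain above sets cor_end = max_val for every row, which reproduces the desired table but not the cor_end = end_val rule for rows that have a previous match.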
