Let's say I have the following data table:
dta <- data.table(
criteria = c('A', 'A', 'B', 'A', 'A', 'B'),
phase = list('block3', c('block1', 'block2'), 'block2', 'block2', 'block3', 'block1'),
start_val = c(12.0, 1.0, 7.0, 7.0, 12.0, 1.0),
end_val = c(15.0, 11.0, 11.0, 11.0, 15.0, 6.0),
max_val = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0)
)
from which I need the resulting table with two additional columns, cor_start and cor_end:
dtb <- data.table(
criteria = c('A', 'A', 'B', 'A', 'A', 'B'),
phase = list('block3', c('block1', 'block2'), 'block2', 'block2', 'block3', 'block1'),
start_val = c(12.0, 1.0, 7.0, 7.0, 12.0, 1.0),
end_val = c(15.0, 11.0, 11.0, 11.0, 15.0, 6.0),
max_val = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0),
cor_start = c(12.0, 1.0, 8.0, 9.5, 13.0, 6.0),
cor_end = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0)
)
The new columns need to be calculated with reference to the phase column, by checking whether any previous row has a phase value matching the current row.
For better understanding, in this example: row 3 matches row 2 (block2), row 4 matches row 3 (block2), row 5 matches row 1 (block3), and row 6 matches row 2 (block1);
however, row 1 and row 2 have no previous matching phase rows. Note that phase is of type list.
So, when there is a previous matching row, below are the conditions:
if (max_val in previous matching row is < end_val in current row)
cor_start = previous matching row max_val
cor_end = current row end_val
if (max_val in previous matching row is > end_val in current row)
cor_start = current row end_val
cor_end = current row end_val
and when there is no previous matching row, below are the conditions:
cor_start = current row start_val
cor_end = current row max_val
I looked into shift(), but could not figure out how to apply the above conditions. Thanks!
Something like:
library(data.table)

# Expand the list column to one row per phase value (rn keeps the original row number),
# look up the previous max_val within each phase, apply the conditions, then collapse back.
dta_transformed <- dta[, .(rn = .I, phase = unlist(phase)), by = setdiff(names(dta), 'phase')][
  , shifted_max := shift(max_val), by = phase][
  shifted_max < end_val, `:=` (cor_start = shifted_max, cor_end = end_val), by = phase][
  shifted_max > end_val, `:=` (cor_start = end_val, cor_end = end_val), by = phase][
  is.na(cor_start), `:=` (cor_start = start_val, cor_end = max_val), by = phase][
  , phase := paste(phase, collapse = ","), by = rn][!duplicated(rn),][
  , c("rn", "shifted_max") := NULL]
However, the output I get is:
   criteria         phase start_val end_val max_val cor_start cor_end
1:        A        block3        12      15    13.0      12.0      13
2:        A block1,block2         1      11     8.0       1.0       8
3:        B        block2         7      11     9.5       8.0      11
4:        A        block2         7      11    11.0       9.5      11
5:        A        block3        12      15    15.0      13.0      15
6:        B        block1         1       6     6.0       6.0       6
Could it be that cor_end in row 3 should be 11 in your desired output? The previous matching row (2) has a lower max_val (8) than the current end_val (11), so the current end_val should be taken.
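A literal, row-by-row translation of the conditions in the question can help check this. The sketch below uses base R only, with plain vectors named after the table columns. It assumes that "previous matching row" means the most recent earlier row sharing at least one phase value with the current row, which the question does not state explicitly.

```r
# Example data from the question, as plain vectors plus a list for phase
phase     <- list("block3", c("block1", "block2"), "block2", "block2", "block3", "block1")
start_val <- c(12, 1, 7, 7, 12, 1)
end_val   <- c(15, 11, 11, 11, 15, 6)
max_val   <- c(13, 8, 9.5, 11, 15, 6)

n         <- length(phase)
cor_start <- numeric(n)
cor_end   <- numeric(n)

for (i in seq_len(n)) {
  # Earlier rows that share at least one phase value with row i
  shares_phase <- vapply(
    seq_len(i - 1),
    function(j) length(intersect(phase[[j]], phase[[i]])) > 0,
    logical(1)
  )
  prev <- which(shares_phase)

  if (length(prev) == 0) {
    # No previous matching row
    cor_start[i] <- start_val[i]
    cor_end[i]   <- max_val[i]
  } else {
    # Assumption: use the most recent matching row
    prev_max <- max_val[max(prev)]
    # Both stated branches reduce to the smaller of prev_max and end_val
    cor_start[i] <- min(prev_max, end_val[i])
    cor_end[i]   <- end_val[i]
  }
}

print(data.frame(cor_start, cor_end))
```

Under that assumption, the stated rules give cor_end = 11 for row 3, since row 2 is its previous matching row and 8 < 11.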
Here is also a tidyverse approach, which is slightly more readable:
library(tidyverse)
dta %>%
  mutate(rn = row_number()) %>%
  unnest(phase) %>%
  group_by(phase) %>%
  mutate(
    cor_start = case_when(
      lag(max_val) < end_val ~ lag(max_val),
      lag(max_val) > end_val ~ end_val,
      TRUE ~ start_val
    ),
    cor_end = if_else(!is.na(lag(max_val)), end_val, max_val)
  ) %>%
  group_by(rn) %>%
  mutate(
    phase = paste(phase, collapse = ",")
  ) %>%
  ungroup() %>%
  select(-rn) %>%
  distinct()
Here is a different approach which uses pmin() instead of ifelse() and utilises the fill parameter of the shift() function. Furthermore, it reduces the number of grouping operations:
library(data.table)
dta[, rn := .I]
dta[dta[, .(phase2 = unlist(phase)), by = rn], on = "rn"][
, `:=`(cor_start = pmin(shift(max_val, fill = start_val[1]), end_val),
cor_end = max_val), by = phase2][
, .SD[1], by = rn][
, c("rn", "phase2") := NULL][]
   criteria         phase start_val end_val max_val cor_start cor_end
1:        A        block3        12      15    13.0      12.0    13.0
2:        A block1,block2         1      11     8.0       1.0     8.0
3:        B        block2         7      11     9.5       8.0     9.5
4:        A        block2         7      11    11.0       9.5    11.0
5:        A        block3        12      15    15.0      13.0    15.0
6:        B        block1         1       6     6.0       6.0     6.0
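Why pmin() can stand in for the conditional is easiest to see on a single phase group. The sketch below takes rows 2 and 6 of the example (the block1 group) as plain vectors. The names prev_max, cor_start_pmin and cor_start_ifelse are illustrative only. It also assumes start_val never exceeds end_val, so that the fill value passes through pmin() unchanged for the first row of a group.

```r
library(data.table)

# Rows 2 and 6 of the example data, i.e. the "block1" group
start_val <- c(1, 1)
end_val   <- c(11, 6)
max_val   <- c(8, 6)

# max_val of the previous row in the group; the first row has no
# predecessor, so shift() fills that slot with start_val[1]
prev_max <- shift(max_val, fill = start_val[1])

# Element-wise minimum covers both branches of the condition:
# prev_max < end_val -> prev_max, otherwise -> end_val
cor_start_pmin <- pmin(prev_max, end_val)

# Equivalent explicit conditional, for comparison
cor_start_ifelse <- ifelse(prev_max < end_val, prev_max, end_val)

print(cor_start_pmin)
print(cor_start_ifelse)
```

Both vectors come out as 1 and 6, which are the cor_start values of rows 2 and 6 in the output above.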