Flag rows based on multiple conditions on specific columns in a data.table

Question

I have a data.table with several "performance" columns, one per year, and a column holding the "expected performance". I want to create a new column called FLAG that marks rows for manual review when either of these two conditions holds:

Any of the performance columns has a negative value.
The expected performance column differs from any of the performance columns by more than 50%.

A mock data.table similar to the one I have:

library(data.table)
dt <- data.table(Id = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
                 Name = c("ABCD", "ACBD", "ACCD", "ADBN", "ADDD", "DBCA", "CBDA"),
                 Type = c("T", "B", "B", "T", "T", "B", "B"),
                 Sold = c(500, 300, 350, 500, 350, 400, 450),
                 Baseline = c(2000, 2100, 2000, 1500, 1890, 1900, 2000),
                 Perf_2016 = c(-200, 420, 800, 900, -10, 75, 400),
                 Perf_2017 = c(500, 300, -20, 700, 50, 80, 370),
                 Perf_2018 = c(1000, 400, 600, 800, 40, 500, 300),
                 ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350))
dt

    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019
1: N23 ABCD    T  500     2000      -200       500      1000         1500
2: N34 ACBD    B  300     2100       420       300       400          380
3: N11 ACCD    B  350     2000       800       -20       600          500
4: N65 ADBN    T  500     1500       900       700       800          850
5: N55 ADDD    T  350     1890       -10        50        40           30
6: N78 DBCA    B  400     1900        75        80       500          400
7: N88 CBDA    B  450     2000       400       370       300          350

For this data.table, the desired output adds the FLAG column as shown below:

    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019  FLAG
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE

Answer 1

Any of the performance columns has a negative value

The expected performance column is different from any of the performance columns by more than 50%.

In other words, these columns share common min and max bounds:

the min is max(0, ExpPerf*0.5)
the max is ExpPerf*1.5
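As a worked illustration of these bounds, here is the arithmetic for a single row (N78 from the mock data). This is only a sketch; the variable names exp_perf, perf, lower, upper and outside are made up for the example.

```r
# Worked example for one row (N78 in the mock data)
exp_perf <- 400                 # ExpPerf_2019 for N78
perf     <- c(75, 80, 500)      # Perf_2016, Perf_2017, Perf_2018

lower <- max(0, exp_perf * 0.5) # 200: never below zero, never below 50% of expected
upper <- exp_perf * 1.5         # 600: never above 150% of expected

# TRUE where a performance value falls outside [lower, upper]
outside <- perf < lower | perf > upper
outside       # TRUE TRUE FALSE
any(outside)  # TRUE, so this row would be flagged
```

Both 2016 and 2017 fall below the lower bound of 200, which is why N78 is flagged even though none of its values is negative.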

So...

dt[, v := !Reduce(`&`, 
  lapply(.SD, between, pmax(0, ExpPerf_2019*0.5), ExpPerf_2019*1.5)
), .SDcols=grep("^Perf_", names(dt), value=TRUE)]

    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019     v
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE

How it works:

between checks whether each value of a column lies between the min and max
lapply applies the check to each column, returning a list
Reduce with & checks whether all columns meet the condition
! negates the result, so we identify cases where at least one column fails the condition
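The steps above can be pulled apart to inspect each intermediate result. The following is a sketch on a two-row subset of the mock data; the names perf_cols, lower, upper, checks, all_ok and flag are introduced here for illustration only.

```r
library(data.table)

# Two rows taken from the mock data in the question
dt <- data.table(Id = c("N34", "N78"),
                 Perf_2016 = c(420, 75),
                 Perf_2017 = c(300, 80),
                 Perf_2018 = c(400, 500),
                 ExpPerf_2019 = c(380, 400))

perf_cols <- grep("^Perf_", names(dt), value = TRUE)
lower <- pmax(0, dt$ExpPerf_2019 * 0.5)   # 190, 200
upper <- dt$ExpPerf_2019 * 1.5            # 570, 600

# Step 1: one logical vector per performance column
checks <- lapply(dt[, ..perf_cols], between, lower, upper)

# Step 2: combine the columns element-wise with &, one value per row
all_ok <- Reduce(`&`, checks)

# Step 3: negate, so TRUE marks rows where at least one column failed
flag <- !all_ok
flag   # FALSE TRUE
```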

between, & and ! are all vectorized, so we end up with a vector of results, one for each row. I would probably write this sequence with magrittr so the steps are simpler to follow:

library(magrittr)

dt[, v := .SD %>% 
  lapply(between, pmax(0, ExpPerf_2019*0.5), ExpPerf_2019*1.5) %>%
  Reduce(f=`&`) %>%
  not
, .SDcols=grep("^Perf_", names(dt), value=TRUE)]

not is an alias for !, offered by magrittr for convenience.

.SD is a special symbol for the subset of data operated on inside the j part of DT[i, j, by]. In this case, there is no i or by, so only .SDcols is subsetting (to select the columns of interest).
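A small sketch of that behaviour, on a cut-down table: .SD contains only the columns named in .SDcols, while other columns such as ExpPerf_2019 can still be referred to by name inside j. The names perf_cols, sd_names and ratios are illustrative.

```r
library(data.table)

dt <- data.table(Id = c("N34", "N78"),
                 Perf_2016 = c(420, 75),
                 Perf_2017 = c(300, 80),
                 ExpPerf_2019 = c(380, 400))

perf_cols <- c("Perf_2016", "Perf_2017")

# .SD holds only the columns selected by .SDcols
sd_names <- dt[, names(.SD), .SDcols = perf_cols]
sd_names   # "Perf_2016" "Perf_2017"

# Columns outside .SDcols are still visible by name in j
ratios <- dt[, lapply(.SD, function(x) x / ExpPerf_2019), .SDcols = perf_cols]
ratios
```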

Comment

The code would be simpler if the OP chose to store the data in long format.
My answer uses the same steps as Gilean's, but is vectorised instead of calculating per row.
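One way the long-format suggestion from the first comment might look is sketched below. This is not code from the thread: it assumes Id is unique per row, and the names long, flags, Year and Perf are invented for the example.

```r
library(data.table)

dt <- data.table(Id = c("N34", "N78", "N55"),
                 Perf_2016 = c(420, 75, -10),
                 Perf_2017 = c(300, 80, 50),
                 Perf_2018 = c(400, 500, 40),
                 ExpPerf_2019 = c(380, 400, 30))

# One row per Id and year; ExpPerf_2019 is repeated for each year
long <- melt(dt, id.vars = c("Id", "ExpPerf_2019"),
             measure.vars = grep("^Perf_", names(dt), value = TRUE),
             variable.name = "Year", value.name = "Perf")

# A single vectorised comparison, then any() within each Id
flags <- long[, .(FLAG = any(!between(Perf,
                                      pmax(0, ExpPerf_2019 * 0.5),
                                      ExpPerf_2019 * 1.5))),
              by = Id]

# Join the flag back onto the wide table (assumes Id is unique)
dt[flags, on = "Id", FLAG := i.FLAG]
dt$FLAG   # FALSE TRUE TRUE
```

In long format the check no longer needs lapply and Reduce, because all performance values sit in a single column.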

Answer 2

You can use the following code to check for your two conditions:

dt[, FLAG := any(.SD < 0 | .SD < ExpPerf_2019 - .5*ExpPerf_2019 | .SD > ExpPerf_2019 + .5*ExpPerf_2019),
   by = Id,
   .SDcols = grep("^Perf", colnames(dt), value = TRUE)
   ]

The result:

> dt
    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019  FLAG
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE
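As a sanity check, both approaches can be run on the mock data and compared. This is a sketch that recreates the question's table and reuses the two snippets from the answers; note that the per-row version groups by Id, so it assumes Id is unique.

```r
library(data.table)

dt <- data.table(Id = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
                 Perf_2016 = c(-200, 420, 800, 900, -10, 75, 400),
                 Perf_2017 = c(500, 300, -20, 700, 50, 80, 370),
                 Perf_2018 = c(1000, 400, 600, 800, 40, 500, 300),
                 ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350))

perf_cols <- grep("^Perf_", names(dt), value = TRUE)

# Vectorised version (Answer 1)
dt[, v := !Reduce(`&`,
  lapply(.SD, between, pmax(0, ExpPerf_2019 * 0.5), ExpPerf_2019 * 1.5)
), .SDcols = perf_cols]

# Per-row version (Answer 2); relies on Id being unique
dt[, FLAG := any(.SD < 0 | .SD < ExpPerf_2019 * 0.5 | .SD > ExpPerf_2019 * 1.5),
   by = Id, .SDcols = perf_cols]

identical(dt$v, dt$FLAG)   # TRUE
```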

Answer 1: score 6, accepted, 2019-07-15 15:41:20

Answer 2: score 2, 2019-07-15 15:12:36
