Flag rows based on multiple conditions on specific columns in data.table
I have a data.table with multiple columns of "performance" in specific years and a column named "expected performance". I want to create a new column called FLAG which would indicate rows flagged for manual review based on these two conditions:
- Any of the performance columns has a negative value
- The expected performance column is different from any of the performance columns by more than 50%.
A mock data.table similar to the one I have:
library(data.table)
dt <- data.table(Id = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
Name = c("ABCD", "ACBD", "ACCD", "ADBN", "ADDD", "DBCA", "CBDA"),
Type = c("T", "B", "B", "T", "T", "B", "B"),
Sold = c(500, 300, 350, 500, 350, 400, 450),
Baseline = c(2000, 2100, 2000, 1500, 1890, 1900, 2000),
Perf_2016 = c(-200, 420, 800, 900, -10, 75, 400),
Perf_2017 = c(500, 300, -20, 700, 50, 80, 370),
Perf_2018 = c(1000, 400, 600, 800, 40, 500, 300),
ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350))
dt
Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019
N23 ABCD T 500 2000 -200 500 1000 1500
N34 ACBD B 300 2100 420 300 400 380
N11 ACCD B 350 2000 800 -20 600 500
N65 ADBN T 500 1500 900 700 800 850
N55 ADDD T 350 1890 -10 50 40 30
N78 DBCA B 400 1900 75 80 500 400
N88 CBDA B 450 2000 400 370 300 350
For this data.table the desired output would add the FLAG column as seen below:
Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019 FLAG
1: N23 ABCD T 500 2000 -200 500 1000 1500 TRUE
2: N34 ACBD B 300 2100 420 300 400 380 FALSE
3: N11 ACCD B 350 2000 800 -20 600 500 TRUE
4: N65 ADBN T 500 1500 900 700 800 850 FALSE
5: N55 ADDD T 350 1890 -10 50 40 30 TRUE
6: N78 DBCA B 400 1900 75 80 500 400 TRUE
7: N88 CBDA B 450 2000 400 370 300 350 FALSE
- Any of the performance columns has a negative value
- The expected performance column is different from any of the performance columns by more than 50%.
In other words, there are common min and max bounds for these columns: each performance value must lie between pmax(0, ExpPerf_2019*0.5) and ExpPerf_2019*1.5.
So...
dt[, v := !Reduce(`&`,
lapply(.SD, between, pmax(0, ExpPerf_2019*0.5), ExpPerf_2019*1.5)
), .SDcols=grep("^Perf_", names(dt), value=TRUE)]
Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019 v
1: N23 ABCD T 500 2000 -200 500 1000 1500 TRUE
2: N34 ACBD B 300 2100 420 300 400 380 FALSE
3: N11 ACCD B 350 2000 800 -20 600 500 TRUE
4: N65 ADBN T 500 1500 900 700 800 850 FALSE
5: N55 ADDD T 350 1890 -10 50 40 30 TRUE
6: N78 DBCA B 400 1900 75 80 500 400 TRUE
7: N88 CBDA B 450 2000 400 370 300 350 FALSE
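To see why a given row is flagged, it can help to print the bounds themselves. The following is a minimal sketch, assuming a trimmed copy of the mock data from the question; the names bounds, lower and upper are hypothetical and only used for illustration.

```r
library(data.table)

# Trimmed copy of the mock data from the question (only the columns needed here)
dt <- data.table(
  Id           = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
  Perf_2016    = c(-200, 420, 800, 900, -10, 75, 400),
  Perf_2017    = c(500, 300, -20, 700, 50, 80, 370),
  Perf_2018    = c(1000, 400, 600, 800, 40, 500, 300),
  ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350)
)

# Hypothetical helper table: the per-row lower and upper bounds.
# pmax(0, ...) keeps the lower bound non-negative, which also covers
# the "no negative values" condition.
bounds <- dt[, .(Id,
                 lower = pmax(0, ExpPerf_2019 * 0.5),
                 upper = ExpPerf_2019 * 1.5)]
bounds
```

For N78, for example, the bounds are 200 and 600, so Perf_2016 = 75 and Perf_2017 = 80 both fall below the lower bound and the row is flagged. Note that between includes the bounds by default, so a value sitting exactly at 50% of the expected performance is not flagged.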
How it works:
- between checks if a column lies between the min and max
- lapply applies the check to each column, returning a list
- Reduce with & checks whether all columns meet the condition
- ! negates the result, so we identify cases where at least one column fails the condition
between, & and ! are vectorized operators, so we end up with a vector of results, one for each row.
I would probably write this sequence in magrittr so the steps are simpler to follow:
library(magrittr)
dt[, v := .SD %>%
lapply(between, pmax(0, ExpPerf_2019*0.5), ExpPerf_2019*1.5) %>%
Reduce(f=`&`) %>%
not
, .SDcols=grep("^Perf_", names(dt), value=TRUE)]
not is a relabeling of !, offered by magrittr for convenience.
.SD is a special symbol for the subset of data operated on inside the j part of DT[i, j, by]. In this case, there is no i or by, so only .SDcols is subsetting (to select the columns of interest).
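The steps described above can also be unpacked into intermediate objects, which makes each stage easy to inspect. This is only a sketch on a trimmed copy of the mock data; perf_cols, in_bounds, all_ok and flag are hypothetical names introduced here, not part of the original answer.

```r
library(data.table)

# Trimmed copy of the mock data from the question
dt <- data.table(
  Id           = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
  Perf_2016    = c(-200, 420, 800, 900, -10, 75, 400),
  Perf_2017    = c(500, 300, -20, 700, 50, 80, 370),
  Perf_2018    = c(1000, 400, 600, 800, 40, 500, 300),
  ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350)
)

# .SDcols restricts .SD to the Perf_ columns; ExpPerf_2019 does not match "^Perf_"
perf_cols <- grep("^Perf_", names(dt), value = TRUE)

# Step 1: lapply + between gives one logical column per performance column
in_bounds <- dt[, lapply(.SD, between,
                         pmax(0, ExpPerf_2019 * 0.5), ExpPerf_2019 * 1.5),
                .SDcols = perf_cols]

# Step 2: Reduce with & collapses the columns into one vector:
# TRUE where every performance column is within bounds
all_ok <- Reduce(`&`, in_bounds)

# Step 3: ! negates it, so TRUE marks rows where at least one column fails
flag <- !all_ok
flag
```

Printing in_bounds shows which individual column caused a row to be flagged, which the one-line version hides.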
You can use the following code to check for your two conditions. It evaluates the test once per Id group, so it assumes each Id identifies exactly one row:
dt[, FLAG := any(.SD < 0 | .SD < ExpPerf_2019 - .5*ExpPerf_2019 | .SD > ExpPerf_2019 + .5*ExpPerf_2019),
by = Id,
.SDcols = grep("^Perf", colnames(dt), value = TRUE)
]
The result:
> dt
Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019 FLAG
1: N23 ABCD T 500 2000 -200 500 1000 1500 TRUE
2: N34 ACBD B 300 2100 420 300 400 380 FALSE
3: N11 ACCD B 350 2000 800 -20 600 500 TRUE
4: N65 ADBN T 500 1500 900 700 800 850 FALSE
5: N55 ADDD T 350 1890 -10 50 40 30 TRUE
6: N78 DBCA B 400 1900 75 80 500 400 TRUE
7: N88 CBDA B 450 2000 400 370 300 350 FALSE
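As a quick sanity check, the two approaches in this thread can be run side by side on the mock data and compared. This is a sketch on a trimmed copy of the data; perf_cols is a hypothetical helper name, and the comparison only shows agreement on this sample, not in general.

```r
library(data.table)

# Trimmed copy of the mock data from the question
dt <- data.table(
  Id           = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
  Perf_2016    = c(-200, 420, 800, 900, -10, 75, 400),
  Perf_2017    = c(500, 300, -20, 700, 50, 80, 370),
  Perf_2018    = c(1000, 400, 600, 800, 40, 500, 300),
  ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350)
)

# Computed once, before any new columns are added
perf_cols <- grep("^Perf_", names(dt), value = TRUE)

# between() + Reduce(): vectorized over all rows, no grouping needed
dt[, v := !Reduce(`&`, lapply(.SD, between,
                              pmax(0, ExpPerf_2019 * 0.5), ExpPerf_2019 * 1.5)),
   .SDcols = perf_cols]

# any() per group: relies on Id being unique so that each group is one row
dt[, FLAG := any(.SD < 0 |
                 .SD < ExpPerf_2019 - 0.5 * ExpPerf_2019 |
                 .SD > ExpPerf_2019 + 0.5 * ExpPerf_2019),
   by = Id, .SDcols = perf_cols]

# Both columns should agree row by row on this data
all(dt$v == dt$FLAG)
```

The grouped version does one small evaluation per Id, so on a large table the ungrouped between/Reduce form is likely to be faster, though that has not been measured here.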