Flag rows based on multiple conditions on specific columns in data.table

I have a data.table with several "performance" columns for specific years and a column named "expected performance". I want to create a new column called FLAG that marks rows for manual review based on these two conditions:

  1. Any of the performance columns has a negative value.
  2. The expected performance column differs from any of the performance columns by more than 50%.

A mock data.table similar to the one I have:

library(data.table)
dt <- data.table(Id = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
                 Name = c("ABCD", "ACBD", "ACCD", "ADBN", "ADDD", "DBCA", "CBDA"),
                 Type = c("T", "B", "B", "T", "T", "B", "B"),
                 Sold = c(500, 300, 350, 500, 350, 400, 450),
                 Baseline = c(2000, 2100, 2000, 1500, 1890, 1900, 2000),
                 Perf_2016 = c(-200, 420, 800, 900, -10, 75, 400),
                 Perf_2017 = c(500, 300, -20, 700, 50, 80, 370),
                 Perf_2018 = c(1000, 400, 600, 800, 40, 500, 300),
                 ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350))
dt

Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019
N23 ABCD T   500  2000     -200      500       1000      1500
N34 ACBD B   300  2100     420       300       400       380
N11 ACCD B   350  2000     800       -20       600       500
N65 ADBN T   500  1500     900       700       800       850
N55 ADDD T   350  1890     -10       50        40        30
N78 DBCA B   400  1900     75        80        500       400
N88 CBDA B   450  2000     400       370       300       350

For this data.table, the desired output adds the FLAG column as shown below:

    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019  FLAG
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE

  1. Any of the performance columns has a negative value.
  2. The expected performance column differs from any of the performance columns by more than 50%.

In other words, there are common min and max bounds for these columns:

  • the min is max(0, ExpPerf*0.5)
  • the max is ExpPerf*1.5

So...

dt[, v := !Reduce(`&`, 
  lapply(.SD, between, pmax(0, ExpPerf_2019*0.5), ExpPerf_2019*1.5)
), .SDcols=grep("^Perf_", names(dt), value=TRUE)]

    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019     v
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE

How it works:

  • between checks whether each value of a column lies between the min and max
  • lapply applies the check to each column, returning a list
  • Reduce with & checks whether all columns meet the condition
  • ! negates the result, so we identify rows where at least one column fails the condition
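The steps above can also be run one at a time outside of `j`, which makes the intermediate objects easy to inspect. This is only a sketch on the question's mock data; the names `perf_cols`, `lower`, `upper`, `in_bounds`, `all_ok` and `flag` are illustrative and not part of the original answer.

```r
library(data.table)

# Mock data from the question (only the columns needed here)
dt <- data.table(
  Id = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
  Perf_2016 = c(-200, 420, 800, 900, -10, 75, 400),
  Perf_2017 = c(500, 300, -20, 700, 50, 80, 370),
  Perf_2018 = c(1000, 400, 600, 800, 40, 500, 300),
  ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350)
)

# Hypothetical helper names, introduced for illustration
perf_cols <- grep("^Perf_", names(dt), value = TRUE)
lower <- pmax(0, dt$ExpPerf_2019 * 0.5)  # row-wise lower bound
upper <- dt$ExpPerf_2019 * 1.5           # row-wise upper bound

# Step 1: between + lapply give one logical vector per performance column
in_bounds <- lapply(dt[, perf_cols, with = FALSE], between, lower, upper)

# Step 2: Reduce with & is TRUE only where every column is within bounds
all_ok <- Reduce(`&`, in_bounds)

# Step 3: negate, so TRUE marks rows where at least one column is out of bounds
flag <- !all_ok
flag
# [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
```

Printing `in_bounds` shows which individual column caused a row to be flagged, which may help during the manual review.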

between, & and ! are all vectorized, so we end up with a vector of results, one for each row. I would probably write this sequence with magrittr so the steps are easier to follow:

library(magrittr)

dt[, v := .SD %>% 
  lapply(between, pmax(0, ExpPerf_2019*0.5), ExpPerf_2019*1.5) %>%
  Reduce(f=`&`) %>%
  not
, .SDcols=grep("^Perf_", names(dt), value=TRUE)]

not is an alias for !, offered by magrittr for convenience.

.SD is a special symbol for the subset of data operated on inside the j part of DT[i, j, by]. In this case there is no i or by, so only .SDcols does any subsetting (it selects the columns of interest).
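A small sketch of what `.SD` contains when `.SDcols` is set, using a two-row cut of the mock data. The names `perf_cols`, `sd_names` and `ratio` are made up for this illustration.

```r
library(data.table)

# Two rows of the question's mock data, enough to show the behaviour
dt <- data.table(
  Id = c("N23", "N34"),
  Perf_2016 = c(-200, 420),
  Perf_2017 = c(500, 300),
  ExpPerf_2019 = c(1500, 380)
)

perf_cols <- grep("^Perf_", names(dt), value = TRUE)

# Inside j, .SD holds only the columns listed in .SDcols
sd_names <- dt[, names(.SD), .SDcols = perf_cols]
sd_names
# [1] "Perf_2016" "Perf_2017"

# Columns outside .SDcols (here ExpPerf_2019) can still be referenced by name in j
ratio <- dt[, lapply(.SD, function(x) x / ExpPerf_2019), .SDcols = perf_cols]
```

This is why the answer's code can pass `ExpPerf_2019` to `between` even though that column is not part of `.SD`.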

Comment

  • The code would be simpler if the OP chose to format the data in long format.
  • My answer uses the same steps as Gilean's, but is vectorised instead of calculating per row.
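To illustrate the long-format remark, here is one possible version, assuming the data is reshaped with `melt` and the flag is joined back onto the wide table afterwards. The names `long`, `year`, `perf`, `out_of_bounds` and `flags` are hypothetical, and this is a sketch rather than the answerer's own code.

```r
library(data.table)

# Mock data from the question (only the columns needed here)
dt <- data.table(
  Id = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
  Perf_2016 = c(-200, 420, 800, 900, -10, 75, 400),
  Perf_2017 = c(500, 300, -20, 700, 50, 80, 370),
  Perf_2018 = c(1000, 400, 600, 800, 40, 500, 300),
  ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350)
)

# Long format: one row per (Id, year); ExpPerf_2019 is carried along as an id column
long <- melt(
  dt,
  id.vars = c("Id", "ExpPerf_2019"),
  measure.vars = grep("^Perf_", names(dt), value = TRUE),
  variable.name = "year",
  value.name = "perf"
)

# With a single value column, the bounds check is one vectorised call
long[, out_of_bounds := !between(perf, pmax(0, ExpPerf_2019 * 0.5), ExpPerf_2019 * 1.5)]

# An Id is flagged if any of its yearly values is out of bounds
flags <- long[, .(FLAG = any(out_of_bounds)), by = Id]

# Update join: bring the flag back onto the wide table
dt[flags, on = "Id", FLAG := i.FLAG]
dt$FLAG
# [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
```

In this layout no `lapply` or `Reduce` is needed, at the cost of a reshape and a join, and it assumes `Id` uniquely identifies a row.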

You can use the following code to check for your two conditions:

dt[, FLAG := any(.SD < 0 | .SD < ExpPerf_2019 - .5*ExpPerf_2019 | .SD > ExpPerf_2019 + .5*ExpPerf_2019),
   by = Id,
   .SDcols = grep("^Perf", colnames(dt), value = TRUE)
   ]

The result:

> dt
    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019  FLAG
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE
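
As a sanity check, both approaches from this page can be run on the same mock data and compared. This sketch assumes `Id` is unique per row (as it is here), since the grouped version relies on `by = Id` to evaluate one row at a time.

```r
library(data.table)

# Mock data from the question (only the columns needed here)
dt <- data.table(
  Id = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
  Perf_2016 = c(-200, 420, 800, 900, -10, 75, 400),
  Perf_2017 = c(500, 300, -20, 700, 50, 80, 370),
  Perf_2018 = c(1000, 400, 600, 800, 40, 500, 300),
  ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350)
)

perf_cols <- grep("^Perf_", names(dt), value = TRUE)

# Vectorised version: one between() call per column, combined with &
dt[, v := !Reduce(`&`,
  lapply(.SD, between, pmax(0, ExpPerf_2019 * 0.5), ExpPerf_2019 * 1.5)
), .SDcols = perf_cols]

# Grouped version: any() evaluated once per Id
dt[, FLAG := any(.SD < 0 |
                 .SD < ExpPerf_2019 - .5 * ExpPerf_2019 |
                 .SD > ExpPerf_2019 + .5 * ExpPerf_2019),
   by = Id,
   .SDcols = perf_cols]

# Both columns agree on this data
identical(dt$v, dt$FLAG)
# [1] TRUE
```

On a table with many rows the grouped version creates one group per row, so the vectorised form is likely to be faster, though that is not measured here.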
