R data.table sum number of columns exceeding threshold

Question

For each observation (row), I would like to count the number of columns whose values exceed a threshold. Additionally, I would like to specify those column names and thresholds as vectors (cols, th).

Take the example data set:

x <- data.table(x1=c(1,2,3),x2=c(3,2,1))

The goal is to create a new column exceed.count holding the number of columns, among x1 and x2, whose values exceed their respective thresholds. Assume the thresholds for both x1 and x2 are 2:

th <- c(2,2)

The function could be defined as:

fn <- function(z,th) (sum(z[,x1]>th[1],z[,x2]>th[2]))

The number of columns exceeding the thresholds is then calculated by:

x[,exceed.count:=fn(.SD,th),by=seq_len(nrow(x))]

The results are:

   x1 x2 exceed.count
1:  1  3            1
2:  2  2            0
3:  3  1            1

What I would like to do is specify the column names as a vector, e.g.:

cols <- c("x1","x2")

I was playing around with a function of the form:

fn.i <- function(z,i) (sum(z[,cols[i],with=FALSE] > th[i]))

which works for a single i, but how do I vectorize this across the elements of cols? (cols and th will always be the same length.)

Answer 1

I think there is an easier way to solve your problem:

x<-data.table(x1=c(1,2,3),x2=c(3,2,1))
th<-c(2,2)
x[,exceed.count:=sum(.SD>th),by=seq_len(nrow(x))]

Or, taking into account your input (only a subset of columns):

x<-data.table(x1=c(1,2,3),x2=c(3,2,1))
sd.cols = c("x1")
th<-c(2)
x[,exceed.count:=sum(.SD>th),by=seq_len(nrow(x)), .SDcols=sd.cols]

Or, keeping the full threshold vector and indexing the entry for the selected column:

x<-data.table(x1=c(1,2,3),x2=c(3,2,1))
sd.cols = c("x1")
th<-c(2,2)
x[,exceed.count:=sum(.SD>th[1]),by=seq_len(nrow(x)), .SDcols=sd.cols]
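
For the cols and th vectors from the question, the row-by-row grouping can also be avoided by comparing one column at a time. This is only a minimal sketch, assuming cols and th have the same length and the same order; Map and Reduce are base R helpers brought in for illustration, not part of the answer above:

```r
library(data.table)

x <- data.table(x1 = c(1, 2, 3), x2 = c(3, 2, 1))
cols <- c("x1", "x2")
th <- c(2, 2)

# Assumption from the question: one threshold per column, in the same order.
stopifnot(length(cols) == length(th))

# Map() pairs column k of .SD with th[k] and returns one logical vector per column;
# Reduce(`+`) adds those vectors element-wise, giving a per-row count.
x[, exceed.count := as.integer(Reduce(`+`, Map(`>`, .SD, th))), .SDcols = cols]
print(x)
```

Because each comparison runs on a whole column, no by=seq_len(nrow(x)) is needed, which may matter on tables with many rows.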

Answer 2

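@JonnyCrunch's approach of specifying a subset of columns with .SDcols=sd.cols works fine, as long as the number of columns actually compared equals length(th): that is length(sd.cols) when .SDcols is given and ncol(x) otherwise. If the lengths differ, vector recycling will mess things up.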

Here's an alternative with shorter syntax (but it will be less performant for very wide tables):

x[,exceed.count:=sum(.SD>th), by=seq_len(nrow(x)) ]
- no need to explicitly specify .SDcols; let it default to all columns
- define the threshold vector th for all columns, using the don't-care value +Inf for those columns you don't want counted

> x <- data.table(x0=4:6, x1=1:3, x2=3:1, x3=7:5)

   x0 x1 x2 x3
1:  4  1  3  7
2:  5  2  2  6
3:  6  3  1  5

> th <- c(+Inf, 2, +Inf, 2) 

> fn <- function(z,th) (z>th)

> x[,exceed.count:=sum(.SD>th), by=seq_len(nrow(x)) ]

   x0 x1 x2 x3 exceed.count
1:  4  1  3  7            1
2:  5  2  2  6            1
3:  6  3  1  5            2
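
If padding th with +Inf feels fragile, one possible variation is to name the thresholds after the columns they apply to. The sketch below is an assumption-laden illustration rather than part of this answer: the named vector th is hypothetical, and sweep() and rowSums() are base R functions used in place of the per-row sum.

```r
library(data.table)

x <- data.table(x0 = 4:6, x1 = 1:3, x2 = 3:1, x3 = 7:5)

# Hypothetical named threshold vector: the names select the columns to test,
# so columns that should not be counted need no +Inf placeholder.
th <- c(x1 = 2, x3 = 2)

# sweep() compares column k of the matrix with th[k];
# rowSums() then counts the TRUE values in each row.
x[, exceed.count := rowSums(sweep(as.matrix(.SD), 2, th, FUN = ">")),
  .SDcols = names(th)]
print(x)
```

Converting .SD to a matrix assumes the selected columns are all numeric, which holds for this example.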

Answer 3

Here's one way to get around iteration over rows:

x <- data.table(x1=c(1,2,3), x2=c(3,2,1))
thL <- list(x1 = 2, x2 = 2)

nm = names(thL)
x[, n := 0L]
for (i in seq_along(thL)) x[thL[i], on=sprintf("%s>%s", nm[i], nm[i]), n := n + 1L][]

   x1 x2 n
1:  1  3 1
2:  2  2 0
3:  3  1 1
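
The same idea of looping over columns instead of rows can be written without a join. This is a rough sketch, assuming the cols and th vectors from the question; the helper variable hit is introduced here only for illustration:

```r
library(data.table)

x <- data.table(x1 = c(1, 2, 3), x2 = c(3, 2, 1))
cols <- c("x1", "x2")
th <- c(2, 2)

x[, n := 0L]
for (k in seq_along(cols)) {
  # Row numbers where column cols[k] exceeds its own threshold th[k].
  hit <- which(x[[cols[k]]] > th[k])
  # Increment the counter by reference for just those rows.
  x[hit, n := n + 1L]
}
print(x)
```

The loop runs once per column, so its cost grows with length(cols) rather than with nrow(x).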

3 answers

solution1
1 ACCEPTED 2019-03-01 19:42:55

solution2
1 2019-03-05 10:43:55

solution3
0 2019-03-07 00:02:42
