简体   繁体   中英

Summing the difference of all values of one vector that are less than the values in another

I have the following code below to try and loop through a sequence and select values below these values in a sequence and find the difference from another value. For large datasets, this can take a long time. Is there a way to vectorize something like this without looping through the sequence to improve performance?

a <- seq(1, 10, by=0.25)
b <- seq(1, 10, by=1)

c <- vector('list', length(b))

i <- 1
for (n in b){
    c[[i]] <- sum(n - a[n >= a])
    i <- i + 1
}

data.frame(c)

I've tried to use data.table to bin the data and find the difference, but cannot figure out how to calculate the difference from every value less than the bin value.

library(data.table)

min.n <- 1
max.n <- 10 
a <- data.table(seq(min.n, max.n, by=0.5))
colnames(a) <- 'a'
b <- seq(min.n+1, max.n+1, by=1)

bins <- findInterval(a$a,b)
a[,bins:= bins+2]
a[, diff:= bins - a]

Here is an option using data.table using rolling join:

library(data.table)
A <- data.table(a, key="a")
B <- data.table(b, key="b")

A[, c("N", "cs") := .(.I, cumsum(a))]

A[B, on=.(a=b), roll=Inf, N * b - cs]

sum a[a <= n] can be replaced with cumsum (ie cs here) and rolling join will find those a that are less than b . Replace sum(n - cs) with a mathematical formula involving the summation symbol so that sum(constant) = number of elements in summation * constant.

output:

[1]   0.0   2.5   9.0  19.5  34.0  52.5  75.0 101.5 132.0 166.5

edit: some timings for reference

timing code:

set.seed(0L)
library(data.table)
n <- 1e5L
a <- rnorm(n)
b <- rnorm(n/10L)
A <- data.table(a, key="a")
B <- data.table(b, key="b")

mtd0 <- function() A[B, on = .(a <= b), sum(i.b - x.a), by = .EACHI]$V1

mtd1 <- function() {
    A[, c("N", "cs") := .(.I, cumsum(a))]
    A[B, on=.(a=b), roll=Inf, N * b - cs]
}

all.equal(mtd0(), mtd1())
#[1] TRUE

microbenchmark::microbenchmark(times=1L, mtd0(), mtd1())

timings:

Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 mtd0() 2998.208000 2998.208000 2998.208000 2998.208000 2998.208000 2998.208000     1
 mtd1()    7.807637    7.807637    7.807637    7.807637    7.807637    7.807637     1

With data.table , this can be achieved by aggregating in a non-equi join :

library(data.table)
data.table(a)[data.table(b), on = .(a <= b), sum(i.b - x.a), by = .EACHI]$V1
 [1] 0.0 2.5 9.0 19.5 34.0 52.5 75.0 101.5 132.0 166.5

In a way, it is similar to MattB's approach but combines the cartesian product CJ() and subsetting in the non-equi join thereby avoiding to create data which will be filtered out subsequently.

Note that the x. prefix is required to pick the a column from the first data.table.


Alternatively, sum(ib - xa) can be re-written as .N * b - sum(xa) where the special symbol .N denotes the number of rows in a group.

data.table(a)[data.table(b), on = .(a <= b), .N * b - sum(x.a), by = .EACHI]$V1
 [1] 0.0 2.5 9.0 19.5 34.0 52.5 75.0 101.5 132.0 166.5

A base R solution with findInterval , which is fast.

i <- findInterval(b, a)
sapply(seq_along(i), function(j)sum(b[j] - a[1:i[j]]))
# [1]   0.0   2.5   9.0  19.5  34.0  52.5  75.0 101.5 132.0 166.5

Something like this?

library(data.table)
a <- seq(1, 10, by=0.25)
b <- seq(1, 10, by=1)

all.combinations <- CJ(a, b)  # Get all possible combinations
all.combinations[b>=a, sum(b-a), by=b]  # Filter for b>=a, then sum the difference for each value of b

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM