
quantile on a matrix in long format

I am trying to compute per-column quantiles of a matrix stored as a data.table in long format (rowid, colid, value). Currently I convert it into a Matrix::sparseMatrix and then compute the quantiles column by column. Is there a more efficient way to do this? (Using R 3.2.1 and data.table 1.9.5 from GitHub.)

require(data.table)
require(Matrix)

set.seed(100)
nobs <- 1000   #num rows in matrix
nvar <- 10    #num columns in matrix
density <- .1  #fraction of non-zero values in matrix

nrow <- round(density*nobs*nvar)
data.dt <- unique(data.table(obsid=sample(1:nobs,nrow,replace=TRUE), 
        varid=sample(1:nvar,nrow,replace=TRUE)))
data.dt <- data.dt[, value:=runif(.N)]

probs <- c(1,5,10,25,50,75,90,95,100)

#approach 1
system.time({
data.mat <- sparseMatrix(i=data.dt[,obsid], j=data.dt[,varid], x=data.dt[,value], dims=c(nobs,nvar))
quantile1.dt <- data.table(t(sapply(1:nvar, function(n) c(n,quantile(data.mat[,n], probs=probs/100, names=FALSE)))))
quantile1.dt <- setnames(quantile1.dt, c("varid",sprintf("p%02d",probs)))[order(varid)]
})

#approach 2
system.time({
quantile2.dt <- data.dt[, as.list(quantile(c(rep(0,nobs-.N), value), probs=probs/100, names=FALSE)), by=varid]
quantile2.dt <- setnames(quantile2.dt, c("varid",sprintf("p%02d",probs)))[order(varid)]
})

all.equal(quantile1.dt, quantile2.dt)

Update: I found the answer to this and wanted to share it, in case somebody else finds it useful! My original attempt was approach 1. The better way to compute the same result is approach 2, which pads each column's non-zero values with the missing zeros instead of building the matrix. The real advantage of approach 2 shows when nobs and nvar are large. For example, with nobs=100,000 and nvar=1,000, approach 1 takes 27 sec while approach 2 takes 4 sec!
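One caveat with approach 2 that is worth checking: a variable with no non-zero entries has no rows in the long table, so grouping with by=varid leaves it out of the result altogether. The sketch below is one possible way to add such variables back with all-zero quantiles. The toy data and the names dt, qs and full are made up for illustration, and the on= join argument assumes a newer data.table release than the 1.9.5 development version mentioned above.

```r
library(data.table)

nobs  <- 6                  # rows in the (implicit) matrix
nvar  <- 3                  # columns in the (implicit) matrix
probs <- c(25, 50, 100)

# Toy long-format data (hypothetical): variable 2 has no non-zero entries
dt <- data.table(obsid = c(1L, 2L, 4L),
                 varid = c(1L, 1L, 3L),
                 value = c(0.3, 0.9, 0.5))

# Approach 2: pad each group's values with the implicit zeros
qs <- dt[, as.list(quantile(c(rep(0, nobs - .N), value),
                            probs = probs / 100, names = FALSE)),
         by = varid]
setnames(qs, c("varid", sprintf("p%02d", probs)))

# Join onto the full set of variable ids; absent variables come back as NA
full <- qs[data.table(varid = seq_len(nvar)), on = "varid"]

# A variable with no entries is an all-zero column, so every quantile is 0
for (col in setdiff(names(full), "varid")) {
  set(full, which(is.na(full[[col]])), col, 0)
}
full
```

With the question's simulated data every variable has some non-zero entries, so this step makes no difference there; it only matters for sparser inputs.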

From your description it was a little hard (for me) to see what you wanted to do, so I'll work through a basic example.

set.seed(100)
nrow <- 10
ncol <- 5
prop <- 0.1
nobs <- round(prop*nrow*ncol)
s1 <- c(5,7,8,8,9) # sample(1:nrow, nobs, replace=T)
s2 <- c(1,3,3,4,4) # sample(1:ncol, nobs, replace=T)

# unique pairs
arr <- unique(array(c(s1,s2), dim=c(nobs,2)))

# random num for each unique pair
s3 <- c(0.1, 0.5, 0.8, 0.2, 0.4) # runif(length(arr[,1]))

# show data
data.frame(v1=arr[,1], v2=arr[,2], v3=s3)

#   v1 v2  v3
# 1  5  1 0.1
# 2  7  3 0.5
# 3  8  3 0.8
# 4  8  4 0.2
# 5  9  4 0.4

In this case, the sparse matrix representation is:

sm <- sparseMatrix(i=s1, j=s2, x=s3) # since all pairs are unique here

# row 1 corresponds to s1=1, ..., row 9 corresponds to s1=9
# column 1 corresponds to s2=1, ... column 4 corresponds to s2=4
sm

# [1,] .   . .   .  
# [2,] .   . .   .
# [3,] .   . .   .  
# [4,] .   . .   .  
# [5,] 0.1 . .   .  
# [6,] .   . .   .  
# [7,] .   . 0.5 .  
# [8,] .   . 0.8 0.2  
# [9,] .   . .   0.4

Column 1 of sm, which corresponds to s2=1, holds the values (0,0,0,0,0.1,0,0,0,0)', and so on. We can find the quantiles of each of these columns with:

q <- c(0.25, 0.5, 0.75, 1.0) # quantiles 
data.table(t(sapply(1:4, function(n) c(n,quantile(sm[,n], q)))))

#    V1 25% 50% 75% 100%
# 1:  1   0   0   0  0.1
# 2:  2   0   0   0  0.0
# 3:  3   0   0   0  0.8
# 4:  4   0   0   0  0.4

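(Note that sm has only 9 rows and 4 columns here, although the full matrix should be 10 x 5: when dims is not given, sparseMatrix() infers the dimensions from the largest row and column indices. Each column is therefore one zero short, and using 1:ncol in the sapply() call above would fail because sm has no fifth column. The sparseMatrix() approach only gives the right quantiles if the dimensions are passed explicitly, e.g. dims=c(nrow, ncol), as in the question.)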
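As a minimal sketch of the same computation with the dimensions passed explicitly through the dims argument of sparseMatrix(), so that the trailing all-zero row and column are kept. The names n_rows, n_cols, sm_full and quant_full are introduced here for illustration (n_rows and n_cols stand in for nrow and ncol above).

```r
library(Matrix)

s1 <- c(5, 7, 8, 8, 9)             # row indices
s2 <- c(1, 3, 3, 4, 4)             # column indices
s3 <- c(0.1, 0.5, 0.8, 0.2, 0.4)   # non-zero values
n_rows <- 10
n_cols <- 5
q <- c(0.25, 0.5, 0.75, 1.0)

# dims fixes the shape, so row 10 and column 5 exist even though they are empty
sm_full <- sparseMatrix(i = s1, j = s2, x = s3, dims = c(n_rows, n_cols))
dim(sm_full)

# One row of quantiles per column, including the all-zero column 5
quant_full <- t(sapply(seq_len(n_cols), function(n) quantile(sm_full[, n], q)))
quant_full
```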

What is the fastest way to do this? Suppose s1, s2, s3, nrow, ncol and arr are defined as above, and you want the quantiles of s3 for s2 = 1. You could, for instance, skip the matrix and pad the non-zero values with the missing zeros directly:

tmp <- s2==1
quantile( c( s3[tmp], rep(0, nrow-sum(tmp)) ), q)

This kind of approach could potentially be faster, but I think that for large data sets the sparseMatrix approach should also work well.
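The padding idea shown for s2 = 1 can be applied to every column in turn. The following is only a base-R sketch under the toy data above; pad_quantile, n_rows, n_cols and quant_pad are hypothetical names, with n_rows and n_cols standing in for nrow and ncol.

```r
s2 <- c(1, 3, 3, 4, 4)             # column indices
s3 <- c(0.1, 0.5, 0.8, 0.2, 0.4)   # non-zero values
n_rows <- 10
n_cols <- 5
q <- c(0.25, 0.5, 0.75, 1.0)

# Quantiles of column j: its non-zero values plus the implicit zeros
pad_quantile <- function(j) {
  vals <- s3[s2 == j]
  quantile(c(vals, rep(0, n_rows - length(vals))), q)
}

# One row per column; columns without any non-zero value give all zeros
quant_pad <- t(sapply(seq_len(n_cols), pad_quantile))
quant_pad
```

Because the zeros are counted against n_rows rather than read from a matrix, empty rows and columns are handled without any extra care about dimensions.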
