I have a matrix with n rows of observations. Each observation is a frequency distribution over the features. I would like to transform the frequency distributions into probability distributions so that each row sums to 1; that is, each element of the matrix should be divided by its row sum.
I wrote the following R function that does the work but it is very slow with large matrices:
prob_dist <- function(x) {
  row_prob_dist <- function(row) {
    return(t(lapply(row, function(x, y = sum(row)) x / y)))
  }
  for (i in 1:nrow(x)) {
    if (i == 1) {
      p_dist <- row_prob_dist(x[i, ])
    } else {
      p_dist <- rbind(p_dist, row_prob_dist(x[i, ]))
    }
  }
  return(p_dist)
}
B = matrix(c(2, 4, 3, 1, 5, 7), nrow=3, ncol=2)
B
[,1] [,2]
[1,] 2 1
[2,] 4 5
[3,] 3 7
prob_dist(B)
[,1] [,2]
[1,] 0.6666667 0.3333333
[2,] 0.4444444 0.5555556
[3,] 0.3 0.7
Could you suggest an R function that does the job and/or tell me how I can optimise my function to perform faster?
Here's an attempt, but on a data frame instead of a matrix:
df <- data.frame(replicate(100,sample(1:10, 10e4, rep=TRUE)))
I tried a dplyr approach:
library(dplyr)
df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)
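As a side note, mutate_each() and funs() have since been deprecated in dplyr; a rough equivalent using across() (just a sketch, assuming dplyr >= 1.0) would be:
# Same idea with the newer across() verb: compute row sums, divide every
# other column by them, then drop the helper column
df %>%
  mutate(rs = rowSums(.)) %>%
  mutate(across(-rs, ~ .x / rs)) %>%
  select(-rs)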
Here are the results:
library(microbenchmark)
mbm = microbenchmark(
dplyr = df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs),
t = t(t(df) / rep(rowSums(df), each=ncol(df))),
apply = t(apply(df, 1, prop.table)),
times = 100
)
#> mbm
#Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 123.1894 124.1664 137.7076 127.3376 131.1523 445.8857 100
# t 384.6002 390.2353 415.6141 394.8121 408.6669 787.2694 100
# apply 1425.0576 1520.7925 1646.0082 1599.1109 1734.3689 2196.5003 100
Edit: @David's benchmark is more in line with the OP's setup, so I suggest you consider his approach if you are working with matrices.
Without apply, a vectorized solution in one line (transposing B so that each row sum, repeated ncol(B) times, lines up with the corresponding row of B in the column-major layout of t(B)):
t(t(B) / rep(rowSums(B), each=ncol(B)))
[,1] [,2]
[1,] 0.6666667 0.3333333
[2,] 0.4444444 0.5555556
[3,] 0.3000000 0.7000000
Or:
diag(1/rowSums(B)) %*% B
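Another base R option worth mentioning (just a sketch of the same idea) is sweep(), which divides each row (MARGIN = 1) by the corresponding row sum:
# Sweep the row sums out of B by division
sweep(B, 1, rowSums(B), "/")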
Actually, I gave it a quick thought and the best vectorization would simply be the following; it works because rowSums(B) has length nrow(B) and, since matrices are stored column-major, it is recycled down each column, so every element is divided by its own row sum:
B/rowSums(B)
# [,1] [,2]
# [1,] 0.6666667 0.3333333
# [2,] 0.4444444 0.5555556
# [3,] 0.3000000 0.7000000
Actually, @Steven's benchmark is misleading because the OP has a matrix, while Steven benchmarked on a data frame. Here's a benchmark with a matrix. For matrices, both vectorized solutions will be better than dplyr, which doesn't work on matrices directly.
set.seed(123)
m <- matrix(sample(1e6), ncol = 100)
library(dplyr)
library(microbenchmark)
Res <- microbenchmark(
dplyr = as.data.frame(m) %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs),
t = t(t(m) / rep(rowSums(m), each=ncol(m))),
apply = t(apply(m, 1, prop.table)),
DA = m/rowSums(m),
times = 100
)
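As a quick sanity check (a sketch on the side, not part of the benchmark; the names p_t, p_apply and p_DA are just illustrative), the matrix-based approaches should agree and every row of the result should sum to 1:
p_t     <- t(t(m) / rep(rowSums(m), each = ncol(m)))
p_apply <- t(apply(m, 1, prop.table))
p_DA    <- m / rowSums(m)
all.equal(p_t, p_apply)              # TRUE
all.equal(p_t, p_DA)                 # TRUE
all(abs(rowSums(p_DA) - 1) < 1e-12)  # TRUE: each row sums to 1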
I'm not sure that your function has any value, since you could just use the hist or density functions to accomplish the same result. Also, the use of apply would work, as mentioned. But it serves as a reasonable programming example.
There are several inefficiencies in your code.
You are using rbind instead of pre-allocating space for your output, which is extremely expensive. Pre-allocate instead:
out <- matrix(NA, nrow = nrow(B), ncol = ncol(B))
for (i in 1:nrow(B)) {
  out[i, ] <- row_prob_dist(B[i, ])
}
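Putting that together, a sketch of a pre-allocated version of the original function (the name prob_dist2 is just illustrative, and the per-row lapply helper is replaced by direct vectorized division):
prob_dist2 <- function(x) {
  # Pre-allocate the result instead of growing it with rbind()
  out <- matrix(NA_real_, nrow = nrow(x), ncol = ncol(x))
  for (i in 1:nrow(x)) {
    out[i, ] <- x[i, ] / sum(x[i, ])  # divide the row by its sum
  }
  out
}
prob_dist2(B)  # same probabilities as prob_dist(B), without the rbind overhead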