I have a sparse matrix of species presences and absences across many sites, i.e. a sites-by-species matrix. I also have a second table that assigns each species to a cluster.
What I would like to have is a new matrix (sites x clusters, with each cluster column labelled by one of its species) that holds the number of species from each cluster present at each site.
I have a for loop that does the job, but my original matrix is very large (several thousand rows and columns), so the loop is computationally expensive. Any idea how to achieve this using matrix algebra or something more elegant?
My dataset is arranged as follows:
# 1. Species grouped in clusters
memb <- c(1,2,3,4,2,5)
names(memb) <- c("s1", "s2", "s3", "s4", "s5", "s6")
# 2. Number of clusters
z <- length(unique(memb))
z
[1] 5
# 3. Community matrix of species presence-absence across sites
d <- data.frame(grids=c("v1", "v1", "v2", "v2", "v3", "v3", "v3",
"v1", "v3", "v4", "v2", "v1", "v4", "v3"),
sp=c("s1", "s3", "s2", "s3", "s1", "s2", "s3",
"s4", "s4", "s4", "s5", "s6", "s5", "s5"))
M <- as.data.frame.matrix(table(d$grids, d$sp))
library(Matrix)
M <- Matrix(as.matrix(M), sparse=TRUE)
M
4 x 6 sparse Matrix of class "dgCMatrix"
s1 s2 s3 s4 s5 s6
v1 1 . 1 1 . 1
v2 . 1 1 . 1 .
v3 1 1 1 1 1 .
v4 . . . 1 1 .
# 4. Create empty matrices: one collapsed to one column per cluster, and a temporary one with species ordered by cluster
cluster_comm <- Matrix(0, dim(M)[[1]], z)
tmp_comm <- Matrix(0, dim(M)[[1]], dim(M)[[2]])
rownames(cluster_comm) <- rownames(M)
rownames(tmp_comm) <- rownames(M)
colnames(tmp_comm) <- names(memb)[order(memb, decreasing = FALSE)]
colnames(cluster_comm) <- colnames(M)[1:z]
# 5. For loop to add species to reduced empty matrix based on group membership
for (m in 1:dim(M)[[1]]) {
tmp_comm[m, ] <- as.numeric(M[m, names(memb)[order(memb,decreasing = FALSE)]])
for (i in 1:z) {
names <- names(memb)[memb == i]
cluster_comm[m, i] <- sum(tmp_comm[m, names])
colnames(cluster_comm)[i] <- names[[1]]
}
}
# Expected outcome
cluster_comm
4 x 5 sparse Matrix of class "dgCMatrix"
s1 s2 s3 s4 s6
v1 1 . 1 1 1
v2 . 2 1 . .
v3 1 2 1 1 .
v4 . 1 . 1 .
Starting from step 3, here is an ugly solution that may be somewhat faster. I am not sure whether the merge makes sense in your case, though, so test this with larger data. Converting M to a data.table might also improve performance.
library(magrittr)
library(reshape2)
M <- as.data.frame.matrix(table(d$grids, d$sp))
M %>% cbind(.,rw = rownames(.)) %>%
melt %>%
merge(.,
(memb %>%
melt %>%
cbind(., gr = rownames(.))), by.x = 'variable',
by.y = 'gr',
all.x = T) %>%
dcast(., rw ~ value.y , value.var = 'value.x', fun.aggregate = sum )
Should return:
Using rw as id variables
rw 1 2 3 4 5
1 v1 1 0 1 1 1
2 v2 0 2 1 0 0
3 v3 1 2 1 1 0
4 v4 0 1 0 1 0
Assuming M is a sparse matrix:
library(data.table)
# Triplet (i, j, x) form; newer Matrix versions warn on as(M, "dgTMatrix")
m2 <- as(M, "TsparseMatrix")
dt2 <- data.table(row=m2@i+1, col=m2@j+1, value=m2@x)
#either match names of memb with cols, or the other way around..
names(memb) %<>% gsub('s','',.)
dt2 %>% merge(.,
memb %>%
melt %>%
cbind(., rw = rownames(.) %>% as.numeric),
by.x = 'col',
by.y = 'rw',
all.x=T ) %>%
dcast(., row ~ value.y , value.var = 'value.x', fun.aggregate = sum )
Should return:
row 1 2 3 4 5
1: 1 1 0 1 1 1
2: 2 0 2 1 0 0
3: 3 1 2 1 1 0
4: 4 0 1 0 1 0
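A variant of the sparse approach above that keeps the site names and leaves memb untouched, offered as a sketch rather than a benchmarked solution. It assumes the Matrix and data.table packages; the names trip, dt and wide are illustrative and not part of the original answer.

```r
library(Matrix)
library(data.table)

# Data from the question
memb <- c(s1 = 1, s2 = 2, s3 = 3, s4 = 4, s5 = 2, s6 = 5)
d <- data.frame(grids = c("v1", "v1", "v2", "v2", "v3", "v3", "v3",
                          "v1", "v3", "v4", "v2", "v1", "v4", "v3"),
                sp = c("s1", "s3", "s2", "s3", "s1", "s2", "s3",
                       "s4", "s4", "s4", "s5", "s6", "s5", "s5"))
M <- Matrix(as.matrix(as.data.frame.matrix(table(d$grids, d$sp))), sparse = TRUE)

# Triplet form stores only the non-zero cells, with 0-based indices
trip <- as(M, "TsparseMatrix")

# One row per non-zero cell, with the species already mapped to its cluster
dt <- data.table(site    = rownames(M)[trip@i + 1L],
                 cluster = unname(memb[colnames(M)[trip@j + 1L]]),
                 value   = trip@x)

# Sum within each site/cluster pair and spread the clusters into columns
wide <- dcast(dt, site ~ cluster, value.var = "value",
              fun.aggregate = sum, fill = 0)
wide
```

Because only the non-zero entries are expanded, the intermediate table grows with the number of presences rather than with sites times species.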
It's basically the same solution as Nutle's, but using dplyr and tidyr.
First I transform memb into a data.frame:
memb <- c(1,2,3,4,2,5)
names(memb) <- c("s1", "s2", "s3", "s4", "s5", "s6")
membership <- data.frame(id=names(memb), group=memb)
Next I join it with d:
library(dplyr)
library(tidyr)
d %>%
left_join(membership, by=c("sp" = "id")) %>%
count(grids, group) %>%
pivot_wider(names_from="group",
names_prefix="s",
names_sort=TRUE,
values_from="n",
values_fill=0)
which yields
# A tibble: 4 x 6
grids s1 s2 s3 s4 s5
<chr> <int> <int> <int> <int> <int>
1 v1 1 0 1 1 1
2 v2 0 2 1 0 0
3 v3 1 2 1 1 0
4 v4 0 1 0 1 0
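Since the question asked for matrix algebra, here is a minimal sketch of that route, assuming the Matrix package is acceptable. The idea is to build a sparse species-by-cluster indicator matrix (called G here, a name not in the original post) and multiply the community matrix by it. The column labels c1 to c5 are also an assumption; the question labels each cluster by one of its species instead.

```r
library(Matrix)

# Data from the question
memb <- c(s1 = 1, s2 = 2, s3 = 3, s4 = 4, s5 = 2, s6 = 5)
d <- data.frame(grids = c("v1", "v1", "v2", "v2", "v3", "v3", "v3",
                          "v1", "v3", "v4", "v2", "v1", "v4", "v3"),
                sp = c("s1", "s3", "s2", "s3", "s1", "s2", "s3",
                       "s4", "s4", "s4", "s5", "s6", "s5", "s5"))
M <- Matrix(as.matrix(as.data.frame.matrix(table(d$grids, d$sp))), sparse = TRUE)

# Make sure memb is in the same order as the columns of M
memb <- memb[colnames(M)]

# Indicator matrix: G[j, k] is 1 when species j belongs to cluster k
G <- sparseMatrix(i = seq_along(memb), j = memb, x = 1,
                  dims = c(length(memb), max(memb)),
                  dimnames = list(names(memb),
                                  paste0("c", seq_len(max(memb)))))

# (sites x species) %*% (species x clusters) gives sites x clusters counts
cluster_comm <- M %*% G
cluster_comm
```

The product stays sparse and replaces both loops with a single multiplication, so it should scale with the number of non-zero entries, though that is worth checking on the real data.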