
Matrix algebra in R: assign values to an empty matrix from another matrix and a table

I have a sparse matrix of species presences and absences across many sites, i.e. a sites x species matrix. I also have a table that assigns each species to a cluster.

What I would like is a new matrix (also sites x species, but with one column per cluster, named after the first species in that cluster) that holds the number of species present per cluster at each site.

I have a for loop that does the job, but my original matrix is very big (several thousand rows and columns), so the loop is computationally expensive. Any idea how to achieve this using matrix algebra or something more elegant?

My dataset is arranged as follows:

# 1. Species grouped in clusters

memb <- c(1,2,3,4,2,5)
names(memb) <- c("s1", "s2", "s3", "s4", "s5", "s6")
 

# 2. Number of clusters

z <- length(unique(memb))
z

[1] 5


# 3. Community matrix of species presence-absence across sites
d <- data.frame(grids=c("v1", "v1", "v2", "v2", "v3", "v3", "v3", 
                         "v1", "v3", "v4", "v2", "v1", "v4", "v3"),
                 sp=c("s1", "s3", "s2", "s3", "s1", "s2", "s3", 
                           "s4", "s4", "s4", "s5", "s6", "s5", "s5"))
M <- as.data.frame.matrix(table(d$grids, d$sp))

library(Matrix)
M <- Matrix(as.matrix(M), sparse=TRUE)
M

4 x 6 sparse Matrix of class "dgCMatrix"
   s1 s2 s3 s4 s5 s6
v1  1  .  1  1  .  1
v2  .  1  1  .  1  .
v3  1  1  1  1  1  .
v4  .  .  .  1  1  .

# 4. Create empty matrices: cluster_comm (sites x clusters) and tmp_comm (sites x species, columns ordered by cluster)
cluster_comm <- Matrix(0, dim(M)[[1]], z)
tmp_comm <- Matrix(0, dim(M)[[1]], dim(M)[[2]])
rownames(cluster_comm) <- rownames(M)
rownames(tmp_comm) <- rownames(M)
colnames(tmp_comm) <- names(memb)[order(memb, decreasing = FALSE)]
colnames(cluster_comm) <- colnames(M)[1:z]

# 5. For loop to add species to reduced empty matrix based on group membership
for (m in 1:dim(M)[[1]]) {
  tmp_comm[m, ] <- as.numeric(M[m, names(memb)[order(memb,decreasing = FALSE)]])
  for (i in 1:z) {
    names <- names(memb)[memb == i]
    cluster_comm[m, i] <- sum(tmp_comm[m, names])
    colnames(cluster_comm)[i] <- names[[1]]
  }
}

# Expected outcome
cluster_comm

4 x 5 sparse Matrix of class "dgCMatrix"
   s1 s2 s3 s4 s6
v1  1  .  1  1  1
v2  .  2  1  .  .
v3  1  2  1  1  .
v4  .  1  .  1  .

Starting from step 3, here is an ugly solution that may be somewhat faster. I am not sure whether the merge makes sense in your case, so test it on larger data. Converting M to a data.table might also improve performance.

library(magrittr)
library(reshape2)
M <- as.data.frame.matrix(table(d$grids, d$sp))

M %>% cbind(.,rw = rownames(.)) %>%
      melt %>%
      merge(.,
             (memb %>% 
                   melt %>% 
                   cbind(., gr = rownames(.))), by.x = 'variable',
              by.y = 'gr',
              all.x = T) %>%
       dcast(., rw ~ value.y , value.var = 'value.x', fun.aggregate = sum )

Should return:

Using rw as id variables
  rw 1 2 3 4 5
1 v1 1 0 1 1 1
2 v2 0 2 1 0 0
3 v3 1 2 1 1 0
4 v4 0 1 0 1 0

Sparse matrix:

Assuming M is the sparse matrix from the question (not the data frame used above),

library(data.table)
m2 <- as(M, "dgTMatrix")
dt2 <- data.table(row=m2@i+1, col=m2@j+1, value=m2@x)

# Either match the names of memb to the column indices, or the other way around.
# Stripping the "s" prefix relies on column j of M being species "sj".
names(memb) %<>% gsub('s','',.) 
dt2 %>% merge(., 
              memb %>%
                   melt %>%
                   cbind(., rw = rownames(.) %>% as.numeric),
              by.x = 'col',
              by.y = 'rw', 
              all.x=T ) %>% 
        dcast(., row ~ value.y , value.var = 'value.x', fun.aggregate = sum )

Should return:

   row 1 2 3 4 5
1:   1 1 0 1 1 1
2:   2 0 2 1 0 0
3:   3 1 2 1 1 0
4:   4 0 1 0 1 0
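
Since the question asks for matrix algebra, the same grouping can probably be written as a single sparse product: multiply M by a species x cluster indicator matrix. This is only a minimal sketch, not something benchmarked on large data. The names grp, ids and indicator, and the "cluster" column labels, are made up for illustration, and it assumes every column name of M appears in names(memb).

```r
library(Matrix)

# Example data from the question
memb <- c(s1 = 1, s2 = 2, s3 = 3, s4 = 4, s5 = 2, s6 = 5)
d <- data.frame(
  grids = c("v1", "v1", "v2", "v2", "v3", "v3", "v3",
            "v1", "v3", "v4", "v2", "v1", "v4", "v3"),
  sp    = c("s1", "s3", "s2", "s3", "s1", "s2", "s3",
            "s4", "s4", "s4", "s5", "s6", "s5", "s5")
)
M <- Matrix(as.matrix(as.data.frame.matrix(table(d$grids, d$sp))),
            sparse = TRUE)

# Cluster id of each column of M, in the column order of M
# (assumption: every species in M has an entry in memb)
grp <- memb[colnames(M)]
ids <- sort(unique(grp))

# Species x cluster indicator: entry (s, k) is 1 when species s is in cluster k
indicator <- sparseMatrix(
  i = seq_along(grp),
  j = match(grp, ids),
  x = rep(1, length(grp)),
  dims = c(ncol(M), length(ids)),
  dimnames = list(colnames(M), paste0("cluster", ids))  # hypothetical labels
)

# One sparse product sums the species columns within each cluster
cluster_comm <- M %*% indicator
cluster_comm
```

On the example data this gives the same counts as the expected outcome in the question, with the result staying sparse throughout.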

It's basically the same solution as Nutle's, but using dplyr and tidyr.

First I transform memb into a data.frame:

memb <- c(1,2,3,4,2,5)
names(memb) <- c("s1", "s2", "s3", "s4", "s5", "s6")

membership <- data.frame(id=names(memb), group=memb)

Next I join it with d:

library(dplyr)
library(tidyr)

d %>% 
  left_join(membership, by=c("sp" = "id")) %>%
  count(grids, group) %>%
  pivot_wider(names_from="group", 
              names_prefix="s", 
              names_sort=TRUE, 
              values_from="n", 
              values_fill=0)

which yields

# A tibble: 4 x 6
  grids    s1    s2    s3    s4    s5
  <chr> <int> <int> <int> <int> <int>
1 v1        1     0     1     1     1
2 v2        0     2     1     0     0
3 v3        1     2     1     1     0
4 v4        0     1     0     1     0
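
If the result needs to stay sparse, as in the question, one option might be to skip the wide tibble and build the sites x clusters matrix directly from the long table d. The sketch below leans on sparseMatrix() adding up the values of repeated (i, j) pairs. The names sites and clusters are made up for illustration, and it assumes each site/species pair occurs at most once in d.

```r
library(Matrix)

# Example data from the question
memb <- c(s1 = 1, s2 = 2, s3 = 3, s4 = 4, s5 = 2, s6 = 5)
d <- data.frame(
  grids = c("v1", "v1", "v2", "v2", "v3", "v3", "v3",
            "v1", "v3", "v4", "v2", "v1", "v4", "v3"),
  sp    = c("s1", "s3", "s2", "s3", "s1", "s2", "s3",
            "s4", "s4", "s4", "s5", "s6", "s5", "s5")
)

# Row index: site of each record; column index: cluster of that record's species
# as.character() guards against sp being a factor (indexing by integer codes)
sites    <- factor(d$grids)
clusters <- factor(memb[as.character(d$sp)])

# Repeated (site, cluster) pairs are summed, which yields the per-cluster counts
cluster_comm <- sparseMatrix(
  i = as.integer(sites),
  j = as.integer(clusters),
  x = rep(1, nrow(d)),
  dims = c(nlevels(sites), nlevels(clusters)),
  dimnames = list(levels(sites), paste0("s", levels(clusters)))
)
cluster_comm
```

The column names follow the names_prefix="s" convention used in the pivot_wider call, so s2 here means cluster 2 rather than species s2.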
