
How to merge rows of binary matrix based on substring rowname matches?

If the rownames of two rows in the binary matrix match up to the 4th . delimiter, merge those rows: a column in the merged row is 1 if either row has a 1 in it. Also, remove the 4th . delimiter and everything after it from the rownames.

Sample Data:

# Assigned to dat1, the name used in the answer below
dat1 <- structure(list(DNMT3A = c(1, 0, 0, 0, 0), IGF2R = c(1, 0, 0, 0, 1),
    NBEA = c(1, 0, 0, 0, 1), ITGB5 = c(0, 1, 0, 0, 0)), row.names = c("TCGA.2Z.A9J1.01A.11D.A382.10",
"TCGA.B9.A5W9.01A.11D.A28G.10", "TCGA.2Z.A9JM.01A.13D.A44J.12", "TCGA.GL.A59R.01A.11D.A26P.10",
"TCGA.2Z.A9JM.01A.12D.A42J.10"), class = "data.frame")

Desired output:

structure(list(DNMT3A = c(1, 0, 0, 0), IGF2R = c(1, 0, 1, 0), 
    NBEA = c(1, 0, 1, 0), ITGB5 = c(0, 1, 0, 0)), row.names = c("TCGA.2Z.A9J1.01A", 
"TCGA.B9.A5W9.01A", "TCGA.2Z.A9JM.01A", "TCGA.GL.A59R.01A"), class = "data.frame")

Try this:

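# Group rows by the first 16 characters of the rowname (the first four "."-delimited fields)
split(dat1, substring(rownames(dat1), 1, 16)) |>
  # Collapse each multi-row group to a single row: 1 if any row in the group has a 1
  lapply(function(z) if (nrow(z) == 1) z else t(apply(z, 2, function(col) +any(col > 0)))) |>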
  do.call(rbind, args = _)
#                  DNMT3A IGF2R NBEA ITGB5
# TCGA.2Z.A9J1.01A      1     1    1     0
# TCGA.2Z.A9JM.01A      0     1    1     0
# TCGA.B9.A5W9.01A      0     0    0     1
# TCGA.GL.A59R.01A      0     0    0     0

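Note that using args = _ with |> requires R 4.2.0 or later. If that is not available, either of the following can replace the last line of the code block (the first needs the magrittr pipe %>%, e.g. via library(magrittr); the second still needs the native pipe from R 4.1.0):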

... %>% do.call(rbind, .)
... |> (function(z) do.call(rbind, z))()

I'm naively assuming that every rowname has exactly the same number of characters in each .-delimited field, so that the first four fields always span 16 characters; you may need to adapt the substring(...) call if that assumption does not hold.
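
If the field widths can vary, one option is to build the grouping key with a regular expression instead of a fixed width. The following is only a sketch of that idea, not the approach used above: it swaps split()/lapply() for base R's rowsum(), and the names key and merged are made up for illustration. It assumes every column is numeric 0/1 with no NA values.

```r
# Sample data from the question.
dat1 <- structure(list(DNMT3A = c(1, 0, 0, 0, 0), IGF2R = c(1, 0, 0, 0, 1),
    NBEA = c(1, 0, 0, 0, 1), ITGB5 = c(0, 1, 0, 0, 0)),
  row.names = c("TCGA.2Z.A9J1.01A.11D.A382.10", "TCGA.B9.A5W9.01A.11D.A28G.10",
    "TCGA.2Z.A9JM.01A.13D.A44J.12", "TCGA.GL.A59R.01A.11D.A26P.10",
    "TCGA.2Z.A9JM.01A.12D.A42J.10"),
  class = "data.frame")

# Keep the first four "."-delimited fields, whatever their lengths.
# Names with fewer than four fields do not match and are left unchanged.
key <- sub("^((?:[^.]+\\.){3}[^.]+).*$", "\\1", rownames(dat1), perl = TRUE)

# rowsum() adds up the rows within each key; reorder = FALSE keeps the
# groups in order of first appearance instead of sorting them.
merged <- rowsum(dat1, key, reorder = FALSE)

# Turn the sums back into 0/1 flags: a positive sum means at least one 1.
merged[] <- lapply(merged, function(col) as.numeric(col > 0))

merged
```

Because reorder = FALSE is used, the rows should come out in the order the keys first appear in dat1 rather than alphabetically, which is the order shown in the desired output.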
