简体   繁体   中英

Filter a table to keep only non-redundant groups

I've been performing phylogenetic analysis in R for a while, employing libraries like ape, phangorn and phytools.

While solving a problem, I've come to a presence/absence data.frame that specifies if genes of interest belong (or don't) to a certain group.

An example of this would be:

             gene11    gene25  gene33  gene54  gene55 gene65 gene73   gene88
group_1         1        1        0      0       0     0      0       0      
group_2         1        1        1      0       0     0      0       0      
group_3         1        0        1      0       0     0      0       0      
group_4         0        1        1      0       0     0      0       0      
group_5         0        0        0      1       1     0      0       0      
group_6         0        0        0      1       0     0      0       0      
group_7         0        0        0      0       1     0      0       0      
group_8         0        0        0      0       0     1      1       1      
group_9         0        0        0      0       0     1      1       0      
group_10        0        0        0      0       0     1      0       1      
group_11        0        0        0      0       0     0      1       1  

As expected when dealing with groups of biological entities, there many ways in which this entities relate: genes 11, 25 and 33 form a group, and also their relationships could be described smaller groups, depicting pairwise relationships.

So here is the important thing : group_2 , group_5 and group_8 are the biologically relevant groups of genes, and they aren't known beforehand as the relevant groups . The other, smaller groups, arise as a consequence of the relationship shown in these relevant groups: group_1 relates gene11 and gene25, but is a group that is nested in the broader (and relevant) group_2. The same applies in the other cases: group_8 depicts a relationship between gene65, gene73 and gene88; the other groups concerning these genes (group_9, group_10 and group_11) are only subgroups depicting the pairwise relationships existing among the genes that are part of the broader group group_8.

What is known beforehand is that genes form clusters of disjoint groups, each cluster being composed of other (progressively smaller) clusters. I'm interested in capturing the biggest-disjoint groups.

A clear definition of the problem was done by another user (@Shree):

Find minimum number of groups such that all other groups are a sub-group of at least one of those groups. Also a group has to have at least 2 genes ie two 1s in a row. Also assuming, 1,01,0 is a subgroup of 1,1,1,0 but 0,1,1,1 is not a subgroup of 1,1,1,0 .

Thanks to all in advance!

Here's a way using mixed integer programming approach. I am using ompr for mathematical modeling and glpk (free open source) as solver. Modeling logic is provided as comments in code.

I think the problem can be mathematically described as follows -

Filter dataframe to minimize number of rows such that sum of all columns is 1. Selected rows are called primary groups and every other row should be a subgroup of a primary group. A column (gene) can belong to only one primary group. Any unselected row is a subgroup of a primary group when subgroup <= primary group at all positions (columns). Therefore, (0,0,1,1) is subgroup of (0,1,1,1) but (1,0,1,1) is not a subgroup of (0,1,1,1) .

library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)

gene_mat <- as.matrix(df)
nr <- nrow(gene_mat)
nc <- ncol(gene_mat)

model <- MIPModel() %>% 
  # binary variable x[i] is 1 if row i is selected else 0
  add_variable(x[i], i = 1:nr, type = "binary") %>% 
  # minimize total rows selected
  set_objective(sum_expr(x[i], i = 1:nr), "min") %>% 
  # sum of columns of selected rows must be = 1
  add_constraint(sum_expr(gene_mat[i,j]*x[i], i = 1:nr) == 1, j = 1:nc) %>% 
  solve_model(with_ROI(solver = "glpk"))

# get rows selected
group_rows <- model %>% 
  get_solution(x[i]) %>% 
  filter(value > 0) %>% 
  pull(i) %>% 
  print()

result <- df[group_rows, ]

        gene11 gene25 gene33 gene54 gene55 gene65 gene73 gene88
group_2      1      1      1      0      0      0      0      0
group_5      0      0      0      1      1      0      0      0
group_8      0      0      0      0      0      1      1      1

Important Note -

The above formulation does not address subgroup <= primary group but instead relies on the fact that OP mentions "What is known beforehand is that genes form clusters of disjoint groups" . This means cases like shown below do not exist in data since rows 1,3,4 do not form disjoint groups ie column 3 would belong to 2 primary groups which is not allowed.

1 1 0 0 0
0 1 0 0 0
1 0 1 0 0 <- this row is not a subgroup of any row
0 0 1 1 1

Anyways, here's code to do a safety check to make sure all unselected rows are subgroup of only one primary group -

test <- lapply(group_rows, function(x) {
  sweep(df, 2, as.numeric(df[x, ]), "<=") %>% 
    {which(rowSums(.) == ncol(df))}
})

# all is okay if below returns TRUE
length(Reduce(intersect, test)) == 0

Data -

df <- structure(list(
  gene11 = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,0L, 0L), 
  gene25 = c(1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
  gene33 = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
  gene54 = c(0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), 
  gene55 = c(0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L), 
  gene65 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L), 
  gene73 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L), 
  gene88 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L)), 
  class = "data.frame", 
  row.names = c("group_1", "group_2", "group_3", "group_4", 
                "group_5", "group_6", "group_7", "group_8",
                "group_9", "group_10", "group_11")
)

Here is one option where we split the data into chunks of data columns with a grouping index created with rep (here, the 1 and 2 corresponds to the first 3 and next 2 columns), then loop through the list , use filter_all to extract the rows with all 1s and mutate by replacing the NA to 0

library(dplyr)
library(purrr)
library(tibble)
split.default(df, rep(1:2, c(3, 2))) %>%
     map_dfr(~ .x %>%
                  rownames_to_column('rn') %>% 
                  filter_at(-1, all_vars(.==1))) %>% 
        mutate_all(replace_na, 0) %>%
        column_to_rownames('rn')
#.       gene1 gene2 gene3 gene4 gene5
#group_1     1     1     1     0     0
#group_7     0     0     0     1     1

data

df <- structure(list(gene1 = c(1L, 0L, 1L, 1L, 0L, 0L, 0L), gene2 = c(1L, 
1L, 1L, 0L, 0L, 0L, 0L), gene3 = c(1L, 1L, 0L, 1L, 0L, 0L, 0L
), gene4 = c(0L, 0L, 0L, 0L, 1L, 0L, 1L), gene5 = c(0L, 0L, 0L, 
0L, 0L, 1L, 1L)), .Names = c("gene1", "gene2", "gene3", "gene4", 
"gene5"), class = "data.frame", row.names = c("group_1", "group_2", 
"group_3", "group_4", "group_5", "group_6", "group_7"))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM