How to convert a list of nested dataframes into a count matrix based on common values in the dataframe

I have a long list of data frames, each holding a set of genes. A toy example is below.

Output of dput(list1):

list(ENDOSS = structure(list(ENDOSS = c("CDKN1C", "SOX6", "TGFB2"
)), row.names = c(NA, -3L), class = "data.frame"), ENDOSSSD = structure(list(
    ENDOSSSD = c("CDKN1C", "SOX6", "TGFB2")), row.names = c(NA, 
-3L), class = "data.frame"), GASTRIN = structure(list(GASTRIN = c("IKBKB", 
"KIT", "SERPINE1")), row.names = c(NA, -3L), class = "data.frame"), 
    METCC = structure(list(METCC = character(0)), row.names = character(0), class = "data.frame"))

The toy list looks like this:

list1
    ENDOSS
         "CDKN1C", "SOX6", "TGFB2" 
    ENDOSSSD
         "CDKN1C", "SOX6", "TGFB2"
    GASTRIN
          "IKBKB", "KIT", "SERPINE1"
    METCC

I would like to transform this list into a count matrix. Based on the example, the output should look like this.

             CDKN1C IKBKB KIT SERPINE1 SOX6 TGFB2
    ENDOSS        1     0   0        0    1     1
    ENDOSSSD      1     0   0        0    1     1
    GASTRIN       0     1   1        1    0     0
    METCC         0     0   0        0    0     0

Any help would be appreciated. Thanks.

We can use mtabulate from qdapTools after converting the single column of each data frame in the list to a vector:

library(qdapTools)
mtabulate(lapply(list1, unlist))
         CDKN1C IKBKB KIT SERPINE1 SOX6 TGFB2
ENDOSS        1     0   0        0    1     1
ENDOSSSD      1     0   0        0    1     1
GASTRIN       0     1   1        1    0     0
METCC         0     0   0        0    0     0
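
If an extra dependency on qdapTools is not wanted, a similar table can be built in base R. The following is only a minimal sketch: it assumes list1 is a named list with at least one non-empty element, and the names genes_by_set, set, gene and count_table are illustrative, not part of the answer above.

```r
# Toy list from the question
list1 <- list(
  ENDOSS   = data.frame(ENDOSS   = c("CDKN1C", "SOX6", "TGFB2"), stringsAsFactors = FALSE),
  ENDOSSSD = data.frame(ENDOSSSD = c("CDKN1C", "SOX6", "TGFB2"), stringsAsFactors = FALSE),
  GASTRIN  = data.frame(GASTRIN  = c("IKBKB", "KIT", "SERPINE1"), stringsAsFactors = FALSE),
  METCC    = data.frame(METCC    = character(0), stringsAsFactors = FALSE)
)

# Flatten each data frame to a plain character vector of genes
genes_by_set <- lapply(list1, function(df) as.character(unlist(df, use.names = FALSE)))

# One entry per (set, gene) pair; keeping every list name as a factor level
# means empty sets such as METCC still get a row of zeros
set  <- factor(rep(names(genes_by_set), lengths(genes_by_set)),
               levels = names(genes_by_set))
gene <- unlist(genes_by_set, use.names = FALSE)

# Cross-tabulate sets against genes
count_table <- table(set, gene)
count_table
```

For the toy list this gives a 4 x 6 table in which the METCC row is all zeros. Wrapping the result in unclass() turns it into a plain integer matrix if a table object is not wanted.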

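One approach is to combine the list of data frames into a single data frame with bind_rows, reshape it to long format so that all values sit in one column, and then pivot back to wide format with the count of each value. Note that list elements with zero rows, such as METCC, contribute nothing to bind_rows and so do not appear in the result.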

library(dplyr)
library(tidyr)

bind_rows(list1, .id = 'name') %>%
  pivot_longer(cols = -name, names_to = NULL, 
               values_drop_na = TRUE) %>%
  pivot_wider(names_from = value, values_from = value, 
              values_fn = length, values_fill = 0)

#   name     CDKN1C  SOX6 TGFB2 IKBKB   KIT SERPINE1
#  <chr>     <int> <int> <int> <int> <int>    <int>
#1 ENDOSS        1     1     1     0     0        0
#2 ENDOSSSD      1     1     1     0     0        0
#3 GASTRIN       0     0     0     1     1        1
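
To keep empty elements such as METCC as all-zero rows, one option is to join the counts back onto the full set of list names and fill the resulting NA values with 0. This is a sketch that assumes list1 is a named list as in the question; counts and counts_all are illustrative names.

```r
library(dplyr)
library(tidyr)

# Toy list from the question
list1 <- list(
  ENDOSS   = data.frame(ENDOSS   = c("CDKN1C", "SOX6", "TGFB2"), stringsAsFactors = FALSE),
  ENDOSSSD = data.frame(ENDOSSSD = c("CDKN1C", "SOX6", "TGFB2"), stringsAsFactors = FALSE),
  GASTRIN  = data.frame(GASTRIN  = c("IKBKB", "KIT", "SERPINE1"), stringsAsFactors = FALSE),
  METCC    = data.frame(METCC    = character(0), stringsAsFactors = FALSE)
)

# Same pipeline as above: long format, then wide format with counts
counts <- bind_rows(list1, .id = "name") %>%
  pivot_longer(cols = -name, names_to = NULL, values_drop_na = TRUE) %>%
  pivot_wider(names_from = value, values_from = value,
              values_fn = length, values_fill = 0)

# Start from every list name so that empty sets get a row,
# then replace the NA counts introduced by the join with 0
counts_all <- tibble(name = names(list1)) %>%
  left_join(counts, by = "name") %>%
  mutate(across(-name, ~ replace_na(.x, 0L)))

counts_all
```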


 