I have several character vectors of genes containing names of the species in which they're found, and I made an UpSetR plot to show the number of species in common across genes. Now I'd like to do the opposite: plot the number of genes in common across species, but I don't know how to do it.
Example of what I have:
gene1 <- c("Panda", "Dog", "Chicken")
gene2 <- c("Human", "Panda", "Dog")
gene3 <- c("Human", "Panda", "Chicken")
...#About 20+ genes with 100+ species each
Example of what I would like to have as a result:
Panda <- c("gene1", "gene2", "gene3")
Dog <- c("gene1", "gene2")
Human <- c("gene2", "gene3")
Chicken <- c("gene1", "gene3")
...
I know it is conceptually easy, yet logistically more complicated. Can anyone give me a clue?
Thank you!
You can use unstack from base R:
unstack(stack(mget(ls(pattern="gene"))),ind~values)
$Chicken
[1] "gene1" "gene3"
$Dog
[1] "gene1" "gene2"
$Human
[1] "gene2" "gene3"
$Panda
[1] "gene1" "gene2" "gene3"
You can then export the elements of this list to the environment as separate variables with the list2env function.
Breakdown:
l <- mget(ls(pattern = "gene"))       # get all the gene vectors in a named list
m <- unstack(stack(l), ind ~ values)  # stack them, then unstack with the required formula
m
$Chicken
[1] "gene1" "gene3"
$Dog
[1] "gene1" "gene2"
$Human
[1] "gene2" "gene3"
$Panda
[1] "gene1" "gene2" "gene3"
list2env(m,.GlobalEnv)
Dog
[1] "gene1" "gene2"
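To see why this works, it can help to look at the intermediate data frame that stack() produces. The following is a minimal sketch, assuming the genes are kept in a named list (here called genes, a name chosen for illustration) rather than collected from the environment with mget. It also shows split() as an alternative to unstack() that gives the same grouping.

```r
# Example data from the question, stored in a named list
genes <- list(
  gene1 = c("Panda", "Dog", "Chicken"),
  gene2 = c("Human", "Panda", "Dog"),
  gene3 = c("Human", "Panda", "Chicken")
)

# stack() returns a long data frame with two columns:
#   values = the species, ind = the gene name (as a factor)
long <- stack(genes)

# Group the gene names by species; split() sorts the groups alphabetically
by_species <- split(as.character(long$ind), long$values)

# unstack() with the formula ind ~ values performs the same grouping
by_species_unstack <- unstack(long, ind ~ values)
```

Working from a list avoids relying on ls(pattern = "gene"), which would also pick up any other object whose name contains "gene".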
First of all, I think for most purposes it's better to store the gene vectors in a list, as in
genes <- list(gene1 = gene1, gene2 = gene2, gene3 = gene3)
Then one base R approach would be
genes.v <- unlist(genes)
names(genes.v) <- rep(names(genes), times = lengths(genes))
species <- lapply(unique(genes.v), function(g) names(genes.v)[g == genes.v])
names(species) <- unique(genes.v)
species
# $Panda
# [1] "gene1" "gene2" "gene3"
#
# $Dog
# [1] "gene1" "gene2"
#
# $Chicken
# [1] "gene1" "gene3"
#
# $Human
# [1] "gene2" "gene3"
genes.v is a named vector of all the species, with the genes as their names. However, when two species share the same gene, e.g., gene1, unlist() appends an index to the names, so they become gene11 and gene12. That's what I fix in the second line. In the third line I go over all the unique species and build the resulting list, and in the fourth line I add the species names.
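The renaming step can be checked on its own. This is a small sketch using the example data from the question; the variable name fixed is only for illustration.

```r
genes <- list(
  gene1 = c("Panda", "Dog", "Chicken"),
  gene2 = c("Human", "Panda", "Dog"),
  gene3 = c("Human", "Panda", "Chicken")
)

genes.v <- unlist(genes)

# unlist() appends an index to each list name: "gene11", "gene12", "gene13", "gene21", ...
auto_names <- names(genes.v)

# Repeating each list name once per element restores the plain gene names
fixed <- rep(names(genes), times = lengths(genes))
names(genes.v) <- fixed
```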
Put the data in a list, to begin with. That makes it easier to work with.
genes <- list(
gene1 = c("Panda", "Dog", "Chicken"),
gene2 = c("Human", "Panda", "Dog"),
gene3 = c("Human", "Panda", "Chicken")
)
Then we can get the species names from there.
species <- unique(unlist(genes))
With this data
> species
[1] "Panda" "Dog" "Chicken" "Human"
For each of these, we want to check if the name is contained in a gene. That is a job for Map (or its cousin lapply, but I like Map):
get_genes_for_species <- function(s) {
  # TRUE for every gene whose species vector contains s
  contained <- unlist(Map(function(gene) s %in% gene, genes))
  names(genes)[contained]
}
genes_per_species <- Map(get_genes_for_species, species)
Now you have a list of character vectors, one per species, containing the genes found in that species.
> genes_per_species
$Panda
[1] "gene1" "gene2" "gene3"
$Dog
[1] "gene1" "gene2"
$Chicken
[1] "gene1" "gene3"
$Human
[1] "gene2" "gene3"
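Since the question asks for the number of genes per species, the list above can be reduced to counts with lengths(). The sketch below is one possible variation, not the only way to do it: it swaps Map for sapply(..., simplify = FALSE) and vapply, which behave the same here but check that each membership test returns a single logical value.

```r
genes <- list(
  gene1 = c("Panda", "Dog", "Chicken"),
  gene2 = c("Human", "Panda", "Dog"),
  gene3 = c("Human", "Panda", "Chicken")
)
species <- unique(unlist(genes))

# simplify = FALSE keeps a list; the character vector species supplies the names
genes_per_species <- sapply(species, function(s) {
  contained <- vapply(genes, function(gene) s %in% gene, logical(1))
  names(genes)[contained]
}, simplify = FALSE)

# Number of genes found in each species (a named integer vector)
n_genes <- lengths(genes_per_species)
```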
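You can try this, which builds a table with one logical column per gene showing which species it is found in.
gene <- unique(c(gene1, gene2, gene3))  # note: holds the unique species names
TF <- data.frame(Species = gene)
TF$gene1 <- gene %in% gene1
TF$gene2 <- gene %in% gene2
TF$gene3 <- gene %in% gene3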
> TF
Species gene1 gene2 gene3
1 Panda TRUE TRUE TRUE
2 Dog TRUE TRUE FALSE
3 Chicken TRUE FALSE TRUE
4 Human FALSE TRUE TRUE
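With 20+ genes, writing one line per column becomes tedious. The following is a hedged sketch of building the same membership table in one step, assuming the genes are stored in a named list (called genes here for illustration). The names membership and genes_per_species_count are also made up for this example.

```r
genes <- list(
  gene1 = c("Panda", "Dog", "Chicken"),
  gene2 = c("Human", "Panda", "Dog"),
  gene3 = c("Human", "Panda", "Chicken")
)
species <- unique(unlist(genes))

# One column per gene, one row per species; TRUE where the species has the gene
membership <- sapply(genes, function(g) species %in% g)
rownames(membership) <- species

# Number of genes found in each species
genes_per_species_count <- rowSums(membership)

# Transposing gives one row per gene and one column per species,
# which is the orientation needed to treat species as the sets
by_gene <- t(membership)
```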
Here's a variation that embraces the tidyverse and puts the result in a neat data frame.
The trick is to concatenate the results with str_c inside summarise. Note that tibble() needs the gene vectors to have equal length, as they do in this example.
library(tidyverse)

tibble(gene1 = gene1,
       gene2 = gene2,
       gene3 = gene3) %>%
  gather(gene_name, gene_type) %>%   # long format: one row per gene/species pair
  group_by(gene_type) %>%
  summarise(genes = str_c(gene_name, collapse = ", "))
# A tibble: 4 x 2
gene_type genes
<chr> <chr>
1 Chicken gene1, gene3
2 Dog gene1, gene2
3 Human gene2, gene3
4 Panda gene1, gene2, gene3
I agree with Julius (above) that the best way to store gene vectors is in a list. A named list would be even better, as:
my_gene_list <- set_names(list(gene1, gene2, gene3), str_c("gene", 1:3) )
This would neatly produce the same result...
my_gene_list %>%
  as_tibble() %>%
  gather(gene_name, gene_type) %>%
  group_by(gene_type) %>%
  summarise(genes = str_c(gene_name, collapse = ", "))
# A tibble: 4 x 2
gene_type genes
<chr> <chr>
1 Chicken gene1, gene3
2 Dog gene1, gene2
3 Human gene2, gene3
4 Panda gene1, gene2, gene3