How to create vectors from the common elements of vectors in R

Question

I have several character vectors, one per gene, each containing the names of the species in which that gene is found. I made an UpSetR plot to show the number of species in common across genes. Now I'd like to do the opposite: plot the number of genes in common across species, but I don't know how to do it.

Example of what I have:

gene1 <- c("Panda", "Dog", "Chicken")
gene2 <- c("Human", "Panda", "Dog")
gene3 <- c("Human", "Panda", "Chicken")  
# ... about 20+ genes with 100+ species each

Example of what I would like to have as a result:

Panda <- c("gene1", "gene2", "gene3")
Dog <- c("gene1", "gene2")
Human <- c("gene2", "gene3")
Chicken <- c("gene1", "gene3")
...

I know it is conceptually easy, but in practice it is more complicated. Can anyone give me a clue?

Thank you!

Answer 1

You can use unstack from base R:

unstack(stack(mget(ls(pattern="gene"))),ind~values)
$Chicken
[1] "gene1" "gene3"

$Dog
[1] "gene1" "gene2"

$Human
[1] "gene2" "gene3"

$Panda
[1] "gene1" "gene2" "gene3"

You can then export the list elements to the environment as separate variables with the list2env function.

Breakdown:

 l = mget(ls(pattern="gene"))      # get all the objects whose names contain "gene" as a list
 m = unstack(stack(l), ind~values) # stack them, then unstack with the required formula
 m
$Chicken
[1] "gene1" "gene3"

$Dog
[1] "gene1" "gene2"

$Human
[1] "gene2" "gene3"

$Panda
[1] "gene1" "gene2" "gene3"

 list2env(m,.GlobalEnv)
 Dog
 [1] "gene1" "gene2"
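To see what the two steps do, here is a minimal sketch that starts from an explicit named list rather than collecting objects with ls() and mget(). The names genes, long and genes_per_species are only illustrative and do not come from the answer.

```r
# Illustrative data: same three genes as in the question
genes <- list(
  gene1 = c("Panda", "Dog", "Chicken"),
  gene2 = c("Human", "Panda", "Dog"),
  gene3 = c("Human", "Panda", "Chicken")
)

# stack() turns the named list into a two-column data frame:
# `values` holds the species, `ind` holds the gene name
long <- stack(genes)

# unstack() regroups the gene names (ind) by species (values)
genes_per_species <- unstack(long, ind ~ values)
genes_per_species
```

One thing to keep in mind: unstack() returns a list only when the groups differ in length. If every species happened to occur in the same number of genes, it would return a data frame instead.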

Answer 2

First of all, I think for most purposes it's better to store the gene vectors in a list, as in

genes <- list(gene1 = gene1, gene2 = gene2, gene3 = gene3)

Then one base R approach would be

genes.v <- unlist(genes)
names(genes.v) <- rep(names(genes), times = lengths(genes))
species <- lapply(unique(genes.v), function(g) names(genes.v)[g == genes.v])
names(species) <- unique(genes.v)
species
# $Panda
# [1] "gene1" "gene2" "gene3"
#
# $Dog
# [1] "gene1" "gene2"
#
# $Chicken
# [1] "gene1" "gene3"
#
# $Human
# [1] "gene2" "gene3"

genes.v is a named vector of all the species, with the genes as their names. However, when two or more species come from the same gene, e.g. gene1, unlist appends an index, so those names become gene11, gene12, and so on. That's what I fix in the second line. Then in the third line I go over all the unique species and collect, for each one, the names of the matching entries; in the fourth line I add the species names to the resulting list.
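The same grouping can also be written with split from base R. This is a different technique from the one in the answer, shown only as a compact sketch; the variable names gene_names and species are illustrative.

```r
genes <- list(
  gene1 = c("Panda", "Dog", "Chicken"),
  gene2 = c("Human", "Panda", "Dog"),
  gene3 = c("Human", "Panda", "Chicken")
)

# Repeat each gene name once per species it contains, so that
# gene_names lines up element by element with unlist(genes)
gene_names <- rep(names(genes), times = lengths(genes))

# Group the gene names by species; split() always returns a list
species <- split(gene_names, unlist(genes, use.names = FALSE))

# Number of genes found in each species
gene_counts <- lengths(species)
gene_counts
```

Unlike the loop over unique(genes.v), split orders the result by species name rather than by first appearance.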

Answer 3

To begin with, put the data in a list. That makes it easier to work with.

genes <- list(
    gene1 = c("Panda", "Dog", "Chicken"),
    gene2 = c("Human", "Panda", "Dog"),
    gene3 = c("Human", "Panda", "Chicken")
)

Then we can get the species names from there.

species <- unique(unlist(genes))

With this data:

> species
[1] "Panda"   "Dog"     "Chicken" "Human"

For each of these, we want to check if the name is contained in a gene. That is a job for Map (or its cousin lapply, but I like Map):

get_genes_for_species <- function(s) {
    contained <- unlist(Map(function(gene) s %in% gene, genes))
    names(genes)[contained]
}
genes_per_species <- Map(get_genes_for_species, species)

Now you have a list of character vectors, one per species, each containing the genes found in that species.

> genes_per_species
$Panda
[1] "gene1" "gene2" "gene3"

$Dog
[1] "gene1" "gene2"

$Chicken
[1] "gene1" "gene3"

$Human
[1] "gene2" "gene3"

Answer 4

You can try this.

gene  <-unique(c(gene1,gene2,gene3))
TF    <-data.frame(Species = gene)

TF$gene1 <- gene%in%gene1
TF$gene2 <- gene%in%gene2
TF$gene3 <- gene%in%gene3

> TF
  Species gene1 gene2 gene3
1   Panda  TRUE  TRUE  TRUE
2     Dog  TRUE  TRUE FALSE
3 Chicken  TRUE FALSE  TRUE
4   Human FALSE  TRUE  TRUE
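If the list form asked for in the question is needed, the TRUE/FALSE table can be converted back into one vector of gene names per species. The following is a rough sketch under that assumption; gene_cols, membership and genes_per_species are illustrative names, and the table is rebuilt here so the block runs on its own.

```r
gene1 <- c("Panda", "Dog", "Chicken")
gene2 <- c("Human", "Panda", "Dog")
gene3 <- c("Human", "Panda", "Chicken")

# Rebuild the presence/absence table from the answer
species <- unique(c(gene1, gene2, gene3))
TF <- data.frame(Species = species, stringsAsFactors = FALSE)
TF$gene1 <- species %in% gene1
TF$gene2 <- species %in% gene2
TF$gene3 <- species %in% gene3

# Logical matrix with one row per species and one column per gene
gene_cols <- setdiff(names(TF), "Species")
membership <- as.matrix(TF[gene_cols])

# For each row, keep the names of the columns that are TRUE
genes_per_species <- setNames(
  lapply(seq_len(nrow(membership)), function(i) gene_cols[membership[i, ]]),
  TF$Species
)

# Number of genes per species, in the same row order as TF
n_genes <- rowSums(membership)
```

The logical matrix is also convenient for counting, since rowSums gives the number of genes per species directly.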

Answer 5

Here's a variation that embraces the tidyverse and puts the result in a neat data frame.

The trick is to concatenate the results with str_c inside summarise.

   library(tidyverse)  # provides tibble, gather, group_by, summarise, str_c and %>%

   tibble(gene1 = gene1, 
          gene2 = gene2, 
          gene3 = gene3) %>% 
   gather(gene_name, gene_type) %>% 
   group_by(gene_type) %>% 
   summarise(genes = str_c(gene_name, collapse = ", "))

# A tibble: 4 x 2
  gene_type genes              
  <chr>     <chr>              
1 Chicken   gene1, gene3       
2 Dog       gene1, gene2       
3 Human     gene2, gene3       
4 Panda     gene1, gene2, gene3

I agree with Julius (above) that the best way to store gene vectors is in a list. A named list would be even better, as:

my_gene_list <- set_names(list(gene1, gene2, gene3), str_c("gene", 1:3) )

This would neatly produce the same result...

 my_gene_list %>% as_tibble() %>% 
   gather(gene_name, gene_type) %>% 
   group_by(gene_type) %>% 
   summarise(genes = str_c(gene_name, collapse = ", "))

# A tibble: 4 x 2
  gene_type genes              
  <chr>     <chr>              
1 Chicken   gene1, gene3       
2 Dog       gene1, gene2       
3 Human     gene2, gene3       
4 Panda     gene1, gene2, gene3

Answer 1: 8 (accepted), 2018-03-19 19:27:04
Answer 2: 3, 2018-03-19 19:25:57
Answer 3: 3, 2018-03-19 19:49:59
Answer 4: 1, 2018-03-19 19:23:02
Answer 5: 1, 2018-09-16 18:41:00