Reshape data in R

I want to build a matrix from this data frame. Each entry should be 1 if the data frame contains a relation between that pair of genes, and 0 otherwise. So ADRA1D and ADK would have the value 1, and so would every other listed pair. There is no pair of ADK and AR, however, so that entry of the matrix should be 0.

tab <- read.table(text="ID  gene1   gene2
1   ADRA1D  ADK
2   ADRA1B  ADK
3   ADRA1A  ADK
4   ADRB1   ASIC1
5   ADRB1   ADK
6   ADRB2   ASIC1
7   ADRB2   ADK
8   AGTR1   ACHE
9   AGTR1   ADK
10  ALOX5   ADRB1
11  ALOX5   ADRB2
12  ALPPL2  ADRB1 
13  ALPPL2  ADRB2
14  AMY2A   AGTR1
15  AR  ADORA1
16  AR  ADRA1D
17  AR  ADRA1B
18  AR  ADRA1A
19  AR  ADRA2A
20  AR  ADRA2B", header=TRUE, stringsAsFactors=FALSE)

My main goal is to build a phylogenetic tree, so I was thinking of starting from a matrix like that. How can I use the reshape library for this, since I have no value column?

The code below does not work:

library(reshape)
ct=cast(tab,gene1~gene2)

If it is not mandatory to use reshape, I'd suggest taking a look at igraph. Here's one way to get a square adjacency matrix using the igraph package. We first convert the two relevant columns of your data frame into an igraph object, and then get.adjacency does the rest. Note that graph.data.frame builds a directed graph by default, so each pair is recorded only in the gene1-to-gene2 direction; pass directed = FALSE if you need a symmetric matrix.

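library(igraph)
# Build a graph from the two gene columns; each row becomes one directed edge.
g <- graph.data.frame(tab[,c(2,3)])
# Entry [i, j] is 1 when the data has a row with gene1 = i and gene2 = j.
get.adjacency(g)

This gives you the adjacency matrix. You should definitely look into using igraph for the rest of your analysis.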

16 x 16 sparse Matrix of class "dgCMatrix"
   [[ suppressing 16 column names ‘ADRA1D’, ‘ADRA1B’, ‘ADRA1A’ ... ]]

ADRA1D . . . . . . . . . . 1 . . . . .
ADRA1B . . . . . . . . . . 1 . . . . .
ADRA1A . . . . . . . . . . 1 . . . . .
ADRB1  . . . . . . . . . . 1 1 . . . .
ADRB2  . . . . . . . . . . 1 1 . . . .
AGTR1  . . . . . . . . . . 1 . 1 . . .
ALOX5  . . . 1 1 . . . . . . . . . . .
ALPPL2 . . . 1 1 . . . . . . . . . . .
AMY2A  . . . . . 1 . . . . . . . . . .
AR     1 1 1 . . . . . . . . . . 1 1 1
ADK    . . . . . . . . . . . . . . . .
ASIC1  . . . . . . . . . . . . . . . .
ACHE   . . . . . . . . . . . . . . . .
ADORA1 . . . . . . . . . . . . . . . .
ADRA2A . . . . . . . . . . . . . . . .
ADRA2B . . . . . . . . . . . . . . . .
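The matrix above is directed: ADRA1D has a 1 in the ADK column, but ADK has nothing in the ADRA1D column. If the relation is meant to be unordered, a minimal sketch along these lines might help. It assumes a reasonably recent igraph, where the same functions are available under the names graph_from_data_frame() and as_adjacency_matrix(), and it uses a small hypothetical subset of the pairs (the `edges` data frame) rather than the full `tab`.

```r
library(igraph)

# A small subset of the gene pairs from the question, for illustration only.
edges <- data.frame(
  gene1 = c("ADRA1D", "ADRB1", "ADRB1", "AR"),
  gene2 = c("ADK",    "ASIC1", "ADK",   "ADRA1D"),
  stringsAsFactors = FALSE
)

# directed = FALSE treats each pair as an unordered relation.
g <- graph_from_data_frame(edges, directed = FALSE)

# sparse = FALSE returns an ordinary dense matrix instead of a dgCMatrix.
adj <- as_adjacency_matrix(g, sparse = FALSE)

isSymmetric(adj)       # TRUE
adj["ADK", "ADRA1D"]   # 1: the pair is listed (as ADRA1D, ADK)
adj["ADK", "AR"]       # 0: no such pair
```

With the full data, replacing `edges` by `tab[, c("gene1", "gene2")]` should give the 16 x 16 symmetric version.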

An advantage of using igraph is that many graph-based distance calculations become available to you. Do look into shortest.paths.
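As a rough illustration of the distance idea, here is a sketch using distances(), which is the name recent igraph versions use for shortest.paths. The `edges` data frame is again a small hypothetical subset of the question's pairs, and treating the graph as undirected is my assumption, not something the question states.

```r
library(igraph)

# A small subset of the gene pairs from the question, for illustration only.
edges <- data.frame(
  gene1 = c("ADRA1D", "ADRB1", "ADRB1", "AR"),
  gene2 = c("ADK",    "ASIC1", "ADK",   "ADRA1D"),
  stringsAsFactors = FALSE
)

# Assumption: relations are unordered, so build an undirected graph.
g <- graph_from_data_frame(edges, directed = FALSE)

# Matrix of shortest-path lengths (number of edges) between all gene pairs.
d <- distances(g)

d["ADRA1D", "ADRB1"]   # 2: ADRA1D - ADK - ADRB1
d["AR", "ASIC1"]       # 4: AR - ADRA1D - ADK - ADRB1 - ASIC1
```

Genes in different connected components get a distance of Inf, so such entries would need handling before passing the matrix to something like hclust(as.dist(d)) to draw a tree.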

You can achieve this with the table function:

> table(tab$gene1, tab$gene2)

         ACHE ADK ADORA1 ADRA1A ADRA1B ADRA1D ADRA2A ADRA2B ADRB1 ADRB2 AGTR1 ASIC1
  ADRA1A    0   1      0      0      0      0      0      0     0     0     0     0
  ADRA1B    0   1      0      0      0      0      0      0     0     0     0     0
  ADRA1D    0   1      0      0      0      0      0      0     0     0     0     0
  ADRB1     0   1      0      0      0      0      0      0     0     0     0     1
  ADRB2     0   1      0      0      0      0      0      0     0     0     0     1
  AGTR1     1   1      0      0      0      0      0      0     0     0     0     0
  ALOX5     0   0      0      0      0      0      0      0     1     1     0     0
  ALPPL2    0   0      0      0      0      0      0      0     1     1     0     0
  AMY2A     0   0      0      0      0      0      0      0     0     0     1     0
  AR        0   0      1      1      1      1      1      1     0     0     0     0

Use as.matrix if you want a matrix structure.
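One thing to watch for: table counts occurrences, so a pair listed twice would show up as 2 rather than 1. A possible way to get a plain 0/1 matrix is sketched below. The use of unclass() to drop the table class is my choice here, not part of the answer, and `edges` is a hypothetical subset of the data with one pair repeated on purpose.

```r
# Subset of the question's pairs; the last pair is duplicated deliberately.
edges <- data.frame(
  gene1 = c("ADRA1D", "ADRB1", "ADRB1", "AR",     "AR"),
  gene2 = c("ADK",    "ASIC1", "ADK",   "ADRA1D", "ADRA1D"),
  stringsAsFactors = FALSE
)

tb <- table(edges$gene1, edges$gene2)
tb["AR", "ADRA1D"]     # 2: table counts the repeated pair twice

# Drop the "table" class, then cap every positive count at 1.
m <- (unclass(tb) > 0) * 1L

m["AR", "ADRA1D"]      # 1
m["AR", "ADK"]         # 0
```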

EDIT: For a square matrix with the same genes on both axes.

To get the same genes on both axes when you use table, the two arguments need the same set of values. (Strictly speaking the result is square rather than symmetric, since each pair is still counted only in the gene1-to-gene2 direction.) Here the values are character strings rather than factors, so there are no levels as such, but the principle is the same: every unique gene must occur at least once in gene1 and at least once in gene2.

For that, I suggest creating a vector with all your genes (I used sort(unique(c(unique(tab$gene1), unique(tab$gene2))))).

I merged "gene1" with this vector, keeping every gene even when it has no match (all = TRUE), so unmatched genes get NA in the other columns instead of being dropped. The same goes for "gene2".

Now every gene occurs at least once in both "gene1" and "gene2", and you can call table.

genes <- c('ACHE','ADK','ADORA1','ADRA1A','ADRA1B','ADRA1D','ADRA2A','ADRA2B','ADRB1','ADRB2','AGTR1','ALOX5','ALPPL2','AMY2A','AR','ASIC1')

df <- merge(tab, as.data.frame(genes), by.x = "gene1", by.y = "genes", all = TRUE)
df <- merge(df, as.data.frame(genes), by.x = "gene2", by.y = "genes", all = TRUE)

> table(df$gene1, df$gene2)

         ACHE ADK ADORA1 ADRA1A ADRA1B ADRA1D ADRA2A ADRA2B ADRB1 ADRB2 AGTR1 ALOX5 ALPPL2 AMY2A AR ASIC1
  ACHE      0   0      0      0      0      0      0      0     0     0     0     0      0     0  0     0
  ADK       0   0      0      0      0      0      0      0     0     0     0     0      0     0  0     0
  ADORA1    0   0      0      0      0      0      0      0     0     0     0     0      0     0  0     0
  ADRA1A    0   1      0      0      0      0      0      0     0     0     0     0      0     0  0     0
  ADRA1B    0   1      0      0      0      0      0      0     0     0     0     0      0     0  0     0
  ADRA1D    0   1      0      0      0      0      0      0     0     0     0     0      0     0  0     0
  ADRA2A    0   0      0      0      0      0      0      0     0     0     0     0      0     0  0     0
  ADRA2B    0   0      0      0      0      0      0      0     0     0     0     0      0     0  0     0
  ADRB1     0   1      0      0      0      0      0      0     0     0     0     0      0     0  0     1
  ADRB2     0   1      0      0      0      0      0      0     0     0     0     0      0     0  0     1
  AGTR1     1   1      0      0      0      0      0      0     0     0     0     0      0     0  0     0
  ALOX5     0   0      0      0      0      0      0      0     1     1     0     0      0     0  0     0
  ALPPL2    0   0      0      0      0      0      0      0     1     1     0     0      0     0  0     0
  AMY2A     0   0      0      0      0      0      0      0     0     0     1     0      0     0  0     0
  AR        0   0      1      1      1      1      1      1     0     0     0     0      0     0  0     0
  ASIC1     0   0      0      0      0      0      0      0     0     0     0     0      0     0  0     0
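A shorter route to the same square table, offered only as a sketch, is to turn both columns into factors that share one set of levels, which avoids the two merge calls. The last step, which mirrors the matrix so that both (A, B) and (B, A) are 1, assumes the relation is unordered. `edges` is a hypothetical subset of the question's data.

```r
# A small subset of the gene pairs from the question, for illustration only.
edges <- data.frame(
  gene1 = c("ADRA1D", "ADRB1", "ADRB1", "AR"),
  gene2 = c("ADK",    "ASIC1", "ADK",   "ADRA1D"),
  stringsAsFactors = FALSE
)

# All genes that appear in either column.
genes <- sort(unique(c(edges$gene1, edges$gene2)))

# Shared factor levels force the same labels on rows and columns.
tb <- table(factor(edges$gene1, levels = genes),
            factor(edges$gene2, levels = genes))

# Plain matrix, then mirror it: 1 if the pair occurs in either order.
m   <- unclass(tb)
sym <- ((m + t(m)) > 0) * 1L

dim(sym)               # 5 5
isSymmetric(sym)       # TRUE
sym["ADK", "ADRA1D"]   # 1
sym["ADK", "AR"]       # 0
```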

Hope this helps, although it is probably not the best way to do it.
