
How do I call a data frame by Gene IDs from RNAseq data in R when gene names are duplicated?

I have a .csv file that contains gene names in the first column and "Transcripts per million" values for each gene, per patient, in the subsequent columns. 56,632 genes were read, and there appear to be a number of duplicate gene IDs. A sample of my data matrix is below:

Gene_ID     UniIEC01    UniIEC02    UniIEC03    UniIEC04    UniIEC05
TSPAN6        1.45        1.30        1.53        1.35        1.50
TNMD         -2.00       -2.00       -2.00       -2.00        0.29
DPM1          0.76        1.06        1.37        0.90        1.26
SCYL3        -0.43        0.67        0.43        0.71        0.23
C1orf112     -0.43        0.18        0.14        0.74        0.06
FGR          -2.00       -2.00       -2.00        0.29       -2.00
CFH          -2.00       -0.92       -2.00       -0.42       -2.00

I tried the following with "read.table" and ran into these issues:

(1) Manually added a column of numbers and assigned "row.names" to that column. PROBLEM: I am then unable to call data by gene name. I have some lists of 200+ genes that I'd like to call, and it would be too laborious to find the row number for each of them.

(2) When reading in the table, I set "row.names = NULL", which gave the proper format. PROBLEM: When I tried a practice call using

data.frame["TSPAN6":"TNMD",1:5]

I got the error message "NAs introduced by coercion", and all of the cells except for the patient number came back as "NA".

Can anyone help me with this problem, please?

My ultimate goal is to create a heatmap using specific sets of genes out of the 56,632.

Thank you!

Avantika

You can get your desired genes via:

gene_list <- c('CNTF', 'CFH', 'TSPAN6')
df[df$Gene_ID %in% gene_list, ]
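As for why the attempt in the question failed: the `:` operator builds numeric sequences, so `"TSPAN6":"TNMD"` tries to coerce the gene names to numbers, which is where the coercion warning comes from. Below is a minimal sketch of the `%in%` approach on a toy data frame. The values are loosely based on the sample in the question, the second TSPAN6 row is a made-up duplicate, and `subset_df` and `missing_genes` are names chosen only for illustration.

```r
# Toy data mirroring the question's layout; the second TSPAN6 row is an
# invented duplicate, added only to show how duplicates behave.
df <- data.frame(
  Gene_ID  = c("TSPAN6", "TNMD", "DPM1", "CFH", "TSPAN6"),
  UniIEC01 = c(1.45, -2.00, 0.76, -2.00, 1.10),
  UniIEC02 = c(1.30, -2.00, 1.06, -0.92, 1.20),
  stringsAsFactors = FALSE
)

gene_list <- c("CNTF", "CFH", "TSPAN6")

# %in% yields one TRUE/FALSE per row, so every copy of a duplicated name is kept
subset_df <- df[df$Gene_ID %in% gene_list, ]

# Requested names that do not occur in the data at all
missing_genes <- setdiff(gene_list, df$Gene_ID)

print(subset_df)
print(missing_genes)
```

Because the gene names stay in an ordinary column rather than in `row.names`, duplicates are not a problem for this kind of lookup, and checking `missing_genes` guards against typos in a 200+ gene list.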

heatmap.2() from the gplots package is one of the more popular ways of making heatmaps.
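A rough sketch of going from a gene subset to a heatmap might look like the following. It assumes the gplots package may or may not be installed and falls back to base R's `heatmap()` if it is not. The toy values, the `mat` and `sub` names, and the `margins` setting are illustrative choices, not anything prescribed by the package.

```r
# Toy data in the same shape as the question's table (values are illustrative)
df <- data.frame(
  Gene_ID  = c("TSPAN6", "DPM1", "SCYL3", "CFH"),
  UniIEC01 = c(1.45, 0.76, -0.43, -2.00),
  UniIEC02 = c(1.30, 1.06, 0.67, -0.92),
  UniIEC03 = c(1.53, 1.37, 0.43, -2.00),
  stringsAsFactors = FALSE
)
gene_list <- c("TSPAN6", "CFH", "DPM1")
sub <- df[df$Gene_ID %in% gene_list, ]

# Heatmap functions expect a numeric matrix; matrix row names are used as
# labels, and make.unique() keeps duplicated gene names distinguishable
mat <- as.matrix(sub[, -1])
rownames(mat) <- make.unique(sub$Gene_ID)

if (requireNamespace("gplots", quietly = TRUE)) {
  # trace = "none" switches off the trace lines drawn over the cells
  gplots::heatmap.2(mat, scale = "none", trace = "none", margins = c(8, 8))
} else {
  # Base R fallback when gplots is unavailable
  heatmap(mat, scale = "none", margins = c(8, 8))
}
```

With real data, whether to set `scale = "row"` depends on whether you want to compare patterns across patients rather than absolute levels; genes that are constant across all patients would need to be dropped first in that case.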

All that being said, you should probably go back and figure out why you have duplicated gene names. I'm guessing it's multiple isoforms per gene. In that case, you need to recompute your transcripts per million from the raw counts if you want to quantify at the gene level. But that problem is not one for Stack Overflow. Try biostars.org to ask how to recompute these values.
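As a starting point for that investigation, a small sketch like this one lists which gene names are duplicated and shows their rows side by side. The data frame is a toy example with invented duplicates, and `name_counts`, `dup_names` and `dup_rows` are hypothetical names.

```r
# Toy data with invented duplicate rows for TSPAN6 and TNMD
df <- data.frame(
  Gene_ID  = c("TSPAN6", "TNMD", "TSPAN6", "CFH", "TNMD", "DPM1"),
  UniIEC01 = c(1.45, -2.00, 1.10, -2.00, -1.50, 0.76),
  stringsAsFactors = FALSE
)

# Count the rows per gene name and keep the names that occur more than once
name_counts <- table(df$Gene_ID)
dup_names <- names(name_counts[name_counts > 1])

# Pull every row for those names and sort so the copies sit next to each other
dup_rows <- df[df$Gene_ID %in% dup_names, ]
dup_rows <- dup_rows[order(dup_rows$Gene_ID), ]

print(dup_names)
print(dup_rows)
```

If the copies of a gene carry different values, that would be consistent with the multiple-isoform guess; identical copies would point more toward a duplication introduced when the file was assembled.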

