How do I call a data frame by Gene IDs from RNAseq data in R when gene names are duplicated?

Question

I have a .csv file that contains gene names in the 1st column and "Transcripts per million" counts for each gene per patient in subsequent columns. There are 56,632 genes that were read and it appears there are a number of duplicate gene IDs. A sample of my data matrix is below:

Gene_ID     UniIEC01    UniIEC02    UniIEC03    UniIEC04    UniIEC05
TSPAN6        1.45        1.30        1.53        1.35        1.50
TNMD         -2.00       -2.00       -2.00       -2.00        0.29
DPM1          0.76        1.06        1.37        0.90        1.26
SCYL3        -0.43        0.67        0.43        0.71        0.23
C1orf112     -0.43        0.18        0.14        0.74        0.06
FGR          -2.00       -2.00       -2.00        0.29       -2.00
CFH          -2.00       -0.92       -2.00       -0.42       -2.00

I tried the following things for "read.table" and had the following issues:

(1) Manually added a column of numbers and assigned "row.names" to that column. PROBLEM: I am then unable to call data by gene name. I have some lists of 200+ genes that I'd like to call, and it would be too laborious to find the row number for EACH of these.

(2) When reading in the table, I set "row.names = NULL", which gave the proper format. PROBLEM: When I tried a practice call using

data.frame["TSPAN6":"TNMD", 1:5]

I got the error message: "NAs introduced by coercion" and all of the cells except for patient number come back as "NA".

Can anyone help me with this problem please?

My ultimate goal is to create a heatmap using specific sets of genes out of the 56,632.

Thank you!

Avantika

Answer 1

You can get your desired genes via:

gene_list <- c('CNTF', 'CFH', 'TSPAN6')
df[df$Gene_ID %in% gene_list, ]
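
For context, the attempt in the question fails because `:` builds numeric sequences, so the character names get coerced to NA. Below is a minimal, self-contained sketch of the `%in%` approach on a toy table. The values are taken from the sample in the question, except for the second TSPAN6 row, which is made up here purely to show how duplicated IDs behave.

```r
# Toy data frame based on the sample rows in the question.
# The second TSPAN6 row is hypothetical, added only to illustrate duplicates.
df <- data.frame(
  Gene_ID  = c("TSPAN6", "TNMD", "DPM1", "CFH", "TSPAN6"),
  UniIEC01 = c(1.45, -2.00, 0.76, -2.00, 1.10),
  UniIEC02 = c(1.30, -2.00, 1.06, -0.92, 1.20),
  stringsAsFactors = FALSE
)

gene_list <- c("CNTF", "CFH", "TSPAN6")

# %in% gives one TRUE/FALSE per row, so every row with a matching ID is kept,
# including both TSPAN6 rows
subset_df <- df[df$Gene_ID %in% gene_list, ]

# Genes that were requested but are not present in the table
missing_genes <- setdiff(gene_list, df$Gene_ID)

print(subset_df)
print(missing_genes)
```

Checking `missing_genes` is worth doing with lists of 200+ genes, since `%in%` silently ignores names that are not in the table.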

heatmap.2() from the gplots package is one of the more popular ways of making heatmaps.
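
As a rough sketch of how the subset could be passed to heatmap.2(), assuming the gplots package is installed and using a few sample rows from the question: heatmap.2() expects a numeric matrix, so the gene column is moved into the row names first. The `trace = "none"` and `scale = "none"` settings are just one reasonable choice, not a recommendation for your data.

```r
# Toy data frame using sample rows from the question
df <- data.frame(
  Gene_ID  = c("TSPAN6", "TNMD", "DPM1", "SCYL3"),
  UniIEC01 = c(1.45, -2.00, 0.76, -0.43),
  UniIEC02 = c(1.30, -2.00, 1.06, 0.67),
  UniIEC03 = c(1.53, -2.00, 1.37, 0.43),
  UniIEC04 = c(1.35, -2.00, 0.90, 0.71),
  UniIEC05 = c(1.50, 0.29, 1.26, 0.23),
  stringsAsFactors = FALSE
)

gene_list <- c("TSPAN6", "TNMD", "DPM1", "SCYL3")
sub <- df[df$Gene_ID %in% gene_list, ]

# Numeric matrix of expression values; drop the Gene_ID column
mat <- as.matrix(sub[, -1])

# make.unique() appends .1, .2, ... to repeated names, so duplicated
# gene IDs (none in this toy example) stay distinguishable in the plot
rownames(mat) <- make.unique(sub$Gene_ID)

# Draw the heatmap only if gplots is available
if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::heatmap.2(mat, trace = "none", scale = "none")
}
```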

All that being said, you should probably go back and figure out why you have duplicated gene names. I'm guessing it's multiple isoforms per gene. In that case, you need to recompute your transcripts per million from the raw counts if you want to quantify at the gene level. That question is not really one for Stack Overflow, though. Try biostars.org to ask how to recompute these values.
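
To see which gene names are duplicated before deciding what to do about them, something along these lines may help. This is a sketch on a made-up table; the column name `Gene_ID` matches the question, but the values are illustrative only.

```r
# Hypothetical table with two duplicated gene IDs
df <- data.frame(
  Gene_ID  = c("TSPAN6", "TNMD", "DPM1", "TSPAN6", "CFH", "CFH"),
  UniIEC01 = c(1.45, -2.00, 0.76, 1.10, -2.00, -0.50),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each ID;
# unique() collapses them to one entry per duplicated gene
dup_ids <- unique(df$Gene_ID[duplicated(df$Gene_ID)])

# All rows belonging to a duplicated gene, grouped by name for inspection
dup_rows <- df[df$Gene_ID %in% dup_ids, ]
dup_rows <- dup_rows[order(dup_rows$Gene_ID), ]

print(dup_ids)
print(dup_rows)

# Number of rows per duplicated gene
print(table(df$Gene_ID)[dup_ids])
```

Looking at the duplicated rows side by side should make it clearer whether they are isoforms of the same gene or something else, such as the same symbol assigned to different loci.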
