
Converting data.frame into a matrix for expression data

I'm new to R, but I'm getting dangerous. I want to make a large gene expression line chart from about 2000 genes that were monitored after drug treatment. After loading it from a CSV file, my data frame looks like this:

head(tmp)
  gene_symbol   untreated   X1hr.avg   X3hr.avg    X6hr.avg  X24hr.avg
1      ERRFI1  0.16612478 -2.0758630 -2.5892085 -2.02039809 -2.4124696
2      ERRFI1  0.27750147 -2.3086333 -3.0538376 -4.01436186 -4.7491462
3     CTDSPL2  0.13172411 -0.7920983 -0.3580963 -0.76213664 -0.8171385
4     CTDSPL2 -0.05205203 -0.9551288 -0.2072265 -0.76993891 -1.0028680
5     SLC26A2  0.20268100  0.5188266  0.5429924  0.01970562 -1.1955852
6     SLC29A4  0.19658238 -0.8102461 -0.9019243 -1.50714838 -1.4648872

I would like to transform this dataframe into something like this:

gene_symbol  ratio       treatment
ERRFI1       0.16612478  untreated
ERRFI1       -2.0758630  X1hr.avg 
ERRFI1       -2.5892085  X3hr.avg
ERRFI1       -2.02039809 X6hr.avg
ERRFI1       -2.4124696  X24hr.avg

etc...

This would allow me to plot via ggplot:

ggplot(data=tmp, aes(x=factor(treatment), y=ratio, group=gene_symbol)) + geom_line() + geom_point()

What you're looking for is the melt() function from the reshape2 package. I reused your variable name, but I would suggest storing the melted data under a different name so the original wide data frame is not overwritten.

tmp <- as.data.frame(read.table(text="gene_symbol   untreated   X1hr.avg   X3hr.avg    X6hr.avg  X24hr.avg
                            1      ERRFI1  0.16612478 -2.0758630 -2.5892085 -2.02039809 -2.4124696
                            2      ERRFI1  0.27750147 -2.3086333 -3.0538376 -4.01436186 -4.7491462
                            3     CTDSPL2  0.13172411 -0.7920983 -0.3580963 -0.76213664 -0.8171385
                            4     CTDSPL2 -0.05205203 -0.9551288 -0.2072265 -0.76993891 -1.0028680
                            5     SLC26A2  0.20268100  0.5188266  0.5429924  0.01970562 -1.1955852
                            6     SLC29A4  0.19658238 -0.8102461 -0.9019243 -1.50714838 -1.4648872", header=TRUE))

library(reshape2)
library(ggplot2)  # needed for the ggplot() call below

tmp <- melt(data=tmp, id.vars=c("gene_symbol"))
names(tmp) <- sub("variable", "treatment", names(tmp))
names(tmp) <- sub("value", "ratio", names(tmp))

ggplot(data=tmp, aes(x=factor(treatment), y=ratio, group=gene_symbol)) + geom_line(aes(colour=gene_symbol)) + geom_point()    
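As a small variation on the above, melt() can name the new columns directly through its variable.name and value.name arguments, which avoids the two sub() calls. This is only a sketch, assuming reshape2 is installed; `wide` and `long` are illustrative names, and the data is a three-row subset of the table in the question:

```r
library(reshape2)

# Three-row subset of the question's data; `wide` is an illustrative name.
wide <- data.frame(
  gene_symbol = c("ERRFI1", "CTDSPL2", "SLC26A2"),
  untreated   = c(0.16612478, 0.13172411, 0.20268100),
  X1hr.avg    = c(-2.0758630, -0.7920983, 0.5188266),
  X3hr.avg    = c(-2.5892085, -0.3580963, 0.5429924)
)

# Name the melted columns up front instead of renaming them afterwards.
long <- melt(wide, id.vars = "gene_symbol",
             variable.name = "treatment", value.name = "ratio")

str(long)
```

Keeping the result in `long` rather than overwriting `wide` means the original table is still available if the reshaping needs to be redone.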

[Image: your output]

I'm not sure this is a useful way to present this type of data, though. You might want to rethink what exactly your goal is.

What you're really doing is "stacking" your variables, so you can also use the ... stack function.

out <- data.frame(tmp[1], stack(tmp[-1]))

You'll get a warning, but it is only a warning, not an error. It just tells you that the output has new row names.

Here are the first and last few rows of the resulting "stacked" data.frame:

> head(out)
  gene_symbol      values       ind
1      ERRFI1  0.16612478 untreated
2      ERRFI1  0.27750147 untreated
3     CTDSPL2  0.13172411 untreated
4     CTDSPL2 -0.05205203 untreated
5     SLC26A2  0.20268100 untreated
6     SLC29A4  0.19658238 untreated
> tail(out)
   gene_symbol     values       ind
25      ERRFI1 -2.4124696 X24hr.avg
26      ERRFI1 -4.7491462 X24hr.avg
27     CTDSPL2 -0.8171385 X24hr.avg
28     CTDSPL2 -1.0028680 X24hr.avg
29     SLC26A2 -1.1955852 X24hr.avg
30     SLC29A4 -1.4648872 X24hr.avg 
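
If the warning is a nuisance, one option is to repeat the gene symbols explicitly so that both pieces already have the same number of rows, and to rename `values` and `ind` to the column names asked for in the question. This is a sketch in base R only; `wide`, `stacked` and `long` are illustrative names, and the data is a subset of the table in the question:

```r
# Subset of the question's data; `wide` is an illustrative name.
wide <- data.frame(
  gene_symbol = c("ERRFI1", "ERRFI1", "CTDSPL2"),
  untreated   = c(0.16612478, 0.27750147, 0.13172411),
  X1hr.avg    = c(-2.0758630, -2.3086333, -0.7920983),
  X24hr.avg   = c(-2.4124696, -4.7491462, -0.8171385)
)

# stack() returns a data frame with columns `values` and `ind`.
stacked <- stack(wide[-1])

long <- data.frame(
  # Repeat the symbols once per stacked column, so nothing is recycled.
  gene_symbol = rep(wide$gene_symbol, times = ncol(wide) - 1),
  ratio       = stacked$values,
  # Set the level order explicitly so the x axis follows the column order.
  treatment   = factor(as.character(stacked$ind), levels = names(wide)[-1])
)

head(long)
```

Setting the factor levels by hand keeps the treatments in time order on the x axis rather than relying on whatever order stack() happens to produce.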
