
Preparing a data frame in R for a heatmap with ggplot2

I'm currently trying to create a heatmap of some genetic data. The columns are labeled s1, s2, s3, etc., but I also have a .txt file that has the correct corresponding label for each sample. I'm not sure whether I need to first modify the csv file that holds the gene expression levels, or whether I can add the labels separately to the data frame I'm preparing, which will eventually be made into a heatmap. I'm also not sure exactly what the format of the data frame should be. I would like to use ggplot2 to create the heatmap, if that matters.

Here's my code so far:

library(ggplot2)
library(dplyr)
library(magrittr) 

nci <- read.csv('/Users/myname/Desktop/ML Extra Credit/nci.data.csv')
nci.label <- scan(url("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/nci.label"), what = "")

# Build a random example matrix (10 genes x 20 samples)
mat <- matrix(rexp(200, rate=.1), ncol=20)
rownames(mat) <- paste0('gene',1:nrow(mat))
colnames(mat) <- paste0('sample',1:ncol(mat))
mat[1:5,1:5]

It prints the first five rows and columns of a sample matrix, which look like this:

        sample1   sample2    sample3   sample4   sample5
gene1 32.278434 16.678512  0.4637713  1.016569  3.353944
gene2  8.719729 11.080337  1.5254223  2.392519  3.503191
gene3  2.199697 18.846487 13.6525699 34.963664  2.511097
gene4  5.860673  2.160185  3.5243884  6.785453  3.947606
gene5 16.363688 38.543575  5.6761373 10.142018 22.481752

Any help would be greatly appreciated!!

You will want to get your data frame into "long" format to facilitate plotting. This is what's called tidy data, and it forms the basis for preparing data to be plotted with ggplot2.

The general idea here is that you need one column for the x value, one column for the y value, and one column to represent the value used for the tile color. There are lots of ways to do this (see melt(), pivot_longer(), ...), but I like to use tidyr::gather(). Since your genes are stored as row names rather than in a column, I'm first creating a gene column in your dataset.

library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1234)

# create matrix
mat <- matrix(rexp(200, rate=.1), ncol=20)
rownames(mat) <- paste0('gene',1:nrow(mat))
colnames(mat) <- paste0('sample',1:ncol(mat))
mat[1:5,1:5]

# convert to data.frame and gather
mat <- as.data.frame(mat)
mat$gene <- rownames(mat)
mat <- mat %>% gather(key='sample', value='value', -gene)
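
The same reshaping can also be done with pivot_longer(), which is mentioned above as an alternative to gather(). The following is a minimal sketch rather than a definitive implementation; it rebuilds the toy matrix so it runs on its own, and the names `wide` and `long` are chosen here purely for illustration.

```r
library(tidyr)

set.seed(1234)

# Same toy matrix as in the example: 10 genes (rows) by 20 samples (columns)
mat <- matrix(rexp(200, rate = 0.1), ncol = 20)
rownames(mat) <- paste0("gene", 1:nrow(mat))
colnames(mat) <- paste0("sample", 1:ncol(mat))

# Move the row names into a regular column so they survive the reshape
wide <- as.data.frame(mat)
wide$gene <- rownames(wide)

# Every column except `gene` becomes one (sample, value) pair per row
long <- pivot_longer(wide, cols = -gene, names_to = "sample", values_to = "value")

dim(long)  # 200 rows (10 genes x 20 samples), 3 columns
```

The result has the same three columns (gene, sample, value) that the gather() call produces, so it should work with the same plotting code.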

The ggplot call is pretty easy. We map the columns to the x, y, and fill aesthetics, then use geom_tile() to create the actual heatmap.

ggplot(mat, aes(sample, gene)) + geom_tile(aes(fill=value))
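
One thing to watch with the call above: ggplot2 places character values on a discrete axis in alphabetical order, so sample10 ends up before sample2. A possible fix, sketched here on a small made-up data set (the data frame `long`, its size, and the colour choices are assumptions for illustration), is to convert the columns to factors with explicit levels before plotting.

```r
library(ggplot2)

set.seed(1234)

# Toy long-format data: 3 genes x 12 samples (made-up values)
long <- expand.grid(
  gene = paste0("gene", 1:3),
  sample = paste0("sample", 1:12),
  stringsAsFactors = FALSE
)
long$value <- rexp(nrow(long), rate = 0.1)

# Fix the axis order explicitly instead of relying on alphabetical sorting
long$sample <- factor(long$sample, levels = paste0("sample", 1:12))
long$gene <- factor(long$gene, levels = paste0("gene", 3:1))  # gene1 at the top

p <- ggplot(long, aes(x = sample, y = gene, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
```

The rotated x-axis text is optional; it only helps when there are many sample labels.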

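
As for the sample labels kept in a separate file: the csv does not need to be edited first. One possible approach is to overwrite the column names in R before reshaping. The sketch below uses a toy matrix and a made-up `labels` vector standing in for the contents of the label file; it assumes there is exactly one label per sample column and that the labels are in the same order as the columns, which should be checked against the real data.

```r
set.seed(1234)

# Toy expression matrix: 4 genes x 6 samples with placeholder names s1..s6
mat <- matrix(rexp(24, rate = 0.1), ncol = 6)
rownames(mat) <- paste0("gene", 1:4)
colnames(mat) <- paste0("s", 1:6)

# Made-up labels standing in for the label file (assumed to be in column order)
labels <- c("CNS", "CNS", "RENAL", "BREAST", "CNS", "RENAL")

# Stop early if the number of labels does not match the number of columns
stopifnot(length(labels) == ncol(mat))

# Labels can repeat, so make them unique to keep the columns distinguishable
colnames(mat) <- make.unique(labels, sep = "_")

colnames(mat)
```

If the real csv carries an extra index column, it would need to be dropped or excluded before the length check passes.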
