Is there a way to plot a correlation heatmap between two data frames in R? The two data frames have different row names and unequal dimensions

I have two different data frames, as shown in the attached images (Dataframe 1 and Dataframe 2).

This is what I tried.

library(readxl)   # read_excel()
library(tibble)   # column_to_rownames()

# First data frame (dput output of df_1)
structure(list(Label = c("Gene 1", "Gene 2", "Gene 3", "Gene 4", 
"Gene 5", "Gene 6", "Gene 7", "Gene 8", "Gene 9", "Gene 10", 
"Gene 11", "Gene 12", "Gene 13", "Gene 14", "Gene 15", "Gene 16", 
"Gene 17", "Gene 18", "Gene 19", "Gene 20", "Gene 21", "Gene 22", 
"Gene 23", "Gene 24", "Gene 25", "Gene 26", "Gene 27", "Gene 28", 
"Gene 29", "Gene 30"), Count = c(1500, 1600, 1700, 1800, 1900, 
2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 2900, 3000, 
3100, 3200, 3300, 3400, 3500, 3600, 3700, 3800, 3900, 4000, 4100, 
4200, 4300, 4400)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-30L))

df_1 <- read_excel("Demo_data.xlsx", sheet = "Dataframe1")
str(df_1)
View(df_1)

df_1.1 <- column_to_rownames(df_1, 'Label')
View(df_1.1)

df_1.2 <- t(df_1.1)
View(df_1.2)

df_1.2 <- as.data.frame(df_1.2)
str(df_1.2)

typeof(df_1.2)
str(df_1.2)


# Second data frame (dput output of df_2)
structure(list(Label = c("Control1", "Control2", "Control3", 
"Control4", "Control5", "Control6", "Control7", "Control8", "Control9", 
"Control10", "Control11", "Control12", "Control13", "Control14", 
"Control15", "Control16", "Control17", "Control18", "Control19", 
"Control20", "Control21", "Control22", "Control23", "Control24"
), Count = c(1800, 1400, 1110, 1900, 2500, 2900, 2100, 900, 5000, 
2300, 700, 1400, 3400, 2310, 3322, 2200, 4400, 2100, 1000, 6700, 
4300, 2120, 4800, 4300)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -24L))


df_2 <- read_excel("Demo_data.xlsx", sheet = "Dataframe2")

df_2.1 <- column_to_rownames(df_2, 'Label')
View(df_2.1)

df_2.1 <- t(df_2.1)
View(df_2.1)

df_2.1 <- as.data.frame(df_2.1)
str(df_2.1)

correlation <- cor(df_1.2, df_2.1)
View(correlation)

This is the output I want, but every correlation I get is NA. Any help is much appreciated.

Desired output (no NAs)

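As noted in the comments, it is not clear what you are trying to achieve.

If you want to compute the correlation between the Count columns of the two data frames and visualize it with a scatter plot, you can use the following code: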

library(tidyverse)

df_1 <- structure(list(Label = c("Gene 1", "Gene 2", "Gene 3", "Gene 4", 
                                 "Gene 5", "Gene 6", "Gene 7", "Gene 8", "Gene 9", "Gene 10", 
                                 "Gene 11", "Gene 12", "Gene 13", "Gene 14", "Gene 15", "Gene 16", 
                                 "Gene 17", "Gene 18", "Gene 19", "Gene 20", "Gene 21", "Gene 22", 
                                 "Gene 23", "Gene 24", "Gene 25", "Gene 26", "Gene 27", "Gene 28", 
                                 "Gene 29", "Gene 30"), 
                       Count = c(1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 
                                 2600, 2700, 2800, 2900, 3000, 3100, 3200, 3300, 3400, 3500, 3600, 
                                 3700, 3800, 3900, 4000, 4100, 4200, 4300, 4400)), 
                  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L))


df_2 <- structure(list(Label = c("Control1", "Control2", "Control3", 
                                 "Control4", "Control5", "Control6", "Control7", "Control8", "Control9", 
                                 "Control10", "Control11", "Control12", "Control13", "Control14", 
                                 "Control15", "Control16", "Control17", "Control18", "Control19", 
                                 "Control20", "Control21", "Control22", "Control23", "Control24"), 
                       Count = c(1800, 1400, 1110, 1900, 2500, 2900, 2100, 900, 5000, 2300, 700, 1400, 
                                 3400, 2310, 3322, 2200, 4400, 2100, 1000, 6700, 4300, 2120, 4800, 4300)), 
                  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -24L))

dat = left_join(
  df_1 %>% mutate(id=str_extract(Label, "\\d+")),
  df_2 %>% mutate(id=str_extract(Label, "\\d+")), 
  by="id", suffix=c("_gene", "_ctl")
)

dat
#> # A tibble: 30 x 5
#>    Label_gene Count_gene id    Label_ctl Count_ctl
#>    <chr>           <dbl> <chr> <chr>         <dbl>
#>  1 Gene 1           1500 1     Control1       1800
#>  2 Gene 2           1600 2     Control2       1400
#>  3 Gene 3           1700 3     Control3       1110
#>  4 Gene 4           1800 4     Control4       1900
#>  5 Gene 5           1900 5     Control5       2500
#>  6 Gene 6           2000 6     Control6       2900
#>  7 Gene 7           2100 7     Control7       2100
#>  8 Gene 8           2200 8     Control8        900
#>  9 Gene 9           2300 9     Control9       5000
#> 10 Gene 10          2400 10    Control10      2300
#> # ... with 20 more rows

cor(dat$Count_gene, dat$Count_ctl, use="pairwise.complete.obs")
#> [1] 0.5047392

ggplot(dat, aes(x=Count_gene, y=Count_ctl)) + 
  geom_point()
#> Warning: Removed 6 rows containing missing values (`geom_point()`).

Created on 2022-12-12 with reprex v2.0.2

Basically, I extract the id as the number at the end of each label, then merge the data frames with left_join().
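
To make the id extraction concrete, here is a small sketch of what `str_extract(Label, "\\d+")` returns for labels shaped like the ones in the question. It pulls out the first run of digits in each string, which for these labels is the trailing number. The vector name `labels` is made up for illustration.

```r
library(stringr)

# Hypothetical labels in the same shape as the question's data
labels <- c("Gene 1", "Gene 10", "Control1", "Control24")

# "\\d+" matches the first run of one or more digits in each string
ids <- str_extract(labels, "\\d+")
ids
#> [1] "1"  "10" "1"  "24"
```

The ids come back as character strings, which is fine for the join because both sides are extracted the same way.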

This may look overly complicated, but keeping your data tidy in a single data frame is always a good idea.

Note that in your example, df_2 stops at id == 24, so the correlation is computed from only the 24 complete observations.
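
If you want to check that count yourself, a minimal sketch along these lines should work. It rebuilds the same two data frames in a more compact form (the `seq()` and `paste()` shortcuts are mine, not part of your data), repeats the join, and counts the rows where both counts are present. The names `n_complete` and `unmatched` are only illustrative.

```r
library(tidyverse)

# Same values as the question's data, rebuilt compactly
df_1 <- tibble(Label = paste("Gene", 1:30), Count = seq(1500, 4400, by = 100))
df_2 <- tibble(
  Label = paste0("Control", 1:24),
  Count = c(1800, 1400, 1110, 1900, 2500, 2900, 2100, 900, 5000, 2300, 700, 1400,
            3400, 2310, 3322, 2200, 4400, 2100, 1000, 6700, 4300, 2120, 4800, 4300)
)

# Join on the number extracted from each label
dat <- left_join(
  df_1 %>% mutate(id = str_extract(Label, "\\d+")),
  df_2 %>% mutate(id = str_extract(Label, "\\d+")),
  by = "id", suffix = c("_gene", "_ctl")
)

# Rows where both counts are present
n_complete <- sum(complete.cases(dat$Count_gene, dat$Count_ctl))
n_complete
#> [1] 24

# Genes that have no matching control (these are dropped by cor() and ggplot)
unmatched <- dat %>% filter(is.na(Count_ctl)) %>% pull(Label_gene)
unmatched
#> [1] "Gene 25" "Gene 26" "Gene 27" "Gene 28" "Gene 29" "Gene 30"
```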

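However, a correlation is computed between two vectors, so to get a heatmap you would need a set of many vectors, which you do not seem to have.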
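
This is also why `cor(df_1.2, df_2.1)` gives only NA: after `t()`, every column holds a single value, and a correlation needs at least two paired observations. The sketch below first reproduces that with two tiny one-row data frames, then shows what a heatmap would need, using simulated data. Everything in the second part (12 samples, 5 genes, 3 controls, the random values) is an assumption for illustration and not your data.

```r
# Part 1: one observation per column, as in the transposed data frames
one_row_genes <- data.frame(`Gene 1` = 1500, `Gene 2` = 1600, check.names = FALSE)
one_row_ctl   <- data.frame(Control1 = 1800, Control2 = 1400)
na_result <- cor(one_row_genes, one_row_ctl)  # every entry is NA

# Part 2: hypothetical data with several samples (rows) per variable (column)
set.seed(1)
n_samples <- 12  # assumed number of samples measured for both sets
genes <- matrix(rnorm(n_samples * 5), nrow = n_samples,
                dimnames = list(NULL, paste("Gene", 1:5)))
controls <- matrix(rnorm(n_samples * 3), nrow = n_samples,
                   dimnames = list(NULL, paste0("Control", 1:3)))

# 5 x 3 matrix: one correlation per gene/control pair
cor_mat <- cor(genes, controls)
dim(cor_mat)

# Base-R heatmap of the correlation matrix, without clustering or rescaling
heatmap(cor_mat, Rowv = NA, Colv = NA, scale = "none")
```

The two matrices can have different numbers of columns, as your data frames do, but they must share the same rows (samples) for `cor()` to pair the observations.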

For your next question, it would be great if you used the reprex package, as I did in this answer.
