Is there a way to plot a correlation heatmap between two dataframes in R? The two dataframes have different row names and are of unequal dimensions
I have two different data frames, shown in the attached images: data frame 1 and data frame 2.
This is what I tried.
#First dataframe
structure(list(Label = c("Gene 1", "Gene 2", "Gene 3", "Gene 4",
"Gene 5", "Gene 6", "Gene 7", "Gene 8", "Gene 9", "Gene 10",
"Gene 11", "Gene 12", "Gene 13", "Gene 14", "Gene 15", "Gene 16",
"Gene 17", "Gene 18", "Gene 19", "Gene 20", "Gene 21", "Gene 22",
"Gene 23", "Gene 24", "Gene 25", "Gene 26", "Gene 27", "Gene 28",
"Gene 29", "Gene 30"), Count = c(1500, 1600, 1700, 1800, 1900,
2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 2900, 3000,
3100, 3200, 3300, 3400, 3500, 3600, 3700, 3800, 3900, 4000, 4100,
4200, 4300, 4400)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-30L))
library(readxl)  # read_excel()
library(tibble)  # column_to_rownames()
df_1 <- read_excel("Demo_data.xlsx", sheet = "Dataframe1")
str(df_1)
View(df_1)
df_1.1 <- column_to_rownames(df_1, 'Label')
View(df_1.1)
df_1.2 <- t(df_1.1)
View(df_1.2)
df_1.2 <- as.data.frame(df_1.2)
str(df_1.2)
typeof(df_1.2)
#Second dataframe
structure(list(Label = c("Control1", "Control2", "Control3",
"Control4", "Control5", "Control6", "Control7", "Control8", "Control9",
"Control10", "Control11", "Control12", "Control13", "Control14",
"Control15", "Control16", "Control17", "Control18", "Control19",
"Control20", "Control21", "Control22", "Control23", "Control24"
), Count = c(1800, 1400, 1110, 1900, 2500, 2900, 2100, 900, 5000,
2300, 700, 1400, 3400, 2310, 3322, 2200, 4400, 2100, 1000, 6700,
4300, 2120, 4800, 4300)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -24L))
df_2 <- read_excel("Demo_data.xlsx", sheet = "Dataframe2")
df_2.1 <- column_to_rownames(df_2, 'Label')
View(df_2.1)
df_2.1 <- t(df_2.1)
View(df_2.1)
df_2.1 <- as.data.frame(df_2.1)
str(df_2.1)
correlation <- cor(df_1.2, df_2.1)
View(correlation)
This is the output I want, but every correlation I get is NA. Any help is much appreciated.
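As noted in the comments, it is not clear what you are trying to achieve.
If you want to compute the correlation between the Count columns of the two data frames and visualize it with a scatter plot, you can use the following code: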
library(tidyverse)
df_1 <- structure(list(Label = c("Gene 1", "Gene 2", "Gene 3", "Gene 4",
"Gene 5", "Gene 6", "Gene 7", "Gene 8", "Gene 9", "Gene 10",
"Gene 11", "Gene 12", "Gene 13", "Gene 14", "Gene 15", "Gene 16",
"Gene 17", "Gene 18", "Gene 19", "Gene 20", "Gene 21", "Gene 22",
"Gene 23", "Gene 24", "Gene 25", "Gene 26", "Gene 27", "Gene 28",
"Gene 29", "Gene 30"),
Count = c(1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500,
2600, 2700, 2800, 2900, 3000, 3100, 3200, 3300, 3400, 3500, 3600,
3700, 3800, 3900, 4000, 4100, 4200, 4300, 4400)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L))
df_2 <- structure(list(Label = c("Control1", "Control2", "Control3",
"Control4", "Control5", "Control6", "Control7", "Control8", "Control9",
"Control10", "Control11", "Control12", "Control13", "Control14",
"Control15", "Control16", "Control17", "Control18", "Control19",
"Control20", "Control21", "Control22", "Control23", "Control24"),
Count = c(1800, 1400, 1110, 1900, 2500, 2900, 2100, 900, 5000, 2300, 700, 1400,
3400, 2310, 3322, 2200, 4400, 2100, 1000, 6700, 4300, 2120, 4800, 4300)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -24L))
dat = left_join(
df_1 %>% mutate(id=str_extract(Label, "\\d+")),
df_2 %>% mutate(id=str_extract(Label, "\\d+")),
by="id", suffix=c("_gene", "_ctl")
)
dat
#> # A tibble: 30 x 5
#> Label_gene Count_gene id Label_ctl Count_ctl
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 Gene 1 1500 1 Control1 1800
#> 2 Gene 2 1600 2 Control2 1400
#> 3 Gene 3 1700 3 Control3 1110
#> 4 Gene 4 1800 4 Control4 1900
#> 5 Gene 5 1900 5 Control5 2500
#> 6 Gene 6 2000 6 Control6 2900
#> 7 Gene 7 2100 7 Control7 2100
#> 8 Gene 8 2200 8 Control8 900
#> 9 Gene 9 2300 9 Control9 5000
#> 10 Gene 10 2400 10 Control10 2300
#> # ... with 20 more rows
cor(dat$Count_gene, dat$Count_ctl, use="pairwise.complete.obs")
#> [1] 0.5047392
ggplot(dat, aes(x=Count_gene, y=Count_ctl)) +
geom_point()
#> Warning: Removed 6 rows containing missing values (`geom_point()`).
Created on 2022-12-12 with reprex v2.0.2
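Basically, I extracted an id from each label (the number it contains, via str_extract()) and then merged the two data frames on that id with left_join().
This may look overly complicated, but keeping your data tidy in a single data frame is always a good idea.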
Note that in your example df_2 stops at id == 24, so the correlation is computed from only the 24 complete observations.
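To make the effect of the missing rows concrete, here is a minimal sketch with toy values (the vectors `gene_count` and `ctl_count` are made up for illustration and are not the data from the question). It shows that `use = "pairwise.complete.obs"` simply drops the pairs in which either value is NA:

```r
# Toy vectors: the last two "control" values are missing,
# as happens for the unmatched rows after a left_join().
gene_count <- c(1500, 1600, 1700, 1800, 1900)
ctl_count  <- c(1800, 1400, 1110, NA, NA)

# Number of pairs where both values are present
n_complete <- sum(complete.cases(gene_count, ctl_count))

# Correlation using only the complete pairs
r_pairwise <- cor(gene_count, ctl_count, use = "pairwise.complete.obs")

# Same result as subsetting to the complete pairs by hand
keep     <- complete.cases(gene_count, ctl_count)
r_manual <- cor(gene_count[keep], ctl_count[keep])
```

Here `n_complete` is 3, and `r_pairwise` and `r_manual` agree, which is the same mechanism that reduces the 30 joined rows to 24 usable pairs.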
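However, a correlation is computed between two vectors, so to get a heatmap you would need a set of several vectors, which you do not seem to have.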
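This is also the likely reason for the NAs in the question: after transposing, each column holds a single value, and a correlation cannot be computed from one observation. The sketch below uses simulated data (the matrices `genes` and `controls`, their sizes, and the random values are all assumptions made for illustration) to show what the input would need to look like, with several samples per gene and per control, and one way to draw the resulting matrix as a heatmap:

```r
library(tidyverse)

set.seed(1)  # simulated data only, so the example is reproducible

# Hypothetical input: 10 samples (rows) measured for 3 genes and 2 controls
genes    <- matrix(rnorm(30), nrow = 10,
                   dimnames = list(NULL, paste("Gene", 1:3)))
controls <- matrix(rnorm(20), nrow = 10,
                   dimnames = list(NULL, paste0("Control", 1:2)))

# One row per column, as in the question: every entry comes back NA
one_row <- cor(genes[1, , drop = FALSE], controls[1, , drop = FALSE])

# Several rows: a 3 x 2 matrix of gene-vs-control correlations
cor_mat <- cor(genes, controls)

# Reshape to long format (one row per gene/control pair) for ggplot2
cor_long <- as.data.frame(cor_mat) %>%
  rownames_to_column("gene") %>%
  pivot_longer(-gene, names_to = "control", values_to = "r")

# Draw the correlations as a heatmap
p <- ggplot(cor_long, aes(x = control, y = gene, fill = r)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1))
p
```

The two inputs may have different numbers of columns (genes versus controls), but they must share the same rows (samples); that shared dimension is what the correlation is computed over.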
For your next question, it would be great if you used the reprex package, as I did in this answer.