
Parsing venn table to create Venn Diagram in R

I have tables with values for Venn diagrams which I am trying to read into R and parse in order to plot with the VennDiagram package. My tables look like this:

H3K27AC.bed H3K4ME3.bed gencode.bed Total   Name
        X   19184   gencode.bed
    X       6843    H3K4ME3.bed
    X   X   3942    H3K4ME3.bed|gencode.bed
X           5097    H3K27AC.bed
X       X   1262    H3K27AC.bed|gencode.bed
X   X       4208    H3K27AC.bed|H3K4ME3.bed
X   X   X   9222    H3K27AC.bed|H3K4ME3.bed|gencode.bed

I can read the table in as a dataframe like this:

> venn_table_df<-read.table(venn_table_file,header = TRUE,sep = "\t",stringsAsFactors = FALSE)
> venn_table_df
  H3K27AC.bed H3K4ME3.bed gencode.bed Total                                Name
1                                   X 19184                         gencode.bed
2                       X              6843                         H3K4ME3.bed
3                       X           X  3942             H3K4ME3.bed|gencode.bed
4           X                          5097                         H3K27AC.bed
5           X                       X  1262             H3K27AC.bed|gencode.bed
6           X           X              4208             H3K27AC.bed|H3K4ME3.bed
7           X           X           X  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed

I can get the categories for the Venn diagram from the table like this:

> venn_categories<-colnames(venn_table_df)[!colnames(venn_table_df) %in% c("Total","Name")] 
> venn_categories
[1] "H3K27AC.bed" "H3K4ME3.bed" "gencode.bed"

And I can even make a summary table that is a bit easier to read, like this:

> venn_summary<-venn_table_df[!colnames(venn_table_df) %in% venn_categories]
> venn_summary
  Total                                Name
1 19184                         gencode.bed
2  6843                         H3K4ME3.bed
3  3942             H3K4ME3.bed|gencode.bed
4  5097                         H3K27AC.bed
5  1262             H3K27AC.bed|gencode.bed
6  4208             H3K27AC.bed|H3K4ME3.bed
7  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed

But what is stumping me is how to get the values out of the table and assign them correctly to the areas for the Venn diagram. For reference, a call to the triple venn function looks like this:

n1<-5097
n2<-6843
n3<-19184

n12<-4208
n13<-1262
n23<-3942

n123<-9222

venn <-draw.triple.venn(area1=n1+n12+n13+n123,
                        area2=n2+n23+n12+n123,
                        area3=n3+n23+n13+n123,
                        n12=n12+n123,
                        n13=n13+n123,
                        n23=n23+n123,
                        n123=n123,
                        category=venn_categories,
                        fill=c('red','blue','green'),
                        alpha=c(rep(0.3,3)))

But obviously this requires setting the values manually, which is not desirable since I have many of these data sets, and I also need to scale it up to 4-way and 5-way Venn diagrams. How can I get R to find the correct values for each field in the Venn diagram? I have tried several methods using grep, grepl, and subsetting the dataframe for the rows that match the categories for each area of the plot, but this has not worked correctly. Any suggestions? BTW this data is output from the HOMER software package's mergePeaks program.

I think I figured it out, using regular expressions to search the table for the correct entries for the plot. Here is the full workflow:

# load packages
library('VennDiagram')
library('gridExtra')

# read in the venn text
venn_table_df<-read.table(venn_table_file,header = TRUE,sep = "\t",stringsAsFactors = FALSE)
venn_table_df

It looks like this:

> venn_table_df
  H3K27AC.bed H3K4ME3.bed gencode.bed Total                                Name
1                                   X 19184                         gencode.bed
2                       X              6843                         H3K4ME3.bed
3                       X           X  3942             H3K4ME3.bed|gencode.bed
4           X                          5097                         H3K27AC.bed
5           X                       X  1262             H3K27AC.bed|gencode.bed
6           X           X              4208             H3K27AC.bed|H3K4ME3.bed
7           X           X           X  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed

> # recreate it with this btw
> dput(venn_table_df)
structure(list(H3K27AC.bed = c("", "", "", "X", "X", "X", "X"
), H3K4ME3.bed = c("", "X", "X", "", "", "X", "X"), gencode.bed = c("X", 
"", "X", "", "X", "", "X"), Total = c(19184L, 6843L, 3942L, 5097L, 
1262L, 4208L, 9222L), Name = c("gencode.bed", "H3K4ME3.bed", 
"H3K4ME3.bed|gencode.bed", "H3K27AC.bed", "H3K27AC.bed|gencode.bed", 
"H3K27AC.bed|H3K4ME3.bed", "H3K27AC.bed|H3K4ME3.bed|gencode.bed"
)), .Names = c("H3K27AC.bed", "H3K4ME3.bed", "gencode.bed", "Total", 
"Name"), class = "data.frame", row.names = c(NA, -7L))

Then parse the table:

# get the venn categories
venn_categories<-colnames(venn_table_df)[!colnames(venn_table_df) %in% c("Total","Name")] 


# make a summary table
venn_summary<-venn_table_df[!colnames(venn_table_df) %in% venn_categories]
venn_summary

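# get the areas for the venn; add up all the overlaps that contain the given category
# each lookahead (?=.*name) requires that name to appear somewhere in the Name column;
# note that the "." in the category names is not escaped, so it matches any character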

# area1
area_n1<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# area2
area_n2<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[2],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# area3
area_n3<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# n12
area_n12<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[2],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# n13
area_n13<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# n23
area_n23<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[2],")","(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])


# n123
area_n123<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[2],")","(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])


venn <-draw.triple.venn(area1=area_n1,
                        area2=area_n2,
                        area3=area_n3,
                        n12=area_n12,
                        n13=area_n13,
                        n23=area_n23,
                        n123=area_n123,
                        category=venn_categories,
                        fill=c('red','blue','green'),
                        alpha=c(rep(0.3,3)))

The key was to use regular expressions to get only the table entries that include all of the categories for a given Venn area. This is a little more involved than I was hoping for, and will require manual setup to adapt to four-way and five-way Venn diagrams, but it works so far. I am open to other suggestions that might simplify the process and scale up more easily.
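One possible way to avoid the manual setup is to skip the regular expressions and read set membership directly from the "X" columns, then loop over every combination of sets with combn(). The following is only a sketch under a few assumptions: venn_areas is a hypothetical helper name (not part of VennDiagram), the membership columns hold "X" for members, and the result names (area1, n12, n123, ...) are chosen to mirror the argument names used in the draw.triple.venn call above.

```r
# Example data, copied from the dput() output above (Name column omitted)
venn_table_df <- data.frame(
  H3K27AC.bed = c("", "", "", "X", "X", "X", "X"),
  H3K4ME3.bed = c("", "X", "X", "", "", "X", "X"),
  gencode.bed = c("X", "", "X", "", "X", "", "X"),
  Total = c(19184L, 6843L, 3942L, 5097L, 1262L, 4208L, 9222L),
  stringsAsFactors = FALSE
)
venn_categories <- c("H3K27AC.bed", "H3K4ME3.bed", "gencode.bed")

# Hypothetical helper: for every combination of sets, sum the Total of all
# rows (regions) that belong to every set in that combination
venn_areas <- function(df, categories) {
  # logical matrix: TRUE where a region (row) is part of a set (column)
  membership <- as.matrix(df[, categories, drop = FALSE]) == "X"
  membership[is.na(membership)] <- FALSE
  areas <- list()
  for (k in seq_along(categories)) {
    for (idx in combn(seq_along(categories), k, simplify = FALSE)) {
      in_all <- rowSums(membership[, idx, drop = FALSE]) == length(idx)
      # single sets are named area1, area2, ...; overlaps n12, n13, n123, ...
      arg_name <- if (k == 1) paste0("area", idx) else paste0("n", paste(idx, collapse = ""))
      areas[[arg_name]] <- sum(df$Total[in_all])
    }
  }
  areas
}

areas <- venn_areas(venn_table_df, venn_categories)
str(areas)
```

If VennDiagram is loaded, something like do.call(draw.triple.venn, c(areas, list(category = venn_categories))) should give the same plot as the manual version. The same naming pattern appears to match the arguments of draw.quad.venn and draw.quintuple.venn, but check the package documentation before relying on that.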

In case someone finds this useful, there is now a very straightforward procedure to get these numbers into an approximately proportional Venn diagram. One of the ways to create a diagram with the nVennR package is from scratch. As explained in the vignette, the values for each region are entered in a particular order, which happens to be the same as in your table. The only difference is that nVennR expects one more value at the beginning, corresponding to the external region (this value should be 0, but it will be ignored anyway). This makes the procedure very easy:

> vt <- read.table('clipboard', header = T)
> vt
  H3K27AC.bed H3K4ME3.bed gencode.bed Total                                Name
1           0           0           X 19184                         gencode.bed
2           0           X           0  6843                         H3K4ME3.bed
3           0           X           X  3942             H3K4ME3.bed|gencode.bed
4           X           0           0  5097                         H3K27AC.bed
5           X           0           X  1262             H3K27AC.bed|gencode.bed
6           X           X           0  4208             H3K27AC.bed|H3K4ME3.bed
7           X           X           X  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed
> myV <- createVennObj(nSets = 3, sNames = c('H3K27Ac', 'H3K4ME3', 'gencode'), sSizes = c(0, vt$Total))
> vp <- plotVenn(nVennObj = myV)

And the result: (image of the resulting diagram not preserved)

Another advantage of this procedure is that it is scalable to a larger number of groups.
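For tables that are not already sorted, or that are missing some regions, the region order can be checked before building sSizes. The sketch below does not call nVennR itself. It assumes, based only on the statement above that the table order matches, that each region's position is the binary number formed by its membership flags, with the first set as the most significant bit. The variable names are made up for illustration, so verify the ordering against the nVennR vignette.

```r
# Same example table as above, with "X" marking membership
vt <- data.frame(
  H3K27AC.bed = c("", "", "", "X", "X", "X", "X"),
  H3K4ME3.bed = c("", "X", "X", "", "", "X", "X"),
  gencode.bed = c("X", "", "X", "", "X", "", "X"),
  Total = c(19184L, 6843L, 3942L, 5097L, 1262L, 4208L, 9222L),
  stringsAsFactors = FALSE
)
set_cols <- c("H3K27AC.bed", "H3K4ME3.bed", "gencode.bed")

# 1 where a region belongs to a set, 0 otherwise
member <- (as.matrix(vt[, set_cols]) == "X") + 0

# Assumed bit weights: first set = most significant bit (4, 2, 1 for 3 sets)
weights <- 2^rev(seq_along(set_cols) - 1)
region_index <- as.vector(member %*% weights)

# One slot per region, including the external region at position 1;
# regions absent from the table stay at 0
sSizes <- numeric(2^length(set_cols))
sSizes[region_index + 1] <- vt$Total

print(region_index)
print(sSizes)
```

For this table region_index comes out as 1 through 7, so sSizes is the same as c(0, vt$Total) used in the createVennObj call above.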
