
Parsing venn table to create Venn Diagram in R

I have tables with values for Venn Diagrams which I am trying to read into R and parse in order to plot with the VennDiagram package. My tables look like this:

H3K27AC.bed  H3K4ME3.bed  gencode.bed  Total  Name
                          X            19184  gencode.bed
             X                         6843   H3K4ME3.bed
             X            X            3942   H3K4ME3.bed|gencode.bed
X                                      5097   H3K27AC.bed
X                         X            1262   H3K27AC.bed|gencode.bed
X            X                         4208   H3K27AC.bed|H3K4ME3.bed
X            X            X            9222   H3K27AC.bed|H3K4ME3.bed|gencode.bed

I can read the table in as a dataframe like this:

> venn_table_df<-read.table(venn_table_file,header = TRUE,sep = "\t",stringsAsFactors = FALSE)
> venn_table_df
  H3K27AC.bed H3K4ME3.bed gencode.bed Total                                Name
1                                   X 19184                         gencode.bed
2                       X              6843                         H3K4ME3.bed
3                       X           X  3942             H3K4ME3.bed|gencode.bed
4           X                          5097                         H3K27AC.bed
5           X                       X  1262             H3K27AC.bed|gencode.bed
6           X           X              4208             H3K27AC.bed|H3K4ME3.bed
7           X           X           X  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed

I can get the categories for the Venn diagram from the table like this:

> venn_categories<-colnames(venn_table_df)[!colnames(venn_table_df) %in% c("Total","Name")] 
> venn_categories
[1] "H3K27AC.bed" "H3K4ME3.bed" "gencode.bed"

And I can even make a summary table that is a bit easier to read, like this:

> venn_summary<-venn_table_df[!colnames(venn_table_df) %in% venn_categories]
> venn_summary
  Total                                Name
1 19184                         gencode.bed
2  6843                         H3K4ME3.bed
3  3942             H3K4ME3.bed|gencode.bed
4  5097                         H3K27AC.bed
5  1262             H3K27AC.bed|gencode.bed
6  4208             H3K27AC.bed|H3K4ME3.bed
7  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed

But what is stumping me is how to get the values out of the table and assign them correctly to the areas of the Venn diagram. For reference, a call to the triple Venn function looks like this:

n1<-5097
n2<-6843
n3<-19184

n12<-4208
n13<-1262
n23<-3942

n123<-9222

venn <-draw.triple.venn(area1=n1+n12+n13+n123,
                        area2=n2+n23+n12+n123,
                        area3=n3+n23+n13+n123,
                        n12=n12+n123,
                        n13=n13+n123,
                        n23=n23+n123,
                        n123=n123,
                        category=venn_categories,
                        fill=c('red','blue','green'),
                        alpha=c(rep(0.3,3)))

But this requires setting the values manually, which is not desirable since I have many of these data sets and also need to scale up to 4-way and 5-way Venn diagrams. How can I get R to find the correct values for each field in the Venn diagram? I have tried several methods using grep and grepl, and subsetting the dataframe for the rows that match the categories for each area of the plot, but none has worked correctly. Any suggestions? BTW, this data is output from the mergePeaks program in the HOMER software package.

I think I figured it out, using regular expressions to search the table for the correct entries for the plot. Here is the full workflow:

# load packages
library('VennDiagram')
library('gridExtra')

# read in the venn text
venn_table_df<-read.table(venn_table_file,header = TRUE,sep = "\t",stringsAsFactors = FALSE)
venn_table_df

It looks like this:

> venn_table_df
  H3K27AC.bed H3K4ME3.bed gencode.bed Total                                Name
1                                   X 19184                         gencode.bed
2                       X              6843                         H3K4ME3.bed
3                       X           X  3942             H3K4ME3.bed|gencode.bed
4           X                          5097                         H3K27AC.bed
5           X                       X  1262             H3K27AC.bed|gencode.bed
6           X           X              4208             H3K27AC.bed|H3K4ME3.bed
7           X           X           X  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed

> # recreate it with this btw
> dput(venn_table_df)
structure(list(H3K27AC.bed = c("", "", "", "X", "X", "X", "X"
), H3K4ME3.bed = c("", "X", "X", "", "", "X", "X"), gencode.bed = c("X", 
"", "X", "", "X", "", "X"), Total = c(19184L, 6843L, 3942L, 5097L, 
1262L, 4208L, 9222L), Name = c("gencode.bed", "H3K4ME3.bed", 
"H3K4ME3.bed|gencode.bed", "H3K27AC.bed", "H3K27AC.bed|gencode.bed", 
"H3K27AC.bed|H3K4ME3.bed", "H3K27AC.bed|H3K4ME3.bed|gencode.bed"
)), .Names = c("H3K27AC.bed", "H3K4ME3.bed", "gencode.bed", "Total", 
"Name"), class = "data.frame", row.names = c(NA, -7L))

Then parse the table:

# get the venn categories
venn_categories<-colnames(venn_table_df)[!colnames(venn_table_df) %in% c("Total","Name")] 


# make a summary table
venn_summary<-venn_table_df[!colnames(venn_table_df) %in% venn_categories]
venn_summary

# get the areas for the venn; add up all the overlaps that contain the given category 

# area1
area_n1<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# area2
area_n2<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[2],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# area3
area_n3<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# n12
area_n12<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[2],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# n13
area_n13<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# n23
area_n23<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[2],")","(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])


# n123
area_n123<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[2],")","(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])


venn <-draw.triple.venn(area1=area_n1,
                        area2=area_n2,
                        area3=area_n3,
                        n12=area_n12,
                        n13=area_n13,
                        n23=area_n23,
                        n123=area_n123,
                        category=venn_categories,
                        fill=c('red','blue','green'),
                        alpha=c(rep(0.3,3)))

The key was to use regular expressions to select only the table entries that include all of the categories for a given Venn area. This is a little more involved than I was hoping for, and it will require manual setup to adapt to four-way and five-way Venn diagrams, but it works so far. I am open to other suggestions that might simplify the process and scale up more easily.
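One possible way to avoid writing a separate line per area is to use the X columns directly instead of matching on the Name strings. The sketch below is not part of the original answer: overlap_size and areas are hypothetical names, and the table is rebuilt from the dput() output above (without the Name column, which is not needed here). Because it compares cells to "X" rather than building patterns, characters such as the "." in "H3K27AC.bed" are never interpreted as regex metacharacters.

```r
# Rebuild the example table from the dput() output shown earlier
venn_table_df <- data.frame(
  H3K27AC.bed = c("", "", "", "X", "X", "X", "X"),
  H3K4ME3.bed = c("", "X", "X", "", "", "X", "X"),
  gencode.bed = c("X", "", "X", "", "X", "", "X"),
  Total = c(19184L, 6843L, 3942L, 5097L, 1262L, 4208L, 9222L),
  stringsAsFactors = FALSE
)

venn_categories <- setdiff(colnames(venn_table_df), c("Total", "Name"))

# Logical matrix: TRUE where a row's region lies inside the given set
membership <- venn_table_df[venn_categories] == "X"
membership[is.na(membership)] <- FALSE

# Hypothetical helper: total size of the intersection of the named sets,
# i.e. the sum over every row that is marked in all of them
overlap_size <- function(sets) {
  in_all <- rowSums(membership[, sets, drop = FALSE]) == length(sets)
  as.numeric(sum(venn_table_df$Total[in_all]))
}

# Every non-empty combination of categories: singles, then pairs, and so on
combos <- unlist(
  lapply(seq_along(venn_categories),
         function(k) combn(venn_categories, k, simplify = FALSE)),
  recursive = FALSE
)

areas <- vapply(combos, overlap_size, numeric(1))
names(areas) <- vapply(combos, paste, character(1), collapse = "&")
areas
```

For this table the seven values come out in the order area1, area2, area3, n12, n13, n23, n123, so they can be passed to draw.triple.venn in place of the hand-written area_n* variables. The same loop produces 15 values for four sets and 31 for five; the ordering appears to follow the argument order of draw.quad.venn and draw.quintuple.venn, but that is worth checking against the function signatures before relying on it.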

In case someone finds this useful, there is now a very straightforward procedure to get these numbers into an approximately proportional Venn diagram. One way to create a diagram with the nVennR package is from scratch. As explained in the vignette, the values for each region are entered in a particular order, which happens to be the same as in your table. The only difference is that nVennR expects one more value at the beginning, corresponding to the external region (this value should be 0, but it will be ignored anyway). This makes the procedure very easy:

> vt <- read.table('clipboard', header = TRUE)
> vt
  H3K27AC.bed H3K4ME3.bed gencode.bed Total                                Name
1           0           0           X 19184                         gencode.bed
2           0           X           0  6843                         H3K4ME3.bed
3           0           X           X  3942             H3K4ME3.bed|gencode.bed
4           X           0           0  5097                         H3K27AC.bed
5           X           0           X  1262             H3K27AC.bed|gencode.bed
6           X           X           0  4208             H3K27AC.bed|H3K4ME3.bed
7           X           X           X  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed
> myV <- createVennObj(nSets = 3, sNames = c('H3K27Ac', 'H3K4ME3', 'gencode'), sSizes = c(0, vt$Total))
> vp <- plotVenn(nVennObj = myV)

And the result: [image: result]

Another advantage of this procedure is that it scales to a larger number of groups.
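The approach above depends on the rows already being in the order nVennR expects. As a hedged sketch (not from the original answer), the region sizes can be placed by position computed from the X columns, so a table with shuffled or missing rows still lines up. It assumes the ordering described above, where the first set is the most significant bit of a binary region index; region_index and sSizes are illustrative names, and the convention should be confirmed against the nVennR vignette.

```r
# Rebuild the example table from the dput() output shown earlier
venn_table_df <- data.frame(
  H3K27AC.bed = c("", "", "", "X", "X", "X", "X"),
  H3K4ME3.bed = c("", "X", "X", "", "", "X", "X"),
  gencode.bed = c("X", "", "X", "", "X", "", "X"),
  Total = c(19184L, 6843L, 3942L, 5097L, 1262L, 4208L, 9222L),
  stringsAsFactors = FALSE
)

venn_categories <- setdiff(colnames(venn_table_df), c("Total", "Name"))
n_sets <- length(venn_categories)

# 1 where a row's region lies inside the set, 0 otherwise
membership <- (venn_table_df[venn_categories] == "X") + 0

# Assumed convention: first set = most significant bit, last set = least
weights <- 2^((n_sets - 1):0)
region_index <- as.vector(membership %*% weights)

# One slot per region, including index 0 for the external region;
# regions absent from the table keep a size of 0
sSizes <- numeric(2^n_sets)
sSizes[region_index + 1] <- venn_table_df$Total
sSizes
```

For the example table this gives the same vector as c(0, vt$Total), which could then be passed as the sSizes argument of createVennObj as in the transcript above.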
