I am working with DNA methylation data from a microarray. Each 'probe' in the array can have multiple genes associated with it, and each gene can contain multiple probes. Here is a short example:
|probe | P.Value| adj.P.Val| Dbeta|UCSC_REFGENE_NAME |
|:----------|-------:|---------:|----------:|:--------------------------|
|cg23516680 | 2e-07| 0.0003419| -0.0172609|LYST |
|cg02390624 | 2e-07| 0.0003419| 0.0170831|SYTL2;SYTL2;SYTL2 |
|cg08808720 | 2e-07| 0.0003424| -0.0129818|KIF5C;MIR1978 |
|cg12074090 | 2e-07| 0.0003300| -0.0169523|ANGPT2;ANGPT2;ANGPT2;MCPH1 |
|cg10376100 | 1e-07| 0.0002714| 0.0172562|LYST;MIR1537 |
What I'd like to do is make groups according to ANY of the gene names that appear in the UCSC_REFGENE_NAME column (e.g. one group would be all probes associated with the gene LYST, and another all probes associated with MIR1537).
Points:
Suggestions?
Expanding on @thelatemail's comment, you can use tidyr::separate_rows to create one row for each individual entry in the UCSC_REFGENE_NAME column. Next you can remove the duplicate entries with dplyr::distinct.
library(dplyr)
library(tidyr)
df %>%
  separate_rows(UCSC_REFGENE_NAME, sep = ";") %>%
  distinct()
#> probe P.Value adj.P.Val Dbeta UCSC_REFGENE_NAME
#> 1 cg23516680 2e-07 0.0003419 -0.0172609 LYST
#> 2 cg02390624 2e-07 0.0003419 0.0170831 SYTL2
#> 3 cg08808720 2e-07 0.0003424 -0.0129818 KIF5C
#> 4 cg08808720 2e-07 0.0003424 -0.0129818 MIR1978
#> 5 cg12074090 2e-07 0.0003300 -0.0169523 ANGPT2
#> 6 cg12074090 2e-07 0.0003300 -0.0169523 MCPH1
#> 7 cg10376100 1e-07 0.0002714 0.0172562 LYST
#> 8 cg10376100 1e-07 0.0002714 0.0172562 MIR1537
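Once the table is in this long format, the per-gene groups asked for in the question can be formed directly. The following is a minimal sketch, not part of the original answer: it rebuilds a cut-down copy of the example table (only the probe, Dbeta and UCSC_REFGENE_NAME columns) so that it runs on its own, and the names long, per_gene and probes_by_gene are illustrative. It assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)
library(tidyr)

# Cut-down stand-in for the example table (assumption: only three columns kept)
df <- data.frame(
  probe = c("cg23516680", "cg02390624", "cg08808720", "cg12074090", "cg10376100"),
  Dbeta = c(-0.0172609, 0.0170831, -0.0129818, -0.0169523, 0.0172562),
  UCSC_REFGENE_NAME = c("LYST", "SYTL2;SYTL2;SYTL2", "KIF5C;MIR1978",
                        "ANGPT2;ANGPT2;ANGPT2;MCPH1", "LYST;MIR1537"),
  stringsAsFactors = FALSE
)

# One row per unique probe/gene pair
long <- df %>%
  separate_rows(UCSC_REFGENE_NAME, sep = ";") %>%
  distinct()

# One summary row per gene: number of probes, their IDs, and the mean Dbeta
per_gene <- long %>%
  group_by(UCSC_REFGENE_NAME) %>%
  summarise(n_probes   = n(),
            probes     = paste(probe, collapse = ","),
            mean_Dbeta = mean(Dbeta),
            .groups    = "drop")

# Alternatively, a named list with the probe IDs for each gene
probes_by_gene <- split(long$probe, long$UCSC_REFGENE_NAME)
probes_by_gene[["LYST"]]
```

A probe annotated with several genes appears in every one of those groups, so probes_by_gene[["LYST"]] holds both cg23516680 and cg10376100, while cg10376100 also appears under MIR1537.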
Data used
txt = " |probe | P.Value| adj.P.Val| Dbeta|UCSC_REFGENE_NAME |
|cg23516680 | 2e-07| 0.0003419| -0.0172609|LYST |
|cg02390624 | 2e-07| 0.0003419| 0.0170831|SYTL2;SYTL2;SYTL2 |
|cg08808720 | 2e-07| 0.0003424| -0.0129818|KIF5C;MIR1978 |
|cg12074090 | 2e-07| 0.0003300| -0.0169523|ANGPT2;ANGPT2;ANGPT2;MCPH1 |
|cg10376100 | 1e-07| 0.0002714| 0.0172562|LYST;MIR1537 |"
df <- read.table(text = stringr::str_replace_all(txt, "\\|", " "),
                 header = TRUE, stringsAsFactors = FALSE)