I'm attempting to pre-process three sets of data relating to microarray experiments. Each data set comes from a CSV file and is a translation table for gene data. There is a common column (foreign key), GeneID, present in all three data frames. It is assumed, but not confirmed, that each GeneID value is present in all of the data files.
An example from the data:
Data 1: Data 2: Data 3:
ID GeneID ; HID GeneID ; SNP_locusID GeneID
rs243 7093 ; 3 34 ; rs852 10151
rs790 3778 ; 3 11364 ; rs853 10151
rs791 3778 ; 5 37 ; rs854 10151
rs818 7093 ; 5 11370 ; rs856 10151
rs855 10151 ; 6 38 ; rs872 10539
rs856 10151 ; 10 10151 ; rs907 221037
rs907 221037 ; 7 90 ; rs916 55747
rs916 55747 ; 7 10151 ; rs916 387680
rs916 387680 ; 9 6442 ; rs941 414308
rs941 414308 ; 9 20391 ; rs778 55747
There are potentially many-to-many, one-to-many or many-to-one relations between GeneID, HID and SNP_locusID. The largest of the CSV files has circa 1,000,000 rows but execution speed is not a critical consideration here.
In order to select an appropriate way to deal with the duplicated values, I'm trying to create a single, comprehensive table that shows each GeneID with its corresponding ID, HID and SNP_locusID values, i.e.
GeneID ID HID SNP_locusID
10151 rs855 10 rs852
10151 rs856 7 rs853
10151 NA NA rs854
etc. The next step would then be to find each duplicated value of GeneID and remove the duplicated rows, in order to have a single unique GeneID per row.
I've tried using sqldf, but it doesn't seem to support a full outer join, which I'm assuming is what I'll need to create the desired output (my SQL knowledge is very basic, so advice appreciated!). I've also tried analysing each data file individually first, to find the duplicate GeneIDs, via
data1[duplicated(data1[, 'GeneID']),]
and then trying to merge the data sets, but I'm not sure whether this is the best approach to consolidating the GeneIDs down to a single GeneID per row.
EDIT: thanks Martin and Hans, here are the results of dput. The Data 1 output is now corrected, too.
> dput(data1)
structure(list(ID = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 8L, 9L), .Label = c("rs243", "rs790", "rs791", "rs818",
"rs855", "rs856", "rs907", "rs916", "rs941"), class = "factor"),
GeneID = c(7093L, 3778L, 3778L, 7093L, 10151L, 10151L, 221037L,
55747L, 387680L, 414308L)), .Names = c("ID", "GeneID"
), class = "data.frame", row.names = c(NA, -10L))
> dput(data2)
structure(list(HID = c(3L, 3L, 5L, 5L, 6L, 10L, 7L, 7L, 9L, 9L
), GeneID = c(34L, 11364L, 37L, 11370L, 38L, 10151L, 90L, 10151L,
6442L, 20391L)), .Names = c("HID", "GeneID"), class = "data.frame", row.names = c(NA,
-10L))
> dput(data3)
structure(list(SNP_locusID = structure(c(2L, 3L, 4L, 5L, 6L,
7L, 8L, 8L, 9L, 1L), .Label = c("rs778", "rs852", "rs853", "rs854",
"rs856", "rs872", "rs907", "rs916", "rs941"), class = "factor"),
GeneID = c(10151L, 10151L, 10151L, 10151L, 10539L, 221037L,
55747L, 387680L, 414308L, 55747L)), .Names = c("SNP_locusID",
"GeneID"), class = "data.frame", row.names = c(NA, -10L))
I think you can use plyr::join, which is pretty quick:
library(plyr)  # library() stops with an error if plyr is not installed
all_genes <- join(data1, data2, by = "GeneID", type = "full")
all_genes <- join(all_genes, data3, by = "GeneID", type = "full")
> all_genes
ID GeneID HID SNP_locusID
1 rs243 7093 NA <NA>
2 rs790 3778 NA <NA>
3 rs791 3778 NA <NA>
4 rs818 7093 NA <NA>
5 rs855 10151 10 rs852
6 rs855 10151 10 rs853
7 rs855 10151 10 rs854
8 rs855 10151 10 rs856
9 rs855 10151 7 rs852
10 rs855 10151 7 rs853
11 rs855 10151 7 rs854
12 rs855 10151 7 rs856
13 rs856 10151 10 rs852
14 rs856 10151 10 rs853
15 rs856 10151 10 rs854
16 rs856 10151 10 rs856
17 rs856 10151 7 rs852
18 rs856 10151 7 rs853
19 rs856 10151 7 rs854
20 rs856 10151 7 rs856
21 rs907 221037 NA rs907
22 rs916 55747 NA rs916
23 rs916 55747 NA rs778
24 rs916 387680 NA rs916
25 rs941 414308 NA rs941
26 <NA> 34 3 <NA>
27 <NA> 11364 3 <NA>
28 <NA> 37 5 <NA>
29 <NA> 11370 5 <NA>
30 <NA> 38 6 <NA>
31 <NA> 90 7 <NA>
32 <NA> 6442 9 <NA>
33 <NA> 20391 9 <NA>
34 <NA> 10539 NA rs872
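The NA rows above suggest that, at least in the sample, most GeneIDs do not appear in all three tables. A quick way to check that assumption before joining could look like the sketch below. The data frames are rebuilt from the dput() output in the question (as plain character columns rather than factors), and the names ids, in_all, all_ids and missing_somewhere are just illustrative.

```r
# Sample data rebuilt from the dput() output in the question
data1 <- data.frame(
  ID = c("rs243", "rs790", "rs791", "rs818", "rs855",
         "rs856", "rs907", "rs916", "rs916", "rs941"),
  GeneID = c(7093L, 3778L, 3778L, 7093L, 10151L,
             10151L, 221037L, 55747L, 387680L, 414308L)
)
data2 <- data.frame(
  HID = c(3L, 3L, 5L, 5L, 6L, 10L, 7L, 7L, 9L, 9L),
  GeneID = c(34L, 11364L, 37L, 11370L, 38L,
             10151L, 90L, 10151L, 6442L, 20391L)
)
data3 <- data.frame(
  SNP_locusID = c("rs852", "rs853", "rs854", "rs856", "rs872",
                  "rs907", "rs916", "rs916", "rs941", "rs778"),
  GeneID = c(10151L, 10151L, 10151L, 10151L, 10539L,
             221037L, 55747L, 387680L, 414308L, 55747L)
)

# Distinct GeneIDs per table
ids <- list(data1 = unique(data1$GeneID),
            data2 = unique(data2$GeneID),
            data3 = unique(data3$GeneID))

in_all  <- Reduce(intersect, ids)  # GeneIDs present in every table
all_ids <- Reduce(union, ids)      # GeneIDs present in at least one table
missing_somewhere <- setdiff(all_ids, in_all)

# Logical matrix: one row per GeneID, one column per table
presence <- sapply(ids, function(x) all_ids %in% x)
rownames(presence) <- all_ids
print(presence)
```

On the sample data only GeneID 10151 is present in all three tables, so a full (outer) join is the right choice if you don't want to lose the rest.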
I didn't deal with duplicates, as it's not clear which one you want to keep. If you just want to keep the first one, pre-process much as you suggested yourself: data1 <- data1[!duplicated(data1$GeneID), ], and likewise for data2 and data3, before joining.
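To make that concrete: in the join above, GeneID 10151 expands to 2 x 2 x 4 = 16 rows, because it occurs twice in data1, twice in data2 and four times in data3. Below is a rough sketch of two ways to get down to one row per GeneID, using base R's merge(..., all = TRUE) as the full join instead of plyr. The helper names keep_first and collapse_by_gene are made up for this example, and the ";" separator is an arbitrary choice; the data frames are again rebuilt from the dput() output in the question.

```r
# Sample data rebuilt from the dput() output in the question
data1 <- data.frame(
  ID = c("rs243", "rs790", "rs791", "rs818", "rs855",
         "rs856", "rs907", "rs916", "rs916", "rs941"),
  GeneID = c(7093L, 3778L, 3778L, 7093L, 10151L,
             10151L, 221037L, 55747L, 387680L, 414308L)
)
data2 <- data.frame(
  HID = c(3L, 3L, 5L, 5L, 6L, 10L, 7L, 7L, 9L, 9L),
  GeneID = c(34L, 11364L, 37L, 11370L, 38L,
             10151L, 90L, 10151L, 6442L, 20391L)
)
data3 <- data.frame(
  SNP_locusID = c("rs852", "rs853", "rs854", "rs856", "rs872",
                  "rs907", "rs916", "rs916", "rs941", "rs778"),
  GeneID = c(10151L, 10151L, 10151L, 10151L, 10539L,
             221037L, 55747L, 387680L, 414308L, 55747L)
)

# Full outer join of a list of data frames on GeneID
full_join_all <- function(tables) {
  Reduce(function(x, y) merge(x, y, by = "GeneID", all = TRUE), tables)
}

# Option 1 (hypothetical helper): keep only the first row seen per GeneID
keep_first <- function(df) df[!duplicated(df$GeneID), , drop = FALSE]
first_only <- full_join_all(lapply(list(data1, data2, data3), keep_first))

# Option 2 (hypothetical helper): keep every value, pasted into one string
collapse_by_gene <- function(df) {
  aggregate(df[setdiff(names(df), "GeneID")], by = df["GeneID"],
            FUN = function(v) paste(unique(v), collapse = ";"))
}
collapsed <- full_join_all(lapply(list(data1, data2, data3), collapse_by_gene))

# Both results should have exactly one row per GeneID
stopifnot(!anyDuplicated(first_only$GeneID),
          !anyDuplicated(collapsed$GeneID))
print(first_only)
print(collapsed)
```

Option 1 silently discards the other IDs for a gene, while option 2 keeps them at the cost of turning the columns into delimited strings, so which one fits depends on what you need downstream.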