
Joining 3 Data Sets to Analyse Duplicated Rows

I'm attempting to pre-process three sets of data relating to microarray experiments. Each data set comes from a CSV file and is a translation table for gene data. There is a common column (foreign key), GeneID, present in all 3 data frames. It is assumed, but not confirmed, that each GeneID value is present in all of the data files.

An example from the data:

 Data 1:                   Data 2:               Data 3:
 ID           GeneID  ;    HID     GeneID    ;   SNP_locusID    GeneID
 rs243        7093    ;    3       34        ;   rs852          10151
 rs790        3778    ;    3       11364     ;   rs853          10151
 rs791        3778    ;    5       37        ;   rs854          10151
 rs818        7093    ;    5       11370     ;   rs856          10151
 rs855        10151   ;    6       38        ;   rs872          10539
 rs856        10151   ;    10      10151     ;   rs907          221037
 rs907        221037  ;    7       90        ;   rs916          55747
 rs916        55747   ;    7       10151     ;   rs916          387680
 rs916        387680  ;    9       6442      ;   rs941          414308
 rs941        414308  ;    9       20391     ;   rs778          55747

There are potentially many-to-many, one-to-many or many-to-one relations between GeneID, HID and SNP_locusID. The largest of the CSV files has circa 1,000,000 rows but execution speed is not a critical consideration here.

In order to select an appropriate way to deal with the duplicated values, I'm trying to create a single, comprehensive table that shows each GeneID with its corresponding ID, HID and SNP_locusID values, i.e.

GeneID         ID         HID         SNP_locusID
10151          rs855      10          rs852
10151          rs856      7           rs853
10151          NA         NA          rs854

etc. The next step would then be to find each duplicated value of the GeneID and remove the duplicated rows, in order to have a single unique GeneID per row.

I've tried using sqldf, but it doesn't seem to support a full outer join, which I assume is what I'll need to create the desired output (my SQL knowledge is very basic, so advice is appreciated!). I've also tried analysing each data file individually first, to find the duplicate GeneIDs, via

data1[duplicated(data1[, 'GeneID']),]

and then trying to merge the data sets, but I am not sure whether this is the best approach to consolidating the data down to a single GeneID per row.

EDIT: thanks Martin and Hans, here are the results of dput. The Data 1 output is now corrected as well.

> dput(data1)
structure(list(ID = structure(c(1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 8L, 9L), .Label = c("rs243", "rs790", "rs791", "rs818", 
"rs855", "rs856", "rs907", "rs916", "rs941"), class = "factor"), 
GeneID = c(7093L, 3778L, 3778L, 7093L, 10151L, 10151L, 221037L, 
55747L, 387680L, 414308L)), .Names = c("ID", "GeneID"
), class = "data.frame", row.names = c(NA, -10L))

> dput(data2)
structure(list(HID = c(3L, 3L, 5L, 5L, 6L, 10L, 7L, 7L, 9L, 9L
), GeneID = c(34L, 11364L, 37L, 11370L, 38L, 10151L, 90L, 10151L, 
6442L, 20391L)), .Names = c("HID", "GeneID"), class = "data.frame", row.names = c(NA, 
-10L))

> dput(data3)
structure(list(SNP_locusID = structure(c(2L, 3L, 4L, 5L, 6L, 
7L, 8L, 8L, 9L, 1L), .Label = c("rs778", "rs852", "rs853", "rs854", 
"rs856", "rs872", "rs907", "rs916", "rs941"), class = "factor"), 
GeneID = c(10151L, 10151L, 10151L, 10151L, 10539L, 221037L, 
55747L, 387680L, 414308L, 55747L)), .Names = c("SNP_locusID", 
"GeneID"), class = "data.frame", row.names = c(NA, -10L))

I think you can use plyr::join, which is pretty quick:

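library(plyr)
# Full (outer) join of data1 and data2 on GeneID, keeping unmatched rows from both
all_genes <- join(data1, data2, by = "GeneID", type = "full")
# Join the result with data3 in the same way
all_genes <- join(all_genes, data3, by = "GeneID", type = "full")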

> all_genes
      ID GeneID HID SNP_locusID
1  rs243   7093  NA        <NA>
2  rs790   3778  NA        <NA>
3  rs791   3778  NA        <NA>
4  rs818   7093  NA        <NA>
5  rs855  10151  10       rs852
6  rs855  10151  10       rs853
7  rs855  10151  10       rs854
8  rs855  10151  10       rs856
9  rs855  10151   7       rs852
10 rs855  10151   7       rs853
11 rs855  10151   7       rs854
12 rs855  10151   7       rs856
13 rs856  10151  10       rs852
14 rs856  10151  10       rs853
15 rs856  10151  10       rs854
16 rs856  10151  10       rs856
17 rs856  10151   7       rs852
18 rs856  10151   7       rs853
19 rs856  10151   7       rs854
20 rs856  10151   7       rs856
21 rs907 221037  NA       rs907
22 rs916  55747  NA       rs916
23 rs916  55747  NA       rs778
24 rs916 387680  NA       rs916
25 rs941 414308  NA       rs941
26  <NA>     34   3        <NA>
27  <NA>  11364   3        <NA>
28  <NA>     37   5        <NA>
29  <NA>  11370   5        <NA>
30  <NA>     38   6        <NA>
31  <NA>     90   7        <NA>
32  <NA>   6442   9        <NA>
33  <NA>  20391   9        <NA>
34  <NA>  10539  NA       rs872
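
The 16 rows for GeneID 10151 in this output arise because the join pairs every matching row from one table with every matching row from the others. A minimal base R sketch to check that count, using only the GeneID columns from the dput output in the question (the vector names g1, g2 and g3 are made up for illustration):

```r
# GeneID columns copied from the dput() output of data1, data2 and data3
g1 <- c(7093, 3778, 3778, 7093, 10151, 10151, 221037, 55747, 387680, 414308)
g2 <- c(34, 11364, 37, 11370, 38, 10151, 90, 10151, 6442, 20391)
g3 <- c(10151, 10151, 10151, 10151, 10539, 221037, 55747, 387680, 414308, 55747)

# Number of rows carrying GeneID 10151 in each table
n <- c(data1 = sum(g1 == 10151), data2 = sum(g2 == 10151), data3 = sum(g3 == 10151))
n

# The join emits one row per combination: 2 * 2 * 4
prod(n)  # 16
```

With around 1,000,000 rows and many-to-many keys, this multiplication is worth checking before joining, since a handful of heavily repeated GeneIDs can inflate the result considerably.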

I didn't deal with duplicates, as it's not clear which one you want to keep. If you just want to keep the first one, pre-process much as you suggested yourself: data1 <- data1[!duplicated(data1$GeneID), ], etc.
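
As a rough sketch of the "keep the first one" route, the following uses base R's merge with all = TRUE as a stand-in for the full join, so it does not need plyr. The three small data frames are cut-down versions of the question's data, and the helper name keep_first is hypothetical. Whether the first occurrence is the right one to keep is an assumption that depends on the analysis.

```r
# Cut-down versions of the three tables from the question
data1 <- data.frame(ID = c("rs855", "rs856", "rs907"),
                    GeneID = c(10151L, 10151L, 221037L))
data2 <- data.frame(HID = c(10L, 7L, 3L),
                    GeneID = c(10151L, 10151L, 34L))
data3 <- data.frame(SNP_locusID = c("rs852", "rs853", "rs872"),
                    GeneID = c(10151L, 10151L, 10539L))

# Hypothetical helper: keep only the first row seen for each GeneID
keep_first <- function(df) df[!duplicated(df$GeneID), , drop = FALSE]

# De-duplicate each table, then full outer join them pairwise on GeneID
tables <- lapply(list(data1, data2, data3), keep_first)
one_per_gene <- Reduce(function(x, y) merge(x, y, by = "GeneID", all = TRUE),
                       tables)

one_per_gene
```

Because each input has at most one row per GeneID before the join, the result also has one row per GeneID, with NA wherever a table has no entry for that gene.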
