
Searching for specific genes within a large matrix, loop needs optimization

For each iteration of my code, I start by looking up values in a PPI network for a set of source genes, which is recalculated at the end of each run. If a gene is found in either the Gene.A or Gene.B column, I store both the score and the gene's complement (its interaction partner) in another location to be sorted and concatenated for further testing. The search process is slow and I am looking for any way to optimize it. I have begun to work with batches of 300+ genes, and three days to run one calculation is too long. Extra information: the matrix holds all interactions within the PPI network, roughly 176,000 x 3.

The slow code:

#CREATE THE DNS LIST
    DNSList = FALSE
    DNSListNameHolder = NA
    DNSListValueHolder = NA
    DNSListHolder = 0
    CN = 0
    Prev = 0
    while(!DNSList)
    {
        CN = CN + 1
        IDNSList = FALSE
        CNN = Prev
        while(!IDNSList)
        {
            CNN = CNN + 1
            if(as.character(ForDNSList$Gene.A[CNN]) == as.character(Candidate[CN]))
            {
                DNSListHolder = DNSListHolder  + 1
                DNSListValueHolder[DNSListHolder] = as.character(ForDNSList$Score[CNN])
                DNSListValueHolder[DNSListHolder] = as.numeric(DNSListValueHolder[DNSListHolder]) 
                DNSListNameHolder [DNSListHolder] = as.character(ForDNSList$Gene.B[CNN])
            }
            if(as.character(ForDNSList$Gene.B[CNN]) == as.character(Candidate[CN]))
            {
                DNSListHolder = DNSListHolder  + 1
                DNSListValueHolder[DNSListHolder] = as.character(ForDNSList$Score[CNN])
                DNSListValueHolder[DNSListHolder] = as.numeric(DNSListValueHolder[DNSListHolder]) 
                DNSListNameHolder [DNSListHolder] = as.character(ForDNSList$Gene.A[CNN])
            }
            if(CNN == length(ForDNSList$Gene.A))
                IDNSList = TRUE
        }
        if(CN == length(Candidate))
            DNSList = TRUE
        print(paste("Pre-DNS List in Progress",CN/length(Candidate), sep = " "))    
    }
    print("Pre-DNS List Completed")

For example purposes, the Candidate list can be set to this:

Candidate = c("BRCA1", "BRCA2", "ATK1", "FYN")

The ForDNSList is long, so here is a small excerpt to show how the list looks. Whether the gene I am searching for appears in column Gene.A or Gene.B is more or less random.

> ForDNSList[1:50, 1:3]
     Gene.A     Gene.B Score
1    Q96BE0     POLR3A 0.126
2    Q96BE0      PDPK1 0.126
3    Q96BE0      MGEA5 0.126
4    Q96BE0     DNAJA2 0.126
5    Q96BE0     DNAJB6 0.126
6    Q96BE0       BAG4 0.126
7    Q96BE0     HSPA4L 0.126
8     THAP1 A0A024RA76 0.332
9    Q96BE0       BAG2 0.236
10   Q96BE0       BAG3 0.236
11   Q96BE0       EGFR 0.236
12   Q96BE0        MOS 0.126
13   Q96BE0       RAF1 0.126
14   Q96BE0     GABRB1 0.126
15   Q96BE0       GNAZ 0.126
16    MS4A7      HMGCL 0.286
17   Q96BE0     ATP5A1 0.126
18   Q96BE0     DNAJA1 0.126
19     DVL3      PPM1A 0.210
20   Q96BE0       MCM5 0.126
21   Q96BE0       MCM7 0.126
22   Q96BE0      HSPA4 0.126
23   Q96BE0      PSMC2 0.126
24   Q96BE0       GNAL 0.126
25   Q96BE0        AMT 0.126
26    MECP2      SOX18 0.286
27   Q96BE0     CSNK1E 0.126
28   Q96BE0       ST13 0.126
29  CSNK2A1       MYH9 0.454
30   Q96BE0       CDK9 0.126
31   Q96BE0     SEC24C 0.126
32   TUBA4A       MYH9 0.081
33   Q96BE0      HSPA2 0.236
34   Q96BE0      PRAME 0.126
35   Q96BE0      FANCC 0.126
36   Q96BE0       HSF2 0.126
37      KDR      MYO1C 0.126
38   Q96BE0      HCFC1 0.126
39   Q96BE0      RAD51 0.126
40      KDR        FYN 0.210
41   Q96BE0      PSMD2 0.126
42   Q96BE0       SKP2 0.126
43      KDR        MET 0.376
44   Q96BE0      IKBKE 0.126
45   Q96BE0      ENDOG 0.126
46   Q96BE0      GNA13 0.126
47   TSG101      EIF3L 0.183
48   Q96BE0     SETDB1 0.126
49   Q96BE0      CDK10 0.126
50 HSP90AB1     TNNI3K 0.126
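Given that layout, the whole search can probably be written without any explicit loop. The following is a minimal sketch, not the asker's code: it assumes a four-row stand-in for ForDNSList (rows 11, 29, 40 and 43 of the excerpt) and uses `%in%` to test every row against the whole Candidate vector at once. The helper names `gene_a`, `gene_b`, `hit_a` and `hit_b` are introduced here for illustration.

```r
# Toy stand-in for ForDNSList, built from rows 11, 29, 40 and 43 of the excerpt
ForDNSList <- data.frame(
  Gene.A = c("Q96BE0", "CSNK2A1", "KDR", "KDR"),
  Gene.B = c("EGFR", "MYH9", "FYN", "MET"),
  Score  = c(0.236, 0.454, 0.210, 0.376),
  stringsAsFactors = FALSE
)
Candidate <- c("BRCA1", "BRCA2", "ATK1", "FYN")

# as.character() guards against the columns being factors
gene_a <- as.character(ForDNSList$Gene.A)
gene_b <- as.character(ForDNSList$Gene.B)

# One logical value per row: is the gene in this column a candidate?
hit_a <- gene_a %in% Candidate
hit_b <- gene_b %in% Candidate

# For a hit in Gene.A keep the partner from Gene.B, and vice versa
DNSListNameHolder  <- c(gene_b[hit_a], gene_a[hit_b])
DNSListValueHolder <- c(ForDNSList$Score[hit_a], ForDNSList$Score[hit_b])
```

Two differences from the nested loops are worth checking before relying on this: the results come out grouped by column (all Gene.A hits, then all Gene.B hits) rather than by candidate, and the scores stay numeric instead of being stored as character strings. Since the results are sorted afterwards, the ordering may not matter.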

Thanks to a suggestion above, I removed a loop and replaced it with two match() calls. The original code took about 196 seconds to do the first iterations, whereas this took only 20.4 seconds.

Nx = 0
DNSList = FALSE
DNSListNameHolder = NA
DNSListValueHolder = NA
DNSListHolder = 0
CN = 0
Prev = 0
system.time(while(Nx < length(ForDNSList$Gene.A))
{
    Nx = Nx + 1
    #Check if Gene A is a candidate disease gene
    if(any(!is.na(match(Candidate,ForDNSList$Gene.A[Nx]))))
    {
        #If so, advance the holder by one and fill the secondary variables with the complement and score info
        #(rows are indexed by Nx, the current row; CNN is never set in this loop)
        DNSListHolder = DNSListHolder  + 1
        DNSListValueHolder[DNSListHolder] = as.character(ForDNSList$Score[Nx])
        DNSListValueHolder[DNSListHolder] = as.numeric(DNSListValueHolder[DNSListHolder])
        DNSListNameHolder [DNSListHolder] = as.character(ForDNSList$Gene.B[Nx])
    }
    #Check if Gene B is a candidate disease gene
    if(any(!is.na(match(Candidate,ForDNSList$Gene.B[Nx]))))
    {
        #If so, advance the holder by one and fill the secondary variables with the complement and score info
        DNSListHolder = DNSListHolder  + 1
        DNSListValueHolder[DNSListHolder] = as.character(ForDNSList$Score[Nx])
        DNSListValueHolder[DNSListHolder] = as.numeric(DNSListValueHolder[DNSListHolder])
        DNSListNameHolder [DNSListHolder] = as.character(ForDNSList$Gene.A[Nx])
    }
    print(Nx)
})
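Because the network itself stays fixed and only the candidate genes change between runs, one further option is to build a lookup index once and reuse it in every iteration. The sketch below is an assumption about how that could look, not part of the original post: `edges`, `by_gene` and `lookup_partners` are hypothetical names, the toy table uses five rows from the excerpt, and the candidate vector is made up so that some genes actually match.

```r
# Toy stand-in for ForDNSList: rows 37, 40, 43, 29 and 32 of the excerpt
ForDNSList <- data.frame(
  Gene.A = c("KDR", "KDR", "KDR", "CSNK2A1", "TUBA4A"),
  Gene.B = c("MYO1C", "FYN", "MET", "MYH9", "MYH9"),
  Score  = c(0.126, 0.210, 0.376, 0.454, 0.081),
  stringsAsFactors = FALSE
)

# Build once: list every interaction in both directions,
# so a gene can be looked up regardless of which column it was in
edges <- data.frame(
  gene    = c(ForDNSList$Gene.A, ForDNSList$Gene.B),
  partner = c(ForDNSList$Gene.B, ForDNSList$Gene.A),
  score   = c(ForDNSList$Score,  ForDNSList$Score),
  stringsAsFactors = FALSE
)
# Named list: one small data frame of partners and scores per gene
by_gene <- split(edges[c("partner", "score")], edges$gene)

# Per run: fetch the partners of whichever candidates are in the network
lookup_partners <- function(candidates, index) {
  found <- intersect(candidates, names(index))
  if (length(found) == 0) {
    return(data.frame(partner = character(0), score = numeric(0),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, unname(index[found]))
}

# Illustrative candidates only; BRCA1 is absent from the toy table
res <- lookup_partners(c("KDR", "MYH9", "BRCA1"), by_gene)
```

The one-off cost is doubling the table in memory (about 352,000 rows for the full network), after which each run only pays for a handful of list lookups.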

It's best to use one of the apply functions. The base versions (apply, sapply, lapply) still run on a single core, but code written with them converts easily to parallel counterparts such as mclapply or parLapply from the parallel package, so with more cores your operations can go faster. Besides, using functions is probably better than using loops, as it's more modular and easier to code for.

Here's an example from my own code, showing a partial implementation of the "outlier-resistant" Z-score algorithm:

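library(SummarizedExperiment)  # provides assays(); sum_exp is assumed to be a SummarizedExperiment with an "fpkm" assay

rw <- assays(sum_exp)$fpkm

# remove genes that have zero counts
rw <- rw[apply(rw, 1, function(x){return (sum(x)>0)}),]

# per-sample median of the non-zero values (the variable keeps its original name, sample_means)
sample_means <- apply(rw, 2, function(x){median(x[x>0])})
z_median <- median(sample_means)
z_mad <- mad(sample_means)
z_scores <- unlist(lapply(sample_means, function(x) {return ((x - z_median)/(z_mad))}))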
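For readers without a SummarizedExperiment object at hand, here is a self-contained sketch of the same calculation on a made-up matrix. It also shows that the final lapply step can be replaced by plain vectorised arithmetic, which gives the same values. The matrix contents and the names `sample_medians` and `z_loop` are assumptions for illustration only.

```r
# Made-up expression matrix: 4 genes (rows) x 3 samples (columns)
rw <- matrix(c(0,  0, 0,
               4,  0, 4,
               2,  6, 8,
               6, 12, 0),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

# Drop genes whose counts are all zero (assumes non-negative values)
rw <- rw[rowSums(rw) > 0, , drop = FALSE]

# Per-sample median of the non-zero values
sample_medians <- apply(rw, 2, function(x) median(x[x > 0]))

z_median <- median(sample_medians)
z_mad    <- mad(sample_medians)

# Vectorised form: arithmetic on the whole vector at once
z_scores <- (sample_medians - z_median) / z_mad

# Element-by-element form, as in the snippet above, for comparison
z_loop <- unlist(lapply(sample_medians, function(x) (x - z_median) / z_mad))
```

If every sample has the same median, `mad()` returns 0 and the division yields non-finite values, so real data may need a guard for that case.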

To conceptualize it, consider a for loop in which one iteration can modify more than one element, such as a loop implementing Fibonacci, where each step depends on the results of earlier steps. Such a loop cannot safely be run in parallel because its iterations cannot be isolated from one another. With apply, sapply, and lapply, each row/column/element is computed in isolation, and therefore it is safe to divide the work among different cores. The base functions do not split the work themselves; that requires a parallel variant such as parLapply.
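A minimal sketch of dividing independent work among cores, using the parallel package that ships with R. Base lapply runs sequentially; parLapply is the variant that distributes the elements over worker processes. The task here (`count_hits`, counting how often a gene appears in either column) and the toy vectors are hypothetical and only serve to show that both calls return the same result.

```r
library(parallel)

# Toy gene columns taken from the excerpt in the question
gene_a <- c("KDR", "KDR", "KDR", "CSNK2A1", "TUBA4A")
gene_b <- c("MYO1C", "FYN", "MET", "MYH9", "MYH9")
candidates <- c("KDR", "MYH9", "FYN")

# Each call depends only on its own arguments, so calls are independent
count_hits <- function(gene, a, b) sum(a == gene) + sum(b == gene)

# Sequential version
serial <- lapply(candidates, count_hits, a = gene_a, b = gene_b)

# Parallel version: two worker processes (the number 2 is an arbitrary choice)
cl <- makeCluster(2)
parallel_res <- parLapply(cl, candidates, count_hits, a = gene_a, b = gene_b)
stopCluster(cl)

same <- identical(serial, parallel_res)
```

Starting workers and shipping data to them has a fixed cost, so for a task this small the sequential call will be faster. Parallelism tends to pay off only when each element involves substantial work, and vectorising the search is usually worth trying first.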
