For each iteration of my code, I start by looking up values in a PPI network for a source list of genes, which is recalculated at the end of each run. If a gene is found in either the A or B column, I store both the score and the gene's complement (its interaction partner) in another location to be sorted and concatenated for further testing. The search takes a long time and I am looking for any way to optimize it. I have begun to work with 300+ genes in a batch, and three days to run a calculation is too long. Extra information: the matrix holds all interactions within the PPI network, totaling ~176,000 rows x 3 columns.
The slow code:
#CREATE THE DNS LIST
DNSList = FALSE
DNSListNameHolder = NA
DNSListValueHolder = NA
DNSListHolder = 0
CN = 0
Prev = 0
while(!DNSList)
{
CN = CN + 1
IDNSList = FALSE
CNN = Prev
while(!IDNSList)
{
CNN = CNN + 1
if(as.character(ForDNSList$Gene.A[CNN]) == as.character(Candidate[CN]))
{
DNSListHolder = DNSListHolder + 1
DNSListValueHolder[DNSListHolder] = as.character(ForDNSList$Score[CNN])
DNSListValueHolder[DNSListHolder] = as.numeric(DNSListValueHolder[DNSListHolder])
DNSListNameHolder [DNSListHolder] = as.character(ForDNSList$Gene.B[CNN])
}
if(as.character(ForDNSList$Gene.B[CNN]) == as.character(Candidate[CN]))
{
DNSListHolder = DNSListHolder + 1
DNSListValueHolder[DNSListHolder] = as.character(ForDNSList$Score[CNN])
DNSListValueHolder[DNSListHolder] = as.numeric(DNSListValueHolder[DNSListHolder])
DNSListNameHolder [DNSListHolder] = as.character(ForDNSList$Gene.A[CNN])
}
if(CNN == length(ForDNSList$Gene.A))
IDNSList = TRUE
}
if(CN == length(Candidate))
DNSList = TRUE
print(paste("Pre-DNS List in Progress",CN/length(Candidate), sep = " "))
}
print("Pre-DNS List Completed")
For example purposes, the Candidate list can be set to this:
Candidate = c("BRCA1", "BRCA2", "ATK1", "FYN")
The ForDNSList is long, so here is a small excerpt to show how the list looks. Whether the gene I am searching for appears in gene column A or B is more or less random.
> ForDNSList[1:50, 1:3]
Gene.A Gene.B Score
1 Q96BE0 POLR3A 0.126
2 Q96BE0 PDPK1 0.126
3 Q96BE0 MGEA5 0.126
4 Q96BE0 DNAJA2 0.126
5 Q96BE0 DNAJB6 0.126
6 Q96BE0 BAG4 0.126
7 Q96BE0 HSPA4L 0.126
8 THAP1 A0A024RA76 0.332
9 Q96BE0 BAG2 0.236
10 Q96BE0 BAG3 0.236
11 Q96BE0 EGFR 0.236
12 Q96BE0 MOS 0.126
13 Q96BE0 RAF1 0.126
14 Q96BE0 GABRB1 0.126
15 Q96BE0 GNAZ 0.126
16 MS4A7 HMGCL 0.286
17 Q96BE0 ATP5A1 0.126
18 Q96BE0 DNAJA1 0.126
19 DVL3 PPM1A 0.210
20 Q96BE0 MCM5 0.126
21 Q96BE0 MCM7 0.126
22 Q96BE0 HSPA4 0.126
23 Q96BE0 PSMC2 0.126
24 Q96BE0 GNAL 0.126
25 Q96BE0 AMT 0.126
26 MECP2 SOX18 0.286
27 Q96BE0 CSNK1E 0.126
28 Q96BE0 ST13 0.126
29 CSNK2A1 MYH9 0.454
30 Q96BE0 CDK9 0.126
31 Q96BE0 SEC24C 0.126
32 TUBA4A MYH9 0.081
33 Q96BE0 HSPA2 0.236
34 Q96BE0 PRAME 0.126
35 Q96BE0 FANCC 0.126
36 Q96BE0 HSF2 0.126
37 KDR MYO1C 0.126
38 Q96BE0 HCFC1 0.126
39 Q96BE0 RAD51 0.126
40 KDR FYN 0.210
41 Q96BE0 PSMD2 0.126
42 Q96BE0 SKP2 0.126
43 KDR MET 0.376
44 Q96BE0 IKBKE 0.126
45 Q96BE0 ENDOG 0.126
46 Q96BE0 GNA13 0.126
47 TSG101 EIF3L 0.183
48 Q96BE0 SETDB1 0.126
49 Q96BE0 CDK10 0.126
50 HSP90AB1 TNNI3K 0.126
Thanks to a suggestion above, I removed a loop and replaced it with two match() calls. The original code took about 196 seconds to do the first iterations, whereas this took only 20.4 seconds:
Nx = 0
DNSList = FALSE
DNSListNameHolder = NA
DNSListValueHolder = NA
DNSListHolder = 0
CN = 0
Prev = 0
system.time(while(Nx < length(ForDNSList$Gene.A))
{
  Nx = Nx + 1
  #Check if Gene A is a candidate disease gene
  if(is.element("TRUE",!is.na(match(Candidate,ForDNSList$Gene.A[Nx]))))
  {
    #if so, push the holder one further and fill the secondary variables with the complement and score info
    #(index with Nx; CNN was the counter of the inner loop that has been removed)
    DNSListHolder = DNSListHolder + 1
    DNSListValueHolder[DNSListHolder] = as.numeric(as.character(ForDNSList$Score[Nx]))
    DNSListNameHolder [DNSListHolder] = as.character(ForDNSList$Gene.B[Nx])
  }
  #Check if Gene B is a candidate disease gene
  if(is.element("TRUE",!is.na(match(Candidate,ForDNSList$Gene.B[Nx]))))
  {
    #if so, push the holder one further and fill the secondary variables with the complement and score info
    DNSListHolder = DNSListHolder + 1
    DNSListValueHolder[DNSListHolder] = as.numeric(as.character(ForDNSList$Score[Nx]))
    DNSListNameHolder [DNSListHolder] = as.character(ForDNSList$Gene.A[Nx])
  }
  print(Nx)
})
It's best to use one of the apply functions. Base R's apply, sapply, and lapply run on a single core, but because each call works on one element in isolation, the same code can be handed to a parallel counterpart such as parallel::mclapply, so with more cores your operations will probably go faster. And, besides, using functions is probably better than using loops, as it's more modular and easier to code for.
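For the lookup in the question specifically, a loop-free version might look like the sketch below. It is only an illustration under stated assumptions: the small ForDNSList here is a toy stand-in built from a few rows of the excerpt, and a_hit and b_hit are names made up for this example. Note that the results come out grouped by column (all Gene.A hits, then all Gene.B hits) rather than candidate by candidate, which should not matter if the list is sorted afterwards.

```r
# Toy stand-in for the question's data (rows copied from the excerpt)
ForDNSList <- data.frame(
  Gene.A = c("Q96BE0", "KDR", "KDR", "TSG101"),
  Gene.B = c("POLR3A", "FYN", "MET", "EIF3L"),
  Score  = c(0.126, 0.210, 0.376, 0.183),
  stringsAsFactors = FALSE
)
Candidate <- c("BRCA1", "BRCA2", "ATK1", "FYN")

# as.character() guards against factor columns, as in the original code
gene_a <- as.character(ForDNSList$Gene.A)
gene_b <- as.character(ForDNSList$Gene.B)

# Logical masks (hypothetical names): rows with a candidate in column A / B
a_hit <- gene_a %in% Candidate
b_hit <- gene_b %in% Candidate

# For every hit, keep the partner gene and the score
DNSListNameHolder  <- c(gene_b[a_hit], gene_a[b_hit])
DNSListValueHolder <- c(ForDNSList$Score[a_hit], ForDNSList$Score[b_hit])
```

Each `%in%` call scans the whole column once, so the work no longer grows with a nested loop over candidates and rows.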
Here's an example from my own code, showing a partial implementation of the "outlier-resistant" Z-score algorithm:
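library(SummarizedExperiment)  # provides assays(); sum_exp is assumed to be a SummarizedExperiment with an "fpkm" assay
rw <- assays(sum_exp)$fpkm
# remove genes that have zero counts in every sample
rw <- rw[apply(rw, 1, function(x){return (sum(x)>0)}),]
# per-sample median of the non-zero values (despite the name, these are medians)
sample_means <- apply(rw, 2, function(x){median(x[x>0])})
z_median <- median(sample_means)
z_mad <- mad(sample_means)
# outlier-resistant Z-score: center on the median and scale by the MAD
z_scores <- unlist(lapply(sample_means, function(x) {return ((x - z_median)/(z_mad))}))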
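As a side note, the last step does not strictly need lapply, because arithmetic in R is already vectorized. The sketch below reruns the same steps on a small made-up matrix (the values and the g1..g4 / s1..s3 labels are invented for illustration, as is the name sample_medians) and checks that the vectorized form agrees with the lapply form.

```r
# Made-up expression matrix: 4 genes (rows) x 3 samples (columns)
rw <- matrix(c(0, 2, 4,
               0, 0, 0,
               1, 3, 5,
               2, 6, 10),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("g", 1:4), paste0("s", 1:3)))

# Drop genes with zero counts in every sample
rw <- rw[rowSums(rw) > 0, , drop = FALSE]

# Per-sample median of the non-zero values
sample_medians <- apply(rw, 2, function(x) median(x[x > 0]))

# Vectorized robust Z-score: (x - median) / MAD, computed for all samples at once
z_scores <- (sample_medians - median(sample_medians)) / mad(sample_medians)

# Element-by-element version, for comparison
z_scores_lapply <- unlist(lapply(sample_medians, function(x) {
  (x - median(sample_medians)) / mad(sample_medians)
}))
```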
If you want to conceptualize it, think of the possibility that you want to modify more than one element in one iteration of a for loop, like a loop implementing Fibonacci. R cannot run such a loop in parallel because it cannot isolate each row/column/element. With apply, sapply, and lapply, you can make the assumption that each row/column/element will be computed in isolation, and therefore it's safe to divide the work among different cores (for example, with the parallel package).
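A minimal sketch of that idea, using mclapply from the parallel package that ships with R. The slow_square function and the inputs are hypothetical placeholders for any per-element task, and the core count is an assumption: mclapply relies on forking, which is not available on Windows, so the sketch falls back to one core there.

```r
library(parallel)

# Hypothetical per-element task: each call depends only on its own input
slow_square <- function(x) {
  x^2
}

inputs <- 1:8

# Forking is unavailable on Windows, so use a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Same function, same inputs: once serially, once spread over cores
serial_result   <- lapply(inputs, slow_square)
parallel_result <- mclapply(inputs, slow_square, mc.cores = n_cores)

# Both approaches return the same list, in the same order
same <- identical(serial_result, parallel_result)
```

Because the function never touches anything outside its own argument, swapping lapply for mclapply changes only how the work is scheduled, not the result.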