
R: Optimizing a For Loop when input data frame is very large

I have a huge data frame called ppiensemble, containing almost 5 million rows. Here is a sample:

> head(ppiensemble, 10)
          protein1        protein2
1  ENSP00000000233 ENSP00000020673
2  ENSP00000000233 ENSP00000054666
3  ENSP00000000233 ENSP00000158762
4  ENSP00000000233 ENSP00000203407
5  ENSP00000000233 ENSP00000203630
6  ENSP00000000233 ENSP00000215071
7  ENSP00000000233 ENSP00000215115
8  ENSP00000000233 ENSP00000215375
9  ENSP00000000233 ENSP00000215565
10 ENSP00000000233 ENSP00000215574

The goal is to convert every item in the column protein1 to an alternate ID taken from a separate data frame called idconversiontable: for each protein1 value that appears in idconversiontable$To, I want the corresponding character value in idconversiontable$From. Note that idconversiontable has only around 50,000 rows:

> head(idconversiontable, 10)
                To   From
1  ENSP00000167825 Q9HCE6
2  ENSP00000355060 Q9HCE6
3  ENSP00000364564 Q9HCE6
4  ENSP00000244303 Q9Y2N7
5  ENSP00000300862 Q9Y2N7
6  ENSP00000366898 Q9Y2N7
7  ENSP00000255324 Q9BXT8
8  ENSP00000255325 Q9BXT8
9  ENSP00000322242 Q8N5U6
10 ENSP00000415682 Q8N5U6

I try to do that below by filling a character vector called demo1 for protein1. It works for small sets, but on the full data it takes forever, and I will eventually need to do the same for protein2. Any ideas on how to speed this up?

demo1 <- vector(mode="character", length=nrow(ppiensemble))
for(i in 1:nrow(ppiensemble)) {
  demo1[i] <- try(ifelse(ppiensemble$protein1[i] %in% idconversiontable$To,
                         as.character(idconversiontable[which(idconversiontable$To == ppiensemble$protein1[i]), 2]),
                         "NA"))
}

Additionally (under the same topic of "optimization"), is there a way to print a message every time 5000 rows are done (i.e., every time i is a multiple of 5000)?

Think of your conversion table as a map

map = setNames(idconversiontable$From, idconversiontable$To)

Then use the names on the map to go from each protein ID to its alternate ID, all in one vectorized step:

genes = map[ppiensemble$protein1]
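
To make the lookup concrete, here is a minimal sketch applied to both columns. The names conv_toy, ppi_toy, alt1 and alt2 are hypothetical, and while the IDs are taken from the samples in the question, the pairings in ppi_toy are made up for illustration. It also assumes the ID columns may be factors, hence the as.character() calls: a factor index would select by integer code rather than by name.

```r
# Toy stand-ins for the real tables. The IDs come from the samples in the
# question; the protein1/protein2 pairings are invented for illustration.
conv_toy <- data.frame(
  To   = c("ENSP00000167825", "ENSP00000244303", "ENSP00000255324"),
  From = c("Q9HCE6", "Q9Y2N7", "Q9BXT8"),
  stringsAsFactors = FALSE
)
ppi_toy <- data.frame(
  protein1 = c("ENSP00000167825", "ENSP00000000233", "ENSP00000255324"),
  protein2 = c("ENSP00000244303", "ENSP00000167825", "ENSP00000020673"),
  stringsAsFactors = FALSE
)

# Build the lookup once: values are the alternate IDs, names are the ENSP IDs.
map <- setNames(as.character(conv_toy$From), as.character(conv_toy$To))

# Index by name; as.character() guards against factor columns.
# IDs that are not in the map come back as NA.
ppi_toy$alt1 <- unname(map[as.character(ppi_toy$protein1)])
ppi_toy$alt2 <- unname(map[as.character(ppi_toy$protein2)])

print(ppi_toy)
```

If the same ID appears more than once in the To column, indexing by name returns the value for the first occurrence only.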

This 'just works' when looking up IDs that aren't present in the map (they come back as NA), e.g.,

map = setNames(c("a", "b"), c("A", "B"))
map[c("A", "C")]
##   A <NA> 
## "a"   NA 

or, perhaps slightly neater, with the names dropped from the result

unname(map[c("A", "C")])
## [1] "a" NA 
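
On the side question about progress messages: in the original loop, a line such as if (i %% 5000 == 0) message(i, " rows done") would do it. With a vectorized lookup there is no loop to report from, but if progress output is still wanted, one option is to work through the IDs in chunks. The following is a minimal sketch under that assumption; the helper name lookup_with_progress and its arguments are hypothetical, and it uses base R's match() as an equivalent of the named-vector lookup.

```r
# Hypothetical helper: vectorized lookup done chunk by chunk so that a
# progress message can be printed after each chunk.
lookup_with_progress <- function(ids, to, from, chunk_size = 5000) {
  ids  <- as.character(ids)
  to   <- as.character(to)
  from <- as.character(from)
  if (length(ids) == 0) return(character(0))

  out <- character(length(ids))
  for (s in seq(1, length(ids), by = chunk_size)) {
    e <- min(s + chunk_size - 1, length(ids))
    # match() gives the position of the first hit in `to`, or NA if none,
    # and indexing `from` by NA yields NA.
    out[s:e] <- from[match(ids[s:e], to)]
    message(e, " rows done")
  }
  out
}

# Small demonstration with a chunk size of 2
res <- lookup_with_progress(c("A", "C", "B", "A", "D"),
                            to = c("A", "B"), from = c("a", "b"),
                            chunk_size = 2)
print(res)
```

For the data in the question, the call would look like lookup_with_progress(ppiensemble$protein1, idconversiontable$To, idconversiontable$From). The chunking is there only for the messages; a single match() over the whole column gives the same result.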


 