I have a huge dataframe called ppiensemble, containing almost 5 million rows. Here is a sample:
> head(ppiensemble, 10)
protein1 protein2
1 ENSP00000000233 ENSP00000020673
2 ENSP00000000233 ENSP00000054666
3 ENSP00000000233 ENSP00000158762
4 ENSP00000000233 ENSP00000203407
5 ENSP00000000233 ENSP00000203630
6 ENSP00000000233 ENSP00000215071
7 ENSP00000000233 ENSP00000215115
8 ENSP00000000233 ENSP00000215375
9 ENSP00000000233 ENSP00000215565
10 ENSP00000000233 ENSP00000215574
The goal here is to convert all items in the column protein1 to an alternate ID coming from a separate dataframe called idconversiontable. For each item, I want to extract the corresponding character value in idconversiontable$From. Note as well that idconversiontable has only around 50,000 rows:
> head(idconversiontable, 10)
To From
1 ENSP00000167825 Q9HCE6
2 ENSP00000355060 Q9HCE6
3 ENSP00000364564 Q9HCE6
4 ENSP00000244303 Q9Y2N7
5 ENSP00000300862 Q9Y2N7
6 ENSP00000366898 Q9Y2N7
7 ENSP00000255324 Q9BXT8
8 ENSP00000255325 Q9BXT8
9 ENSP00000322242 Q8N5U6
10 ENSP00000415682 Q8N5U6
So, I try to do that below by setting up a vector called demo1 for protein1. It works for small sets, but this is just ridiculous... it's taking forever. Plus, I will eventually do the same for protein2 as well. Any ideas on how to expedite this process?
demo1 <- vector(mode="character", length=nrow(ppiensemble))
for(i in 1:nrow(ppiensemble)) {
  demo1[i] <- try(ifelse(ppiensemble$protein1[i] %in% idconversiontable$To,
                         as.character(idconversiontable[which(idconversiontable$To == ppiensemble$protein1[i]), 2]),
                         "NA"))
}
Additionally (under the same topic of "optimization"), is there a way to print a message every time 5000 rows are done (i.e., every time i is a multiple of 5000)?
Think of your conversion table as a map
map = setNames(idconversiontable$From, idconversiontable$To)
Then use the names on the map to go from protein ID to the alternate ID
genes = map[ppiensemble$protein1]
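One caveat worth checking: if the columns are factors (the default for character data in data frames before R 4.0), indexing by a factor uses its integer codes rather than its labels, so the lookup can silently return the wrong entries. Below is a minimal sketch that coerces to character first and converts both columns. The toy data frames and the output column names converted1 and converted2 are made up for illustration and are not from the original post.

```r
# Toy stand-ins for the real data frames (values are illustrative only)
idconversiontable <- data.frame(
  To   = c("ENSP00000167825", "ENSP00000355060", "ENSP00000244303"),
  From = c("Q9HCE6", "Q9HCE6", "Q9Y2N7"),
  stringsAsFactors = TRUE
)
ppiensemble <- data.frame(
  protein1 = c("ENSP00000167825", "ENSP00000000233", "ENSP00000244303"),
  protein2 = c("ENSP00000355060", "ENSP00000244303", "ENSP00000020673"),
  stringsAsFactors = TRUE
)

# Named character vector: names are the keys (To), values are the targets (From)
map <- setNames(as.character(idconversiontable$From),
                as.character(idconversiontable$To))

# as.character() makes sure the lookup is by name, not by factor integer code;
# unname() drops the names so only the converted IDs are stored
ppiensemble$converted1 <- unname(map[as.character(ppiensemble$protein1)])
ppiensemble$converted2 <- unname(map[as.character(ppiensemble$protein2)])
```

If a key occurs more than once in idconversiontable$To, a lookup by name returns the first matching entry, so it may be worth checking for duplicated keys beforehand.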
Lookup by name 'just works' when looking up IDs that aren't present, e.g.,
map = setNames(c("a", "b"), c("A", "B"))
map[c("A", "C")]
## A <NA>
## "a" NA
or, as a slightly tidier (?) version that drops the names from the result,
unname(map[c("A", "C")])
## [1] "a" NA
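On the side question about progress messages: with a vectorized lookup the loop is no longer needed, but if a loop is kept for some other reason, the modulo operator %% can be used to report every 5000 iterations. This is only a sketch; n, report_every and reported are hypothetical names, and reported exists solely so the behaviour can be checked afterwards.

```r
n <- 20000                # stand-in for nrow(ppiensemble)
report_every <- 5000      # print a message after every 5000 rows
reported <- integer(0)    # records which iterations triggered a message

for (i in seq_len(n)) {
  # ... per-row work would go here ...
  if (i %% report_every == 0) {
    message(i, " rows done")
    reported <- c(reported, i)
  }
}
```

seq_len(n) is used instead of 1:n so the loop body is skipped entirely when n is 0.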