
Cross-referencing data frames without using for loops

I'm having an issue with the speed of using for loops to cross-reference two data frames. The overall aim is to identify rows in data frame 2 that lie between coordinates specified in data frame 1 (and meet other criteria), e.g. df1:

    chr     start       stop        strand
1   chr1    179324331   179327814   +
2   chr21   45176033    45182188    +
3   chr5    126887642   126890780   +
4   chr5    148730689   148734146   +

df2:

    chr     start       strand
1   chr1    179326331   +
2   chr21   45175033    +
3   chr5    126886642   +
4   chr5    148729689   +

My current code for this is:

    found_log <- data.frame()            # collect annotated rows here
    for (index in 1:nrow(df1)) {
      found_miRNAs <- ""
      curr_row <- df1[index, ]
      for (index2 in 1:nrow(df2)) {
        curr_target <- df2[index2, ]
        if (curr_row$chr == curr_target$chr &
            curr_row$start < curr_target$start &
            curr_row$stop > curr_target$start &
            curr_row$strand == curr_target$strand) {
          found_miRNAs <- paste(found_miRNAs, curr_target$start, sep = ":")
        }
      }
      curr_row$miRNAs <- found_miRNAs
      found_log <- rbind(found_log, curr_row)
    }

My actual data frames are 400 lines for df1 and more than 100,000 lines for df2, and I am hoping to do 500 iterations, so, as you can imagine, this is unworkably slow. I'm relatively new to R, so any hints about functions that might improve efficiency would be great.

Maybe not fast enough, but probably faster and a lot easier to read:

df1 <- data.frame(foo=letters[1:5], start=c(1,3,4,6,2), end=c(4,5,5,9,4))
df2 <- data.frame(foo=letters[1:5], start=c(3,2,5,4,1))
where <- sapply(df2$start, function (x) which(x >= df1$start & x <= df1$end))

This will give you a list of the relevant rows in df1 for each row in df2. I just tried it with 500 rows in df1 and 50,000 in df2; it finished in a second or two.

To add criteria, change the inner function within sapply. If you then want to put where into your second data frame, you could do, e.g.,

df2$matching_rows <- sapply(where, paste, collapse=":")

But you probably want to keep it as a list, which is a natural data structure for it.

Actually, you can even have a list column in the data frame:

df2$matching_rows <- where

though this is quite unusual.
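Applied to the genomic example in the question, the inner function just picks up the extra chromosome and strand tests. A sketch, with the column names (chr, start, stop, strand) taken from df1 and df2 as posted:

```r
df1 <- data.frame(chr    = c("chr1", "chr21", "chr5", "chr5"),
                  start  = c(179324331, 45176033, 126887642, 148730689),
                  stop   = c(179327814, 45182188, 126890780, 148734146),
                  strand = "+")
df2 <- data.frame(chr    = c("chr1", "chr21", "chr5", "chr5"),
                  start  = c(179326331, 45175033, 126886642, 148729689),
                  strand = "+")

##  For each range in df1, which rows of df2 fall inside it (same chr
##  and strand, position strictly between start and stop)?
hits <- sapply(seq_len(nrow(df1)), function(i)
  which(df2$chr    == df1$chr[i]    &
        df2$strand == df1$strand[i] &
        df2$start  >  df1$start[i]  &
        df2$start  <  df1$stop[i]))

##  Collapse the matching positions into the ":"-separated string the
##  original loop was building:
df1$miRNAs <- sapply(hits, function(idx) paste(df2$start[idx], collapse = ":"))
```

With the four example rows above, only the first range gets a hit (179326331); the other three positions fall just outside their ranges, so those entries are empty strings.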

You've run into two of the most common mistakes people make when coming to R from another programming language: using for loops instead of vector-based operations, and dynamically appending to a data object. I'd suggest that as you get more fluent you take some time to read Patrick Burns' The R Inferno; it provides some interesting insight into these and other problems.

As @David Arenburg and @zx8754 have pointed out in the comments above there are specialized packages that can solve the problem, and the data.table package and @David's approach can be very efficient for larger datasets. But for your case base R can do what you need it to very efficiently as well. I'll document one approach here, with a few more steps than necessary for clarity, just in case you're interested:

set.seed(1001)

ranges <- data.frame(beg=rnorm(400))
ranges$end <- ranges$beg + 0.005

test <- data.frame(value=rnorm(100000))
##  Add an ID field for duplicate removal:
test$ID <- 1:nrow(test)


##  This is where you'd set your criteria.  The apply() function is just 
##      a wrapper for a for() loop over the rows in the ranges data.frame:
out <- apply(ranges, MARGIN = 1, function(x) test[ (x[1] < test$value & x[2] > test$value), "ID"])

selected <- unlist(out)
selected <- unique( selected )

selection <- test[ selected, ]
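If you instead want each range annotated with its matching IDs, like the miRNAs column in your original loop, you can collapse out per range. A sketch, continuing from the code above; it assumes out came back as a list, which it will whenever the ranges match different numbers of rows:

```r
##  One ":"-separated string of matching IDs per range:
ranges$IDs <- sapply(out, paste, collapse = ":")
```

Ranges with no matches simply get an empty string.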
