Make a function using apply, stringr, stringi, and rbind run faster

The background: I'll provide both the biological context for this code and the programming background; hopefully both help. I do computational genomics work. Yep - just another biologist posing as a computer scientist. I am working on a script that will let me integrate a number of data sets at each position in the human genome. This translates to a dataframe of over 3 billion rows by 12 columns. As a test dataset, I'm building my analysis pipeline on the yeast genome, which generates a dataframe of about 25 million rows by 12 columns.

The problem: My current code works fine, but is brutally slow. For example, I started my pipeline 45 minutes ago, and it's about 1/3 of the way through the yeast genome. This means it will likely take 135 minutes to finish one yeast sample, or roughly 270 hours for one human sample (the human genome is about 120 times larger than the yeast genome)... now multiply that by the 90 human samples I am preparing to analyze, and you can hopefully see my problem. I need to speed this up. I will be parallelizing this, but even then I think the code itself is too clunky. I want help making my existing function much, much faster. Please do not tell me I need to parallelize it (that will get a downvote).

Example data:

chrom <- c("chr1", "chr1", "chr1", "chr1")
start <- c("0","1","2","6")
stop <- c("1","2","6","7")
sequence <- c("a", "t", "tcag", "a")
seqData <- data.frame(chrom, start, stop, sequence)

Example output:

chrom_out <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1")
start_out <- c("0", "1", "2", "3", "4", "5", "6")
stop_out <- c("1", "2", "3", "4", "5", "6", "7")
sequence_out <- c("a", "t", "t", "c", "a", "g", "a")
out_seqdata <- data.frame(chrom_out, start_out, stop_out, sequence_out)

Current Code:

library(dplyr)
library(stringi)
library(stringr)

# wl ("wide to long"): expand one row of interval data into one row per base
wl <- function(x) {
  len <- stri_length(x["sequence"])
  if (len == 1) {
    # Single-base interval: return the row as a one-row data frame
    tmpdf <- data.frame(tmpseq   = x["sequence"],
                        tmpstart = as.numeric(x["start"]),
                        tmpstop  = as.numeric(x["stop"]),
                        tmpchrom = x["chrom"])
  } else {
    # Multi-base interval: split the sequence into single characters
    # and compute per-base coordinates from the interval start
    tmpseq   <- strsplit(x["sequence"], "(?<=.{1})", perl = TRUE)[[1]]
    tmpstart <- as.numeric(x["start"]) + 0:(len - 1)
    tmpstop  <- as.numeric(x["start"]) + 1:len
    tmpdf    <- data.frame(tmpseq, tmpstart, tmpstop, tmpchrom = x["chrom"])
  }
  tmpdf
}

Explanation of code: I use apply to iterate over each row of a dataframe. The dataframe is a list of coordinates and the genomic sequence for those coordinates: chrom is the chromosome, start is the starting position on the chromosome, stop is the stopping position, and sequence is the actual sequence. The data is currently in a compressed format, exemplified by the third row of data. I want to expand this data so that each genomic letter becomes its own row, with the coordinate range adjusted appropriately. The function wl (short for wide-to-long) performs this. It first determines the string length of the sequence. If the length is equal to 1, it returns that row as a dataframe without further manipulation; otherwise it breaks the string up into individual characters, determines the coordinates for each character, and returns the resulting dataframe. The result is a list of dataframes, which I then rbind together to produce the example output data.
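The invocation itself isn't shown above, but based on this description it is presumably along these lines:

# Reconstructed from the description above; names are illustrative
result_list <- apply(seqData, 1, wl)        # one data frame per input row
out_seqdata <- do.call(rbind, result_list)  # bind into the long format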

What I need: I'm going to chunk the genome, creating a list, which will allow me to parallelize over that list. The chunks will result in a series of dataframes ~25 million rows in length. I'm going to parallelize across multiple samples too. Parallelization within a parallelization... sounds like a great way to crash a cluster. I know how to do this (both write the code and crash a cluster). What I need help with is making the actual function faster: 25 million rows still takes a long time to process with my current function. Any ideas would be greatly appreciated. Please edit my function or recommend a new approach - all ideas are welcome. I'm not aware of faster methods, other than adding more horsepower.

You can vectorise all your operations:

# Generate vector of start positions
# Goes from 0 (minimal position in given data) to maximum base position in chromosome
foo <- 0:max(as.numeric(as.character(seqData$start)))
# Split sequence into a character vector
bar <- unlist(strsplit(as.character(seqData$sequence), ""))
# Generate final data frame
data.frame(start = foo, end = foo + 1, seq = bar)
#   start end seq
# 1     0   1   a
# 2     1   2   t
# 3     2   3   t
# 4     3   4   c
# 5     4   5   a
# 6     5   6   g
# 7     6   7   a

You can use this code to iterate over the chromosomes one at a time.
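Note that this assumes each chromosome is covered contiguously from position 0, as in the example data. If your real data can contain gaps between intervals, a fully vectorised variant that derives every base's position from its own row's start might look like this (a sketch, using only the example's columns):

# Sketch for possibly non-contiguous input; the example data happens to have no gaps
n   <- nchar(as.character(seqData$sequence))                    # bases per interval
pos <- rep(as.numeric(as.character(seqData$start)), n) + sequence(n) - 1
data.frame(chrom = rep(seqData$chrom, n),
           start = pos,
           end   = pos + 1,
           seq   = unlist(strsplit(as.character(seqData$sequence), "")))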

A custom function and an easily parallelizable foreach loop might look like this:

wl <- function(data, chr) {
    startPos <- 0:max(as.numeric(as.character(data$start)))
    nucs     <- unlist(strsplit(as.character(data$sequence), ""))
    data.frame(chr, start = startPos, end = startPos + 1, seq = nucs)
}
library(foreach)
# swap %do% for %dopar% to run the chromosomes in parallel
foreach(i = unique(seqData$chrom), .combine = rbind) %do% {
    wl(subset(seqData, chrom == i), i)
}
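Note that %dopar% only runs in parallel once a backend is registered; one common choice (an assumption, the answer doesn't prescribe one) is doParallel:

# Assumed setup for %dopar%; any foreach-compatible backend works
library(doParallel)
registerDoParallel(cores = 4)  # illustrative core count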

PS: I would never store genome coordinates as a character vector. Also, creating the end column is just a waste of space, since you know it is always offset by 1 from start.
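Applied to the vectorised code above, that advice would mean something like (a sketch):

# Integer positions, no redundant end column (end is always start + 1)
foo <- 0:max(as.integer(as.character(seqData$start)))
data.frame(start = foo, seq = unlist(strsplit(as.character(seqData$sequence), "")))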
