
Converting R loops to functional form with apply

I've written some R code to parse strings, count occurrences of substrings, and then populate a table of substring counts. It works fine, but it's really slow on the actual data I'm using (which is quite large), and I know a lot of that is because I'm using loops rather than functions from the apply family. I've been trying to get this code into functional form without any luck. Can anyone help? My biggest issue is that I can't figure out a way to use the column names to match values within an apply construct. Here's the code with some toy data:

#Create toy data, list of unique substrings
code_frame<-matrix(c(c('a|a|b|c|d'),c('a|b|b|c|c'),c('a|b|c|d|d')),nrow=3,ncol=1)   
all_codes_list<-c('a','b','c','d')

#create data frame with a column for each code and a row for each job
code_count<-as.data.frame(matrix(0, ncol = length(all_codes_list), nrow = nrow(code_frame)))
colnames(code_count)<-all_codes_list

#fill in the code_count data frame with entries where codes occur
for(i in 1:nrow(code_frame)){
    test_string<-strsplit(code_frame[i,1],split="|",fixed=TRUE)[[1]]
    for(j in test_string){
        for(g in 1:ncol(code_count)){
            if(j == all_codes_list[g]){
                code_count[i,g]<-code_count[i,g]+1
                }
            }
        }
    }

Thanks.

A one-liner, split into 3 lines:

do.call(rbind,
        lapply(strsplit(code_frame[,1], "|", fixed=TRUE),
               function(x) table(factor(x, levels=all_codes_list))))

Note that strsplit is vectorised, so you don't need the outer loop over all rows. Your inner loops are basically counting the occurrences of each code, which is a job for table. Finally, do.call(rbind, *) is the standard idiom for combining a list of rows into a single object; here the result is a matrix of counts, which as.data.frame will convert if you need a data frame.
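To make the three ideas in that explanation easier to follow, here is the same one-liner unrolled into named steps. This is only a sketch on the toy data from the question; the intermediate names parts, rows and counts are introduced here for illustration and are not from the original answer.

```r
# Toy data from the question
code_frame <- matrix(c("a|a|b|c|d", "a|b|b|c|c", "a|b|c|d|d"), nrow = 3, ncol = 1)
all_codes_list <- c("a", "b", "c", "d")

# Step 1: strsplit is vectorised, so one call splits every row at once
# and returns a list with one character vector per row.
parts <- strsplit(code_frame[, 1], "|", fixed = TRUE)

# Step 2: table() counts the codes in each vector. Converting to a factor
# with fixed levels keeps a zero entry for codes that do not occur in a row.
rows <- lapply(parts, function(x) table(factor(x, levels = all_codes_list)))

# Step 3: bind the per-row counts together into one matrix.
counts <- do.call(rbind, rows)

# Optional: convert to a data frame, as in the question's code_count.
code_count <- as.data.frame(counts)
print(code_count)
```

Fixing the factor levels in step 2 is what guarantees that every row has the same columns, so the rbind in step 3 lines up.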

The qdap package has a tool called mtabulate that's perfect for this; it should be very fast and needs little code:

library(qdap)    
mtabulate(strsplit(code_frame, "\\|"))

##   a b c d
## 1 2 1 1 1
## 2 1 2 2 0
## 3 1 1 1 2

Basically it takes a list of vectors (the output of strsplit) and makes a row of tabulated counts for each vector.
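If qdap isn't available, the same "one row of counts per vector" idea can be approximated in base R with a single call to table. This is a sketch of the idea only, not qdap's actual implementation; the names parts, row_id and codes are made up for the example, and unlike mtabulate it takes the set of codes from all_codes_list rather than inferring it from the data.

```r
# Toy data from the question
code_frame <- matrix(c("a|a|b|c|d", "a|b|b|c|c", "a|b|c|d|d"), nrow = 3, ncol = 1)
all_codes_list <- c("a", "b", "c", "d")

parts <- strsplit(code_frame[, 1], "|", fixed = TRUE)

# Label every split-out code with the index of the row it came from
row_id <- rep(seq_along(parts), lengths(parts))

# Flatten all codes into one factor with a fixed set of levels
codes <- factor(unlist(parts), levels = all_codes_list)

# A two-way table of row index against code gives the count matrix directly
counts <- table(row = row_id, code = codes)

# as.data.frame.matrix keeps the wide layout (one column per code)
print(as.data.frame.matrix(counts))
```

Because everything is flattened first, there is no per-row function call at all, which may matter on large inputs.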

EDIT: If speed truly is your thing, here are the benchmarks on 1000 replications (microbenchmark package on a Win 7 machine):

Unit: microseconds
     expr      min       lq   median       uq      max neval
   HONG()  592.458  620.448  632.111  644.706 4650.560  1000
  TYLER()  324.220  342.413  351.743  361.073 3556.613  1000
 HENRIK() 1527.329 1560.450 1578.177 1614.331 4828.297  1000

And visual output: (image not included)
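For anyone wanting to rerun a comparison like this, a setup along the following lines should work. It is a sketch under assumptions: the post does not show the definitions of HONG(), TYLER() and HENRIK(), so the wrappers below simply guess that each one wraps the corresponding answer on this page. Timings will differ by machine and package version.

```r
# Toy data from the question
code_frame <- matrix(c("a|a|b|c|d", "a|b|b|c|c", "a|b|c|d|d"), nrow = 3, ncol = 1)
all_codes_list <- c("a", "b", "c", "d")

# Assumed wrapper: the do.call/lapply/table answer
HONG <- function() {
  do.call(rbind,
          lapply(strsplit(code_frame[, 1], "|", fixed = TRUE),
                 function(x) table(factor(x, levels = all_codes_list))))
}

# Assumed wrapper: the base read.table/apply answer
HENRIK <- function() {
  df <- read.table(text = code_frame, sep = "|")
  t(apply(df, 1, function(x) table(factor(x, levels = letters[1:4]))))
}

exprs <- list(HONG = quote(HONG()), HENRIK = quote(HENRIK()))

# Assumed wrapper: the qdap answer, only added if qdap is installed
if (requireNamespace("qdap", quietly = TRUE)) {
  suppressPackageStartupMessages(library(qdap))
  TYLER <- function() mtabulate(strsplit(code_frame, "\\|"))
  exprs$TYLER <- quote(TYLER())
}

# Run the benchmark only if microbenchmark is installed
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(list = exprs, times = 1000L))
}
```

On toy data this small, the timings mostly reflect fixed per-call overhead, so it is worth repeating the comparison on data closer to the real size.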

A base alternative:

df <- read.table(text = code_frame, sep = "|")

tt <- apply(df, 1, function(x){
  x2 <- factor(x, levels = letters[1:4])
  table(x2)
  })

t(tt) 

#      a b c d
# [1,] 2 1 1 1
# [2,] 1 2 2 0
# [3,] 1 1 1 2
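On the question's specific sticking point, matching by column name inside an apply-style call, one option is to iterate over the code names themselves rather than over column positions. This is a sketch that is not taken from any of the answers above; count_one_code and parts are hypothetical names used only here.

```r
# Toy data from the question
code_frame <- matrix(c("a|a|b|c|d", "a|b|b|c|c", "a|b|c|d|d"), nrow = 3, ncol = 1)
all_codes_list <- c("a", "b", "c", "d")

parts <- strsplit(code_frame[, 1], "|", fixed = TRUE)

# For a single code, count how often it appears in each row
count_one_code <- function(code) {
  vapply(parts, function(x) sum(x == code), integer(1))
}

# sapply over a character vector names the result columns after its elements,
# so each column is matched to its code by name rather than by position
code_count <- as.data.frame(sapply(all_codes_list, count_one_code))
print(code_count)
```

This does one pass over the data per code, so it is likely to be slower than the table-based answers when there are many distinct codes.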


 