Fastest way to split strings into fixed-length elements in R

How to split a string into elements of fixed length in R is a commonly asked question, and typical answers rely either on substring(x) or on strsplit(x, split = "") followed by paste(y, collapse = ""). For instance, specifying a fixed length of 3 characters would split the string "azertyuiop" into "aze", "rty", "uio", "p".
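For reference, the substring() route mentioned above can be sketched as follows. This is only a minimal illustration, not code from the question: the name substringSplit is hypothetical, and the function assumes a single non-empty string.

```r
# Baseline sketch: slice one string using vectorised start/stop positions.
# substringSplit is a hypothetical name; assumes a single non-empty string.
substringSplit <- function(string, size) {
  n <- nchar(string)
  starts <- seq(1, n, by = size)        # 1, size + 1, 2 * size + 1, ...
  stops <- pmin(starts + size - 1, n)   # clamp the last piece to the string end
  substring(string, starts, stops)
}

substringSplit("azertyuiop", 3)
# [1] "aze" "rty" "uio" "p"
```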

I'm looking for the fastest way possible. After some testing with long strings (> 1000 chars), I have found that substring() is way too slow. The strategy is hence to split the string into individual characters, and then paste them back into groups of the desired length, by applying some cleverness.

Here is the fastest function I could come up with. The idea is to split the string into individual characters, intersperse a separator in the character vector at the right positions, collapse the characters (and separators) back into a string, then split the string again, this time on the separator.

splitInParts <- function(string, size) {              # can process a vector of strings; "size" is the length of the desired substrings
    chars <- strsplit(string, "", fixed = TRUE)
    lengths <- nchar(string)
    nFullGroups <- floor(lengths / size)              # the number of complete substrings of the desired size

    # Prepare a vector of separators (commas), which will be overwritten by the characters except at the positions that must separate groups of length "size". Assumes that the string doesn't contain any commas.
    seps <- Map(rep, ",", lengths + nFullGroups)      # longer than the chars vector, because it also holds one separator per complete group
    indices <- Map(seq, 1, lengths + nFullGroups)     # all positions of the padded vector
    indices <- lapply(indices, function(x) which(x %% (size + 1) != 0))  # drop the positions where the separators must stay (I haven't found a better way to generate such a vector of indices)

    temp <- function(x, y, z) {                       # the replacement step, wrapped in a function so that Map() can call it below
        x[y] <- z
        x
    }
    res <- Map(temp, seps, indices, chars)            # now we have the characters with separators interspersed
    res <- sapply(res, paste, collapse = "", USE.NAMES = FALSE)  # collapse the characters and separators into one string per input
    strsplit(res, ",", fixed = TRUE)                  # at last, split the strings into elements of the desired length
}
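To make the intermediate steps concrete, here is a short trace of the same idea on a single string. It is a simplified sketch for one input rather than the vectorised function; the variable names (template, keep, joined, parts) are illustrative only.

```r
# Step-by-step trace of the separator trick on one string (illustrative names).
x <- "azertyuiop"
size <- 3

chars <- strsplit(x, "", fixed = TRUE)[[1]]          # 10 single characters
nFullGroups <- length(chars) %/% size                # 3 complete groups
template <- rep(",", length(chars) + nFullGroups)    # 13 commas
keep <- seq_along(template) %% (size + 1) != 0       # FALSE at positions 4, 8, 12
template[keep] <- chars                              # overwrite everything but the separators
joined <- paste(template, collapse = "")             # "aze,rty,uio,p"
parts <- strsplit(joined, ",", fixed = TRUE)[[1]]
parts
# [1] "aze" "rty" "uio" "p"
```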

This looks quite tedious, but I have tried simply putting the chars vector into a matrix with the adequate number of rows, then collapsing the matrix columns with apply(mat, 2, paste, collapse=""). This is MUCH slower. And splitting the character vector with split() into a list of vectors of the right length, so as to collapse the elements, is even slower.
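The two slower alternatives described above might look roughly like this. These are reconstructions under assumptions, not the code that was actually timed: the names matrixSplit and groupSplit are hypothetical, each handles a single string, and the padding with empty strings is one possible way to make the matrix rectangular.

```r
# Sketch of the matrix approach: one column per group, collapsed with apply().
# matrixSplit is a hypothetical name; handles a single string.
matrixSplit <- function(string, size) {
  chars <- strsplit(string, "", fixed = TRUE)[[1]]
  pad <- (-length(chars)) %% size                 # empty strings needed to fill the last column
  mat <- matrix(c(chars, rep("", pad)), nrow = size)
  apply(mat, 2, paste, collapse = "")
}

# Sketch of the split() approach: group the characters, then collapse each group.
# groupSplit is a hypothetical name; handles a single string.
groupSplit <- function(string, size) {
  chars <- strsplit(string, "", fixed = TRUE)[[1]]
  groups <- ceiling(seq_along(chars) / size)      # 1 1 1 2 2 2 3 3 3 4 ...
  vapply(split(chars, groups), paste, character(1),
         collapse = "", USE.NAMES = FALSE)
}

matrixSplit("azertyuiop", 3)
groupSplit("azertyuiop", 3)
```

Both calls should return the same four pieces as the function above; the difference reported in the question is only in speed.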

So if you can find something faster, let me know. If not, well, my function may be of some use. :)

It was fun reading the updates, so I benchmarked:

> nchar(mystring)
[1] 260000

My idea was nearly the same as @akrun's (str_extract_all uses the same function under the hood, IIRC):

library(stringr)
tensiSplit <- function(string,size) {
  str_extract_all(string, paste0('.{1,',size,'}'))
}
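The same greedy pattern, .{1,size}, can also be tried without any package. The following base-R sketch is an assumption-laden illustration (baseRegexSplit is a hypothetical name, and it was not part of the benchmark below), shown only to make clear what the regex does: each match takes up to size characters, so the last match picks up whatever is left.

```r
# Base-R sketch of the ".{1,size}" extraction (not benchmarked in this thread).
# baseRegexSplit is a hypothetical name.
baseRegexSplit <- function(string, size) {
  pattern <- paste0(".{1,", size, "}")            # greedy: up to `size` characters per match
  regmatches(string, gregexpr(pattern, string))   # one character vector per input string
}

baseRegexSplit("azertyuiop", 3)
# [[1]]
# [1] "aze" "rty" "uio" "p"
```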

And the results on my machine:

> microbenchmark(splitInParts(mystring,3),akrunSplit(mystring,3),splitInParts2(mystring,3),tensiSplit(mystring,3),gsubSplit(mystring,3),times=3)
Unit: milliseconds
                       expr        min         lq       mean     median         uq        max neval
  splitInParts(mystring, 3)   64.80683   64.83033   64.92800   64.85384   64.98858   65.12332     3
    akrunSplit(mystring, 3) 4309.19807 4315.29134 4330.40417 4321.38461 4341.00722 4360.62983     3
 splitInParts2(mystring, 3)   21.73150   21.73829   21.90200   21.74507   21.98725   22.22942     3
    tensiSplit(mystring, 3)   21.80367   21.85201   21.93754   21.90035   22.00447   22.10859     3
     gsubSplit(mystring, 3)   53.90416   54.28191   54.55416   54.65966   54.87915   55.09865     3

We can split by specifying a regex lookbehind that matches the position preceded by n characters. For example, to split every 3 characters, we match the position/boundary preceded by 3 characters: (?<=.{3}).

splitInParts <- function(string, size){
    pat <- paste0('(?<=.{',size,'})')
    strsplit(string, pat, perl=TRUE)
 }

splitInParts(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"  

splitInParts(str1, 4)
#[[1]]
#[1] "azer" "tyui" "op"  

splitInParts(str1, 5)
#[[1]]
#[1] "azert" "yuiop"

Another approach is to use stri_extract_all_regex from library(stringi).

library(stringi)
splitInParts2 <- function(string, size){
   pat <- paste0('.{1,', size, '}')
   stri_extract_all_regex(string, pat)
 }
splitInParts2(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"  

stri_extract_all_regex(str1, '.{1,3}')

Data

 str1 <- "azertyuiop"

Alright, there was a faster solution published here (d'oh!)

Simply

strsplit(gsub(paste0("([[:alnum:]]{", size, "})"), "\\1 ", string), " ", fixed = TRUE)

Here a space is used as the separator (I didn't think of [[:alnum:]]{}).
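Wrapped in a function, the gsub() one-liner might look like the sketch below. The name gsubSplit echoes the benchmark above, but the body is an assumption about what was timed; the sep argument is added here purely for illustration. Note the two limitations that follow from the pattern: only runs of alphanumeric characters are grouped, and the separator must not already occur in the string.

```r
# Sketch of the gsub() approach as a function.
# The body is an assumption; `sep` is an illustrative extra parameter.
gsubSplit <- function(string, size, sep = " ") {
  pattern <- paste0("([[:alnum:]]{", size, "})")      # a run of `size` alphanumerics
  marked <- gsub(pattern, paste0("\\1", sep), string) # append the separator to each run
  strsplit(marked, sep, fixed = TRUE)
}

gsubSplit("azertyuiop", 3)
# [[1]]
# [1] "aze" "rty" "uio" "p"

# No run of 3 alphanumerics here, so the string comes back unsplit:
gsubSplit("ab-cd-ef", 3)
# [[1]]
# [1] "ab-cd-ef"
```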

How can I mark my own question as a duplicate? :(
