Fastest way to split strings into fixed-length elements in R
How to split a string into elements of fixed length in R is a commonly asked question. Typical answers rely either on substring(x) or on strsplit(x, split = "") followed by paste(y, collapse = ""). For instance, one would split the string "azertyuiop" into "aze", "rty", "uio", "p" by specifying a fixed length of 3 characters.
I'm looking for the fastest way possible. After some testing with long strings (> 1000 chars), I have found that substring() is way too slow. The strategy is hence to split the string into individual characters, and then paste them back into groups of the desired length, by applying some cleverness.
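For reference, the substring() baseline mentioned above can be sketched as follows. This is only an illustration of the usual approach, not a benchmarked implementation; the variable names (starts, pieces) are made up for the example.

```r
# Baseline: compute the start position of every chunk, then cut with substring().
string <- "azertyuiop"
size <- 3

starts <- seq(1, nchar(string), by = size)            # 1, 4, 7, 10
pieces <- substring(string, starts, starts + size - 1) # the last chunk may be shorter
print(pieces)
```

substring() silently truncates the final chunk when the end position runs past the end of the string, so no special handling of the remainder is needed.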
Here is the fastest function I could come up with. The idea is to split the string into individual chars, then intersperse a separator in the character vector at the right positions, collapse the characters (and separators) back into a string, then split the string again, this time on the separator.
splitInParts <- function(string, size) {
  # Can process a vector of strings; "size" is the length of the desired substrings.
  chars <- strsplit(string, "", fixed = TRUE)
  lengths <- nchar(string)
  nFullGroups <- floor(lengths / size)  # number of complete substrings of the desired size
  # Prepare vectors of separators (commas) whose elements will be replaced by the
  # characters, except at the positions that must separate groups of length "size".
  # Assumes that the strings do not contain any commas.
  seps <- Map(rep, ",", lengths + nFullGroups)  # longer than chars: one extra slot per full group
  indices <- Map(seq, 1, lengths + nFullGroups)
  # Keep only the positions that receive characters, i.e. drop every (size + 1)-th slot
  # (I haven't found a better way to generate such a vector of indices).
  indices <- lapply(indices, function(x) which(x %% (size + 1) != 0))
  # Replacement helper used in the Map() call below.
  temp <- function(x, y, z) {
    x[y] <- z
    x
  }
  res <- Map(temp, seps, indices, chars)  # characters with separators interspersed
  res <- sapply(res, paste, collapse = "", USE.NAMES = FALSE)  # collapse characters and separators
  strsplit(res, ",", fixed = TRUE)  # finally, split into elements of the desired length
}
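To make the separator trick easier to follow, here is a step-by-step sketch on a single string. It mirrors the steps described above but is not the vectorised function itself; the names buffer, keep, joined and parts are made up for the illustration, and it assumes the string contains no commas.

```r
# Walk through the separator trick on one string (base R only).
string <- "azertyuiop"
size <- 3

chars <- strsplit(string, "", fixed = TRUE)[[1]]   # individual characters
nFullGroups <- length(chars) %/% size              # number of complete groups
total <- length(chars) + nFullGroups               # characters plus separators

buffer <- rep(",", total)                          # start with separators only
keep <- which(seq_len(total) %% (size + 1) != 0)   # slots that will hold characters
buffer[keep] <- chars                              # every (size + 1)-th slot stays ","

joined <- paste(buffer, collapse = "")             # "aze,rty,uio,p"
parts <- strsplit(joined, ",", fixed = TRUE)[[1]]
print(parts)
```

When the string length is an exact multiple of size, the collapsed string ends with a separator; strsplit() drops the resulting trailing empty element, so no extra clean-up is needed.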
This looks quite tedious, but I have tried to simply put the chars vector into a matrix with the adequate number of rows, then collapse the matrix columns with apply(mat, 2, paste, collapse=""). This is MUCH slower. And splitting the character vector with split() into a list of vectors of the right length, so as to collapse elements, is even slower.
So if you can find something faster, let me know. If not, well, my function may be of some use. :)
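For completeness, the two slower alternatives described above might look roughly like this. This is a sketch under assumptions: the padding with empty strings and the names padded, groups, viaMatrix and viaSplit are not from the original post.

```r
string <- "azertyuiop"
size <- 3
chars <- strsplit(string, "", fixed = TRUE)[[1]]

# Matrix variant: pad with "" so the characters fill complete columns,
# then collapse each column (one column per chunk).
nGroups <- ceiling(length(chars) / size)
padded <- c(chars, rep("", nGroups * size - length(chars)))
mat <- matrix(padded, nrow = size)
viaMatrix <- apply(mat, 2, paste, collapse = "")

# split() variant: label each character with its group number,
# split on that label, then collapse each group.
groups <- ceiling(seq_along(chars) / size)
viaSplit <- unname(sapply(split(chars, groups), paste, collapse = ""))

print(viaMatrix)
print(viaSplit)
```

Both variants call paste() once per chunk, which is a plausible reason for them being slower than a single paste() followed by a single strsplit().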
Was fun reading the updates, so I benchmarked:
> nchar(mystring)
[1] 260000
My idea was nearly the same as @akrun's, as str_extract_all uses the same function under the hood (IIRC):
library(stringr)
tensiSplit <- function(string, size) {
  str_extract_all(string, paste0('.{1,', size, '}'))
}
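The pattern '.{1,size}' greedily takes up to size characters at a time, so the last match simply picks up whatever is left. The same idea can be tried without any package by swapping in base R's gregexpr() and regmatches(); this is a different mechanism from the stringr call above and was not part of the benchmark.

```r
# Base-R sketch of the greedy ".{1,size}" extraction (no stringr needed).
string <- "azertyuiop"
size <- 3

pattern <- paste0(".{1,", size, "}")                     # ".{1,3}"
pieces <- regmatches(string, gregexpr(pattern, string))  # list, one element per input string
print(pieces[[1]])
```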
And the results on my machine:
> microbenchmark(splitInParts(mystring,3),akrunSplit(mystring,3),splitInParts2(mystring,3),tensiSplit(mystring,3),gsubSplit(mystring,3),times=3)
Unit: milliseconds
expr                              min         lq       mean     median         uq        max neval
splitInParts(mystring, 3)    64.80683   64.83033   64.92800   64.85384   64.98858   65.12332     3
akrunSplit(mystring, 3)    4309.19807 4315.29134 4330.40417 4321.38461 4341.00722 4360.62983     3
splitInParts2(mystring, 3)   21.73150   21.73829   21.90200   21.74507   21.98725   22.22942     3
tensiSplit(mystring, 3)      21.80367   21.85201   21.93754   21.90035   22.00447   22.10859     3
gsubSplit(mystring, 3)       53.90416   54.28191   54.55416   54.65966   54.87915   55.09865     3
We can split by specifying a regex lookbehind that matches the position preceded by 'n' characters. For example, if we are splitting by 3 characters, we match the position/boundary preceded by 3 characters ((?<=.{3})).
splitInParts <- function(string, size){
  pat <- paste0('(?<=.{', size, '})')
  strsplit(string, pat, perl=TRUE)
}
splitInParts(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
splitInParts(str1, 4)
#[[1]]
#[1] "azer" "tyui" "op"
splitInParts(str1, 5)
#[[1]]
#[1] "azert" "yuiop"
Another approach is to use stri_extract_all from library(stringi).
library(stringi)
splitInParts2 <- function(string, size){
  pat <- paste0('.{1,', size, '}')
  stri_extract_all_regex(string, pat)
}
splitInParts2(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
stri_extract_all_regex(str1, '.{1,3}')
# Data used in the examples above
str1 <- "azertyuiop"
Alright, there was a faster solution published here (d'oh!)
Simply
strsplit(gsub(paste0("([[:alnum:]]{", size, "})"), "\\1 ", string), " ", fixed = TRUE)
Here using a space as separator. (I didn't think about [[:alnum:]]{}.)
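Wrapped into a function, the gsub() idea could look like the sketch below. The name gsubSplitSketch and the sep argument are hypothetical (the benchmark above refers to a gsubSplit function whose definition is not shown). It assumes the input is purely alphanumeric and does not contain the separator.

```r
# Hypothetical wrapper around the gsub() + strsplit() idea.
# Assumes alphanumeric input that does not contain `sep`.
gsubSplitSketch <- function(string, size, sep = " ") {
  pattern <- paste0("([[:alnum:]]{", size, "})")       # e.g. "([[:alnum:]]{3})"
  marked <- gsub(pattern, paste0("\\1", sep), string)  # append sep after every full group
  strsplit(marked, sep, fixed = TRUE)
}

res <- gsubSplitSketch(c("azertyuiop", "abcdef"), 3)
print(res)
```

Characters outside [[:alnum:]] interrupt the grouping, so for arbitrary input a pattern such as "." would be needed instead, at the cost of also having to pick a separator that cannot occur in the data.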
How can I mark my own question as a duplicate? :(