Is there an alternative to the std::string substring?

Given a string s = "RADILAMIA", I want to take all the substrings of length 4 (or some other length).

If len == 4, the substrings are: "RADI", "ADIL", "DILA", "ILAM", "LAMI", "AMIA". It seems easy to do that with the std::string substr method:

vector<string> allSubstr(string s,int len) {
    vector<string>ans;
    // i+len<=s.size() avoids the unsigned underflow of s.size()-len when len > s.size()
    for(size_t i=0;i+len<=s.size();i++) {
        ans.push_back(s.substr(i,len));
    }
    return ans;
}

The time complexity of substr is unspecified, but it is generally linear in the length of the substring.

Can I do this without std::string substr? Each substring differs from the previous one by only one letter. Is there a better way to reduce the time complexity?

There can be millions of different approaches. Here is my algorithm.

vector<string> allSubstr(string s,int len) {

    vector<string>ans;
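    if (len <= 0 || s.size() < (size_t)len) return ans;  // nothing to extract; also avoids unsigned underflow below
    ans.reserve(s.size() - len + 1);  // there are exactly s.size() - len + 1 substrings
    for(size_t i=0;i<=s.size()-len;i++) 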
    {
        ans.emplace_back( s.begin() +i, s.begin() + i + len );
    }

    return ans;
}

It is tested. It doesn't matter much which approach you use, but emplace_back above can make a difference because it constructs each string directly in the vector instead of copying or moving a temporary. You can also add reserve for more performance.
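The effect of reserve can be checked directly. The following is a minimal sketch, not part of the original answer: the helper name buildsWithoutReallocation is hypothetical, and it only verifies that the vector's storage does not move while the substrings are emplaced.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): builds the substrings and reports
// whether the vector's storage stayed in place, i.e. no reallocation happened.
bool buildsWithoutReallocation(const std::string& s, std::size_t len) {
    std::vector<std::string> ans;
    if (len == 0 || len > s.size()) return true;  // nothing to build
    ans.reserve(s.size() - len + 1);              // room for every substring up front
    const std::string* storage = ans.data();      // remember where the storage lives
    for (std::size_t i = 0; i + len <= s.size(); ++i) {
        ans.emplace_back(s.begin() + i, s.begin() + i + len);  // construct in place
    }
    return ans.data() == storage;                 // unchanged pointer: no reallocation
}
```

Because the reserved capacity covers all s.size() - len + 1 elements, the vector never has to grow and move the strings it already holds.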

No matter what you do, you still need O(NL) time to write all your substrings into the vector.
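To make the O(NL) bound concrete, here is a small sketch that counts how many characters are copied when every substring is materialized. The function name charsCopied is an assumption made for illustration. With N = s.size() and L = len there are N - L + 1 substrings of L characters each.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: total number of characters copied when every
// substring of length len is materialized as its own std::string.
std::size_t charsCopied(const std::string& s, std::size_t len) {
    if (len == 0 || len > s.size()) return 0;
    std::size_t total = 0;
    for (std::size_t i = 0; i + len <= s.size(); ++i) {
        total += s.substr(i, len).size();  // each substring copies len characters
    }
    return total;  // equals (s.size() - len + 1) * len
}
```

For "RADILAMIA" and len == 4 this is 6 substrings of 4 characters, so 24 characters are written regardless of which copying method is used.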

The fastest approach would probably be:

vector<string> ans(s.size()-len+1);  // one slot per substring; assumes len <= s.size()
for(size_t i=0;i+len<=s.size();i++) {
    ans[i] = s.substr(i, len);
}

Because push_back is slowish and should generally be avoided where possible. It is overused.

PS: maybe this code would be even faster:

vector<string> ans(s.size()-len+1);  // one slot per substring; assumes len <= s.size()
for(size_t i=0;i+len<=s.size();i++) {
    ans[i].append(s.begin()+i, s.begin()+i+len);
}

string_view (C++17) has a constant-time substr:

vector<string_view> allSubstr(const string_view& s, int len) {
    vector<string_view> ans;
    if (len <= 0 || s.size() < (size_t)len) return ans;  // guard against unsigned underflow below
    ans.reserve(s.size() - len + 1);
    for (size_t i = 0; i + len <= s.size(); ++i) {
        ans.push_back(s.substr(i, len));
    }
    return ans;
}

Just make sure that s outlives the return value of the function.
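The lifetime requirement is the main pitfall here, so a short sketch may help. The names allSubstrViews and countViews are hypothetical, chosen to keep the example self-contained. Each view only points into the original characters, so the owning string must stay alive for as long as the views are used.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Sketch of the string_view approach with a guard for len > s.size().
std::vector<std::string_view> allSubstrViews(std::string_view s, std::size_t len) {
    std::vector<std::string_view> ans;
    if (len == 0 || len > s.size()) return ans;
    ans.reserve(s.size() - len + 1);
    for (std::size_t i = 0; i + len <= s.size(); ++i) {
        ans.push_back(s.substr(i, len));  // constant time: no characters are copied
    }
    return ans;
}

// Safe usage: `owner` outlives `views`, so every view stays valid.
std::size_t countViews() {
    std::string owner = "RADILAMIA";
    std::vector<std::string_view> views = allSubstrViews(owner, 4);
    return views.size();
}

// Unsafe usage (do not do this):
//   auto views = allSubstrViews(std::string("RADILAMIA"), 4);
// The temporary string is destroyed at the end of that statement,
// leaving every element of `views` dangling.
```

Consecutive views start one character apart in the same buffer, which is why building the whole vector costs O(N) rather than O(NL).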

You could probably use an array of chars instead. For example, given your word:

char s[] = "RADILAMIA";

To process all the necessary substrings, you can use an approach like this:

int substLength = 4;
int length = strlen(s);
char buffer[256];
for (int i = 0; i < length - substLength + 1; i++) {
    strncpy(buffer, s + i, substLength);
    buffer[substLength] = '\0';
    cout << buffer << endl;
}

With a char array, you can easily access the start of any substring by adding the necessary index to the beginning of the array.

It pays to revisit the docs.

// string proto(len);
// assumes 0 < len <= s.size(); one preallocated buffer per substring
vector<string> result(s.size()-len+1, string(len, char(32)));

const char *str=s.c_str();
const char* end=str+s.size()-len; // start of the last substring

for(size_t i=0; str<=end; str++, i++) {
  result[i].assign(str, len); // likely to result in a simple copy into the preallocated buffer
}

The complexity is the same, O(len*s.size()); one can only hope for a smaller constant factor.

C is not always faster than C++, but @Fomalhaut was right to post the performant core solution in C. Here is my complete version (a C program), based on his algorithm. It does not use strncpy, either.

Here it is on godbolt.

#ifdef __STDC_ALLOC_LIB__
#define __STDC_WANT_LIB_EXT2__ 1
#else
#define _POSIX_C_SOURCE 200809L
#endif

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <assert.h>

//////////////////////////////////////////////////////////////
// array of buffers == a_of_b
typedef struct a_of_b  {
    const unsigned size;
    unsigned count ;
    char ** data   ;
} a_of_b ;

a_of_b a_of_b_make ( const unsigned size_ ) 
{
   return (a_of_b){ .size = size_, .count = 0, .data = calloc(1, sizeof(char * [size_] ) ) } ;
}

a_of_b * a_of_b_append ( a_of_b * self,  const unsigned len_, const char str_[len_] ) 
{
    assert( self->data ) ;
    assert( self->size > self->count ) ;
    self->data[ self->count ] = strndup( str_, len_ ) ;
    self->count += 1;
    return self ;
}

a_of_b * a_of_b_print ( a_of_b * self , const char * fmt_ ) 
{
    for (unsigned j = 0; j < self->count; ++j)
         printf( fmt_ , self->data[j]);
    return self ;
}

a_of_b * a_of_b_free ( a_of_b * self  ) 
{
    for (unsigned j = 0; j < self->count; ++j)
         free( self->data[j]) ;
    free( self->data) ;
    self->count = 0  ;         
    return self ;
}
//////////////////////////////////////////////////////////////
a_of_b breakit ( const unsigned len_, const char input_[len_], const unsigned  substLength )
{
    assert( len_ > 2 ) ;
    assert( substLength > 0 ) ;
    assert( substLength < len_ ) ;

    const unsigned count_of_buffers = len_ - substLength + 1;

    a_of_b rez_ = a_of_b_make( count_of_buffers +1 ) ;

    for (unsigned i = 0; i < count_of_buffers ; i++) {
        a_of_b_append( &rez_, substLength, input_ + i ) ;
    }

   return rez_ ;
}
//////////////////////////////////////////////////////////////
static void driver( const char * input_, const unsigned substLength ) 
{
    printf("\n");
    a_of_b substrings = breakit( strlen(input_), input_, substLength );
    a_of_b_print( & substrings , "%s ");
    a_of_b_free( & substrings);
}
//////////////////////////////////////////////////////////////
int main () { 

    driver( "RADILAMIA", 4) ;
    driver( "RADILAMIA", 3) ;
    driver( "RADILAMIA", 2) ;
    driver( "RADILAMIA", 1) ;
    
    return EXIT_SUCCESS; 
}

And the program output is:

RADI ADIL DILA ILAM LAMI AMIA 

RAD ADI DIL ILA LAM AMI MIA 

RA AD DI IL LA AM MI IA 

R A D I L A M I A 

Enjoy.
