How to split a string into smaller strings in C++

Question

I'm having a little trouble understanding how to split a string into substrings. In my program, the user enters an arbitrary string into string stemp, and the program keeps any runs of doubled, tripled, etc. letters, and also keeps any alphabetically consecutive substrings.

For example, the strings I'd like to be split look like this:

string stemp = "AAABBCDE";
string stemp2 = "HHHHZAB";

I would like to be able to make substrings from the letters that are the same, such as "AAA" and "BB", and still keep any alphabetical strings following them, such as "CDE". In stemp2, I'd keep the "HHHH" but would not keep the "ZAB".

Please help point me in the right direction, I'm losing my mind.

Answer 1

This ended up being not such a simple parse. The reason is that you have to handle not only transitions from one sequence of duplicates to a new sequence of duplicates (e.g. "AAABB") and from a sequence of duplicates to a series of characters (e.g. "AAABCDE"), but also the transition back (e.g. "AAAHIJKLBBCDE") any number of times, as well as duplicates that run through the end of the string (e.g. "AAABCDEGG").

There are a fair number of caveats there. One approach is to handle parsing the string in a continual loop, advancing the index based on the number of duplicates in sequence or the number of characters in series. A basic outline would be:

    loop continually over indexes in string {
        while (sequence of duplicates) {
            extract duplicates substring
            advance index by no. of duplicates
        }
        while (characters in series in sort-order) {
            increment counter
            advance index
        }
        if (counter > 2) {
            extract series substring
        }
    }

Within each of those blocks you also need to handle end-of-string. With that in mind, and presuming the input is held in std::string s; and each extracted substring is stored in a std::vector<std::string> vs;, you could do something similar to the following, using std::basic_string::find_first_not_of to check for the sequence of duplicates for you:

    for (size_t i = 0; ;) {                 /* loop until string exhausted */
        bool dupsadded = false;             /* flag for whether duplicates found */
        size_t spos = 0, nchr = 0;          /* string position and number of chars */
        /* loop extracting duplicate characters */
        while ((spos = s.find_first_not_of (s[i], i)) && 
                /* duplicates do not extend to end of string */
                ((spos != std::string::npos && spos - i > 1) ||
                /* duplicates do extend to end of string */ 
                (s.substr(i).length() > 1 && spos == std::string::npos))) {
            if (spos != std::string::npos) {        /* handle not through end */
                nchr = spos - i;                    /* number of duplicate chars */
                vs.push_back (s.substr (i, nchr));  /* add to vector of substrings */
                i += nchr;                          /* increment index */
                dupsadded = true;                   /* set dupsadded flag */
                nchr = 0;                           /* zero nchr */
            }
            else {                                  /* duplicates to end of string */
                vs.push_back (s.substr (i));        /* add remaining substring */
                goto done;
            }
        }
        if (!i || dupsadded)                    /* 1st char or dups found */
            i += 1;                             /* advance past last dup as s[i-1] */
        while (s[i] && s[i-1] + 1 == s[i]) {    /* while characters in sequence */
            nchr += 1;                          /* increment char count */
            i += 1;                             /* increment index */
        }
        if (nchr > 1)       /* if nchr > 1 (3 in sequence) */
            vs.push_back (s.substr (i - nchr - 1, nchr + 1));
        else if (!s[i])     /* if at end */
            break;          /* break */
        else                /* otherwise */
            i += 1;         /* increment index */
    }
    done:;

Other than the use of .find_first_not_of(), the remainder of the loop just relies on good old arithmetic. There are many different ways to write this, but here, if the characters at the current position are not a sequence of duplicates, the ASCII values of adjacent characters are compared to determine whether a series of characters in sort order is present. See ASCII Table & Description.
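To make those two building blocks concrete, here is a minimal sketch that isolates them. The helper names run_length and in_series are hypothetical (they do not appear in the code above), and the sort-order test assumes an ASCII-compatible character set where the letters are contiguous.

```cpp
#include <cstddef>
#include <string>

// Length of the run of identical characters that starts at index i.
// Assumes i < s.size(). run_length is a hypothetical helper name.
std::size_t run_length (const std::string& s, std::size_t i)
{
    // position of the first character that differs from s[i]
    std::size_t spos = s.find_first_not_of (s[i], i);

    if (spos == std::string::npos)      // run extends to end of string
        return s.size() - i;

    return spos - i;                    // run stops before spos
}

// True when b immediately follows a in the character set ('C' then 'D').
// Assumes an ASCII-compatible encoding; in_series is a hypothetical name.
bool in_series (char a, char b)
{
    return a + 1 == b;
}
```

With s = "AAABBCDE", run_length (s, 0) is 3 and run_length (s, 3) is 2, while in_series ('C', 'D') is true and in_series ('Z', 'A') is false.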

The transition from a sequence of duplicates to a series in sort order was particularly problematic. The sort-order test compares s[i-1] + 1 == s[i], which would compare against the last duplicate character unless the index is advanced by 1 first, so that s[i-1] is the first character after the sequence of duplicates. (It's not anything magic; it just depends on how you compare adjacent characters while protecting against end-of-string at the same time.) In short, the arithmetic requires special attention to handle that transition.

Putting a short example together, you could do:

#include <iostream>
#include <string>
#include <vector>

int main (void) {
    
    std::string s{};                        /* string for user input */
    std::vector<std::string> vs{};          /* vector of string to hold substrings */
    
    std::cout << "enter string: ";
    if (!(std::cin >> s)) {
        std::cout << "(user canceled input)\n";
        return 0;
    }
    if (s.length() < 2) {   /* validate at least 2 characters */
        std::cerr << "error: must have more than 1 character.\n";
        return 1;
    }
    
    for (size_t i = 0; ;) {                 /* loop until string exhausted */
        bool dupsadded = false;             /* flag for whether duplicates found */
        size_t spos = 0, nchr = 0;          /* string position and number of chars */
        /* loop extracting duplicate characters */
        while ((spos = s.find_first_not_of (s[i], i)) && 
                /* duplicates do not extend to end of string */
                ((spos != std::string::npos && spos - i > 1) ||
                /* duplicates do extend to end of string */ 
                (s.substr(i).length() > 1 && spos == std::string::npos))) {
            if (spos != std::string::npos) {        /* handle not through end */
                nchr = spos - i;                    /* number of duplicate chars */
                vs.push_back (s.substr (i, nchr));  /* add to vector of substrings */
                i += nchr;                          /* increment index */
                dupsadded = true;                   /* set dupsadded flag */
                nchr = 0;                           /* zero nchr */
            }
            else {                                  /* duplicates to end of string */
                vs.push_back (s.substr (i));        /* add remaining substring */
                goto done;
            }
        }
        if (!i || dupsadded)                    /* 1st char or dups found */
            i += 1;                             /* advance past last dup as s[i-1] */
        while (s[i] && s[i-1] + 1 == s[i]) {    /* while characters in sequence */
            nchr += 1;                          /* increment char count */
            i += 1;                             /* increment index */
        }
        if (nchr > 1)       /* if nchr > 1 (3 in sequence) */
            vs.push_back (s.substr (i - nchr - 1, nchr + 1));
        else if (!s[i])     /* if at end */
            break;          /* break */
        else                /* otherwise */
            i += 1;         /* increment index */
    }
    done:;
    
    for (const auto &ss : vs)               /* output results */
        std::cout << ss << '\n';
}

(Note: there is a lot there; this isn't something you can skim over and understand. Take a pencil and a piece of paper, write out the input string, and work through each iteration, tracking the value of i (the index) by placing a mark below the current character and noting the value of spos returned from .find_first_not_of(), the value of nchr, and the values used to extract the characters with .substr(). That's probably the best way to understand what is happening at each point, similar to a conversation with the duck from How to debug small programs.)
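If you would rather let the program draw the marks for you, a small tracing helper along these lines could stand in for the pencil. This is only a sketch: trace_line is a hypothetical name, not part of the program above, and it simply formats the values you pass in.

```cpp
#include <cstddef>
#include <string>

// Hypothetical tracing helper: returns the input string, a line with a
// caret under index i, and a line showing i, spos and nchr.
std::string trace_line (const std::string& s, std::size_t i,
                        std::size_t spos, std::size_t nchr)
{
    std::string out = s + '\n';
    out += std::string (i, ' ') + "^\n";        // mark the current index

    out += "i=" + std::to_string (i) + " spos=";
    if (spos == std::string::npos)              // no differing char found
        out += "npos";
    else
        out += std::to_string (spos);
    out += " nchr=" + std::to_string (nchr) + '\n';

    return out;
}
```

Writing std::cerr << trace_line (s, i, spos, nchr); just after the duplicate-extraction loop would print one snapshot per pass of the outer loop.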

Example Use/Output

Your first example:

$ ./bin/str_substr_dup_or_seq
enter string: AAABBCDE
AAA
BB
CDE

Your second example:

$ ./bin/str_substr_dup_or_seq
enter string: HHHHZAB
HHHH

An extension testing additional transitions:

$ ./bin/str_substr_dup_or_seq
enter string: AAAHIJKLBBCDE
AAA
HIJKL
BB
CDE

With a sequence of duplicates at the end:

$ ./bin/str_substr_dup_or_seq
enter string: AAABCDEGG
AAA
BCDE
GG

With a sequence of duplicates with the same character as the last in series:

$ ./bin/str_substr_dup_or_seq
enter string: AAABCDEEE
AAA
BCDE
EE

(There is a slight ambiguity about whether you want "BCD" and "EEE" instead; both would satisfy your constraints. Further implementation to change the behavior is left to you.)
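For the other reading, where a character that belongs to a run of duplicates is never also consumed by a series, one possible sketch is below. It is a different, simpler scan than the loop above, not a patch to it. The function name split_prefer_dups is hypothetical, and the sort-order test again assumes an ASCII-compatible character set.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical variant: duplicates take priority over series, so a series
// stops before any character that starts a run of duplicates.
std::vector<std::string> split_prefer_dups (const std::string& s)
{
    std::vector<std::string> out;
    std::size_t i = 0;

    while (i < s.size()) {
        // measure the run of identical characters starting at i
        std::size_t spos = s.find_first_not_of (s[i], i);
        std::size_t run = (spos == std::string::npos ? s.size() : spos) - i;

        if (run > 1) {                      // 2 or more: keep as duplicates
            out.push_back (s.substr (i, run));
            i += run;
            continue;
        }

        // grow a series while the next char is one greater than the last
        // and is not itself the start of a run of duplicates
        std::size_t j = i + 1;
        while (j < s.size() && s[j - 1] + 1 == s[j] &&
               (j + 1 == s.size() || s[j + 1] != s[j]))
            ++j;

        if (j - i >= 3)                     // keep series of 3 or more
            out.push_back (s.substr (i, j - i));
        i = j;                              // resume after the series
    }
    return out;
}
```

With this version "AAABCDEEE" yields AAA, BCD, EEE, while "AAABBCDE" still yields AAA, BB, CDE and "HHHHZAB" still yields only HHHH.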

Or a bit more of a challenge with "ISHKABIBBLE" (Yiddish for nonsense) inserted within "AAAHIJKLBBCDE" with another trailing "IE" added to the end:

$ ./bin/str_substr_dup_or_seq
enter string: AAAHIJKLISHKABIBBLEBBCDEIE
AAA
HIJKL
BB
BB
CDE

Look things over and let me know if you have further questions.

1 answers

solution1
2 2020-12-03 03:33:10
