
C++ Tokenize a string with spaces and quotes

I would like to write something in C++ that tokenizes a string. For the sake of clarity, consider the following string:

add string "this is a string with spaces!"

This must be split as follows:

add
string
this is a string with spaces!

Is there a quick and standard-library-based approach?

No library is needed. A simple loop can do the job (if the input is as simple as you describe).

string str = "add string \"this is a string with space!\"";

for( size_t i=0; i<str.length(); i++){

    char c = str[i];
    if( c == ' ' ){
        cout << endl;
    }else if(c == '\"' ){
        i++;
        // Also stop at the end of the string, in case the closing quote is missing.
        while( i < str.length() && str[i] != '\"' ){ cout << str[i]; i++; }
    }else{
        cout << c;
    }
}

This outputs:

add
string
this is a string with space!

I wonder why this simple, C++-style solution has not been presented here. It relies on the fact that if we first split the string on \" , then every even chunk is "inside" quotes, and every odd chunk should additionally be split on whitespace.

Because std::getline does all the scanning, there is no manual indexing and so no possibility of out_of_range or similar errors.

unsigned counter = 0;
std::string segment;
std::stringstream stream_input(input);
while(std::getline(stream_input, segment, '\"'))
{
    ++counter;
    if (counter % 2 == 0)
    {
        if (!segment.empty())
            std::cout << segment << std::endl;
    }
    else
    {
        std::stringstream stream_segment(segment);
        while(std::getline(stream_segment, segment, ' '))
            if (!segment.empty())
                std::cout << segment << std::endl;
    }
}
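The same even/odd idea can be wrapped in a function that returns the tokens instead of printing them. This is only a sketch: the name split_quoted is made up for illustration, and unlike the snippet above it splits the unquoted chunks with operator>> , so any whitespace (not just ' ') separates words. As above, empty quoted strings are dropped, and an unterminated quote is treated as running to the end of the input.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original answer): splits on '"' first,
// then splits only the chunks that lie outside quotes on whitespace.
std::vector<std::string> split_quoted(const std::string& input)
{
    std::vector<std::string> tokens;
    std::istringstream stream_input(input);
    std::string chunk;
    bool inside_quotes = false;  // the first chunk is always outside quotes

    while (std::getline(stream_input, chunk, '"'))
    {
        if (inside_quotes)
        {
            // Quoted chunk: keep it verbatim, spaces included.
            if (!chunk.empty())
                tokens.push_back(chunk);
        }
        else
        {
            // Unquoted chunk: operator>> skips runs of whitespace for us.
            std::istringstream stream_chunk(chunk);
            std::string word;
            while (stream_chunk >> word)
                tokens.push_back(word);
        }
        inside_quotes = !inside_quotes;  // chunks alternate outside/inside
    }
    return tokens;
}
```

Calling split_quoted("add string \"this is a string with spaces!\"") should give the three tokens from the question.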

Here is a complete function for it. Modify it as needed; it appends each part of the string to a vector of strings ( qargs ).

void split_in_args(std::vector<std::string>& qargs, std::string command){
        int len = command.length();
        bool qot = false, sqot = false;
        int arglen;
        for(int i = 0; i < len; i++) {
                int start = i;
                if(command[i] == '\"') {
                        qot = true;
                }
                else if(command[i] == '\'') sqot = true;

                if(qot) {
                        i++;
                        start++;
                        while(i<len && command[i] != '\"')
                                i++;
                        if(i<len)
                                qot = false;
                        arglen = i-start;
                        i++;
                }
                else if(sqot) {
                        i++;
                        start++;
                        while(i<len && command[i] != '\'')
                                i++;
                        if(i<len)
                                sqot = false;
                        arglen = i-start;
                        i++;
                }
                else{
                        while(i<len && command[i]!=' ')
                                i++;
                        arglen = i-start;
                }
                qargs.push_back(command.substr(start, arglen));
        }
        for(size_t i=0;i<qargs.size();i++){
                std::cout<<qargs[i]<<std::endl;
        }
        std::cout<<qargs.size();
        if(qot || sqot) std::cout<<"One of the quotes is open\n";
}

The Boost library has a tokenizer class that can accept an escaped_list_separator . Combined, these look like they might provide what you are looking for.

Here are links to the boost documentation, current as of this post and almost certainly an old version by the time you read this.

https://www.boost.org/doc/libs/1_73_0/libs/tokenizer/doc/tokenizer.htm

https://www.boost.org/doc/libs/1_73_0/libs/tokenizer/doc/escaped_list_separator.htm

This example is stolen from the boost documentation. Forgive me for not creating my own example.

// simple_example_2.cpp
#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "Field 1,\"putting quotes around fields, allows commas\",Field 3";
   tokenizer<escaped_list_separator<char> > tok(s);
   for(tokenizer<escaped_list_separator<char> >::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

I would define a class Token to read a single token from a stream.

Using it in your code then becomes trivial.

#include <iostream>
#include <string>

int main()
{
    // Simply read the tokens from the stream.
    Token   t;
    while(std::cin >> t)
    {
        std::cout << "Got: " << t << "\n";
    }
}

Stream objects like this are very easy to write:

class Token
{
    // Just something to store the value in.
    std::string     value;

    // Then define the input and output operators.
    friend std::ostream& operator<<(std::ostream& str, Token const& output)
    {
        return str << output.value;
    }

    // Input is slightly harder than output.
    // but not that difficult to get correct.
    friend std::istream& operator>>(std::istream& str, Token& input)
    {
        std::string tmp;
        if (str >> tmp)
        {
            if (tmp[0] != '"')
            {
                // We read a word that did not start with
                // a quote mark. So we are done. Simply put
                // it in the destination.
                input.value = std::move(tmp);
            }
            else if (tmp.size() > 1 && tmp.back() == '"')
            {
                // We read a word with both an opening and a closing
                // quote, so just knock these off. The size check keeps
                // a lone quote mark from matching itself.
                input.value = tmp.substr(1, tmp.size() - 2);
            }
            else
            {
                // We read a word that has a quote at the start but
                // not at the end. So we need to get all the characters
                // up to the closing quote, then add them to the value.
                std::string tail;
                if (std::getline(str, tail, '"'))
                {
                    // Everything worked
                    // update the input
                    input.value = tmp.substr(1) + tail;
                }
            }
        }
        return str;
    }
};

There is a standard-library-based approach in C++14 or later. But it is not quick.

#include <iomanip> // quoted
#include <iostream>
#include <sstream> // stringstream
#include <string>

using namespace std;

int main(int argc, char **argv) {
    string str = "add string \"this is a string with spaces!\"";
    stringstream ss(str);
    string word;
    while (ss >> quoted(word)) {
        cout << word << endl;
    }
    return 0;
}
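If the tokens are needed as data rather than printed, the same std::quoted loop can fill a vector. This is a sketch under that assumption; the function name tokenize_quoted is hypothetical. Note that std::quoted also honours backslash escapes inside the quotes, and only treats a quote as special when it is the first character of a token.

```cpp
#include <iomanip>  // std::quoted (C++14)
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects the tokens read via std::quoted into a vector.
std::vector<std::string> tokenize_quoted(const std::string& input)
{
    std::vector<std::string> tokens;
    std::istringstream stream(input);
    std::string word;

    // For each token: if it starts with '"', std::quoted reads up to the
    // matching unescaped '"' and strips the quotes; otherwise it behaves
    // like a plain `stream >> word`.
    while (stream >> std::quoted(word))
        tokens.push_back(word);

    return tokens;
}
```

For example, the input say "a \"b\" c" should yield the two tokens say and a "b" c , because the escaped quotes are unescaped while reading.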

I don't think the standard library offers a direct way to do this, but the following algorithm will work:

a) Search for the opening '\"' with string::find('\"') . If one is found, search for the closing '\"' with string::find('\"', prevIndex + 1) . If that is found too, extract the text between them with string::substr() and discard that part from the original string.

b) Now search for the ' ' character in the same way.

NOTE: you have to iterate through the whole string.
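A possible sketch of this find/substr idea is shown here. It is a variation rather than a literal transcription: instead of removing the quoted parts first, it walks the string once from left to right so the tokens keep their original order. The name tokenize_find is made up for illustration, and an unterminated quote is assumed to run to the end of the string.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: tokenizes using only find / find_first_not_of / substr.
std::vector<std::string> tokenize_find(const std::string& input)
{
    const std::string::size_type npos = std::string::npos;
    std::vector<std::string> tokens;
    std::string::size_type pos = 0;

    // Skip leading spaces; stop when nothing but spaces is left.
    while ((pos = input.find_first_not_of(' ', pos)) != npos)
    {
        const bool quoted = (input[pos] == '"');
        if (quoted)
            ++pos;  // step over the opening quote

        // A quoted token ends at the next quote, a bare one at the next space.
        const std::string::size_type end = input.find(quoted ? '"' : ' ', pos);

        // If no terminator exists, take everything up to the end of the string.
        tokens.push_back(input.substr(pos, end == npos ? npos : end - pos));
        if (end == npos)
            break;

        // Continue after the closing quote, or at the space that ended the word.
        pos = quoted ? end + 1 : end;
    }
    return tokens;
}
```

Applied to the string from the question, this should produce add , string and this is a string with spaces! in that order.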

Here is my solution. It is modeled on Python's shlex; shlex_join() is the inverse of shlex_split():

#include <cctype>
#include <iomanip>
#include <iostream>
#include <string>
#include <sstream>
#include <utility>
#include <vector>

// Splits the given string using POSIX shell-like syntax.
std::vector<std::string> shlex_split(const std::string& s)
{
  std::vector<std::string> result;

  std::string token;
  char quote{};
  bool escape{false};

  for (char c : s)
  {
    if (escape)
    {
      escape = false;
      if (quote && c != '\\' && c != quote)
        token += '\\';
      token += c;
    }
    else if (c == '\\')
    {
      escape = true;
    }
    else if (!quote && (c == '\'' || c == '\"'))
    {
      quote = c;
    }
    else if (quote && c == quote)
    {
      quote = '\0';
      if (token.empty())
        result.emplace_back();
    }
    else if (!std::isspace(static_cast<unsigned char>(c)) || quote)
    {
      token += c;
    }
    else if (!token.empty())
    {
      result.push_back(std::move(token));
      token.clear();
    }
  }

  if (!token.empty())
  {
    result.push_back(std::move(token));
    token.clear();
  }

  return result;
}

// Concatenates the given token list into a string. This function is the
// inverse of shlex_split().
std::string shlex_join(const std::vector<std::string>& tokens)
{
  auto it = tokens.begin();
  if (it == tokens.end())
    return {};

  std::ostringstream oss;
  while (true)
  {
    if (it->empty() || it->find_first_of(R"( "\)") != std::string::npos)
      oss << std::quoted(*it);
    else
      oss << *it;

    if (++it != tokens.end())
      oss << ' ';
    else
      break;
  }
  return oss.str();
}

void test(const std::string& s, const char* expected = nullptr)
{
  if (!expected)
    expected = s.c_str();
  if (auto r = shlex_join(shlex_split(s)); r != expected)
    std::cerr << '[' << s << "] -> [" << r << "], expected [" << expected << "]\n";
}

int main()
{
  test("");
  test(" ", "");
  test("a");
  test(" a ", "a");
  test("a   b", "a b");
  test(R"(a \s b)", "a s b");
  test(R"("a a" b)");
  test(R"('a a' b)", R"("a a" b)");
  test(R"(a \" b)", R"(a "\"" b)");
  test(R"(a \\ b)", R"(a "\\" b)");

  test(R"("a \" a" b)");
  test(R"('a \' a' b)", R"("a ' a" b)");
  test(R"("a \\ a" b)");
  test(R"('a \\ a' b)", R"("a \\ a" b)");
  test(R"('a \s a' b)", R"("a \\s a" b)");
  test(R"("a \s a" b)", R"("a \\s a" b)");
  test(R"('a \" a' b)", R"("a \\\" a" b)");
  test(R"("a \' a" b)", R"("a \\' a" b)");

  test(R"("" a)");
  test(R"('' a)", R"("" a)");
  test(R"(a "")");
  test(R"(a '')", R"(a "")");
}
