
Splitting string with multiple delimiters, allowing quoted values

The docs for boost::escaped_list_separator provide the following explanation for the second parameter, c:

Any character in the string c, is considered to be a separator.

So I need to split a string on multiple separators while allowing quoted values, which may themselves contain those separators:

#include <iostream>
#include <string>

#include <boost/tokenizer.hpp>

int main() {
    std::wstring str = L"2   , 14   33  50   \"AAA BBB\"";

    std::wstring escSep(L"\\"); //escape character
    std::wstring delim(L" \t\r\n,"); //split on spaces, tabs, new lines, commas
    std::wstring quotes(L"\""); //allow double-quoted values with delimiters within

    boost::escaped_list_separator<wchar_t> separator(escSep, delim, quotes);
    boost::tokenizer<boost::escaped_list_separator<wchar_t>, std::wstring::const_iterator, std::wstring> tok(str, separator);

    for(auto beg=tok.begin(); beg!=tok.end();++beg)
        std::wcout << *beg << std::endl;

    return 0;
}

The expected result would be [2; 14; 33; 50; AAA BBB]. However, this code results in a bunch of empty tokens between the actual values.

A regular boost::char_separator takes all the delimiters into account and omits these empty tokens. boost::escaped_list_separator also seems to honor all the specified delimiters, but it produces empty values. Is it true that it produces empty tokens whenever multiple consecutive delimiters are encountered? Is there any way to avoid this?

If the extra tokens are always empty, it is easy to test the resulting values and omit them manually, but that can get pretty ugly. For example, imagine strings that each hold 2 actual values, possibly separated by many tabs and spaces. Specifying the delimiters as L"\t " (i.e. space and tab) will work, but it produces a ton of empty tokens.

Judging by the Boost Tokenizer documentation, you are correct: when boost::escaped_list_separator encounters multiple consecutive delimiters, it produces empty tokens. Unlike boost::char_separator, boost::escaped_list_separator does not provide any constructor that lets you choose whether to keep or discard the empty tokens it produces.

While having the option to discard empty tokens can be nice, keeping them makes perfect sense when you consider the use case presented in the documentation (parsing CSV files, see http://www.boost.org/doc/libs/1_64_0/libs/tokenizer/escaped_list_separator.htm). An empty field is still a field.

One option is to simply discard empty tokens after tokenizing. If the generation of empty tokens concerns you, an alternative is to remove repeated delimiters from the string before passing it to the tokenizer, but you will need to take care not to remove anything inside quotes.
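
Both options can be sketched with the standard library alone. The helper names dropEmptyTokens and collapseDelimiters below are hypothetical (they are not part of Boost), and the choice of a comma as the single replacement delimiter is an assumption; treat this as a minimal illustration rather than a definitive implementation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (option 1): copies every non-empty token from any
// range that can be iterated with a range-based for loop.
template <typename TokenRange>
std::vector<std::wstring> dropEmptyTokens(const TokenRange& tokens) {
    std::vector<std::wstring> result;
    for (const auto& token : tokens) {
        if (!token.empty()) result.push_back(token);
    }
    return result;
}

// Hypothetical helper (option 2): replaces each run of delimiter characters
// outside quotes with one `canonical` delimiter and drops runs at both ends.
// Text inside quotes and escaped characters are copied unchanged.
std::wstring collapseDelimiters(const std::wstring& input,
                                const std::wstring& delims,
                                wchar_t quote = L'"',
                                wchar_t escape = L'\\',
                                wchar_t canonical = L',') {
    std::wstring output;
    bool inQuotes = false;
    bool pendingDelimiter = false;  // a delimiter run follows earlier content
    for (std::size_t i = 0; i < input.size(); ++i) {
        const wchar_t ch = input[i];
        if (!inQuotes && delims.find(ch) != std::wstring::npos) {
            pendingDelimiter = !output.empty();  // ignore leading runs
            continue;
        }
        if (pendingDelimiter) {
            output += canonical;  // emit one delimiter for the whole run
            pendingDelimiter = false;
        }
        output += ch;
        if (ch == escape && i + 1 < input.size()) {
            output += input[++i];  // copy the escaped character verbatim
        } else if (ch == quote) {
            inQuotes = !inQuotes;
        }
    }
    return output;
}
```

With the tokenizer from the question, dropEmptyTokens(tok) should leave only the five expected values, since boost::tokenizer exposes begin() and end(). Note that it would also drop a deliberately empty quoted value such as "". Alternatively, the result of collapseDelimiters(str, delim) can be tokenized instead of str, provided the canonical delimiter is one of the characters in delim.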
