Most elegant and efficient way to remove disallowed characters from a C++ string?

Question

I am using C++11 and am wondering what the most elegant way is to process an existing C++ string so that it contains only the valid characters below. Efficiency is also a concern, but I am looking for elegance foremost.

"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-";

Thank you, Virgil.

Answer 1

Here's my go:

void removeDisallowed(std::string& in) {
    static const std::string allowed = "01234...";
    in.erase(
        std::remove_if(in.begin(), in.end(), [&](const char c) {
            return allowed.find(c) == std::string::npos;
        }),
        in.end());
}

If you want to make it more efficient, you could make a set:

std::unordered_set<char> allowedSet(allowed.begin(), allowed.end());

And change the check to:

return !allowedSet.count(c);

[Update] Based on a lot of good comments and answers, I'd suggest just writing a:

template <typename F>
void erase_if(std::string& in, F func) {
    // Pass in.end() as well: the single-iterator erase() removes only one character.
    in.erase(std::remove_if(in.begin(), in.end(), func), in.end());
}

And then actually trying to run it with all the various proposed funcs to see which one works best for your use case. This won't work with Dietmar's answer, so you'll have to try that one separately, but they're probably all worth a shot.
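As a rough sketch of how such a comparison could be set up, the helper below times one predicate on a copy of the input. The names erase_matching and timed_erase_if are made up for illustration (erase_matching is the same erase-remove helper under a different name, chosen to stay clear of the std::erase_if that C++20 later added). A single run like this is only an approximation; a real benchmark would repeat the measurement many times.

```cpp
#include <algorithm>
#include <chrono>
#include <string>
#include <utility>

// Erase-remove helper (hypothetical name). Note the second argument to erase().
template <typename F>
void erase_matching(std::string& in, F func) {
    in.erase(std::remove_if(in.begin(), in.end(), func), in.end());
}

// Hypothetical helper: filters a copy of `text` with `pred` and returns the
// filtered string together with the elapsed time in microseconds.
template <typename F>
std::pair<std::string, long long> timed_erase_if(std::string text, F pred) {
    const auto start = std::chrono::steady_clock::now();
    erase_matching(text, pred);
    const auto stop = std::chrono::steady_clock::now();
    const auto us =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    return std::make_pair(std::move(text), static_cast<long long>(us));
}
```

Calling timed_erase_if(text, pred) with each candidate predicate gives one timing per predicate for the same input, and the returned string lets you check that all predicates agree on the result.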

Answer 2

It seems the most elegant approach would be the use of a regular expression (note the enclosing square brackets):

std::regex const filter("[^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-]");
str = std::regex_replace(str, filter, "");
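For completeness, here is a minimal self-contained sketch of the same idea wrapped in a function. The function name strip_disallowed_regex is an assumption for illustration; the character class is the one from the snippet above, where the trailing - is taken literally because it is the last character inside the brackets.

```cpp
#include <regex>
#include <string>

// Returns a copy of `str` with every character outside the allowed set removed.
// (Hypothetical helper name; the pattern is the one shown above.)
std::string strip_disallowed_regex(const std::string& str) {
    // static: the regex is compiled once, on the first call.
    static const std::regex filter(
        "[^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-]");
    return std::regex_replace(str, filter, "");
}
```

Keeping the regex object static avoids rebuilding it on every call, which corresponds to the "prebuild" variant in the benchmark below.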

Following the comments about performance, I cooked up a quick benchmark, which is checked in on GitHub. It compares some of the suggestions. Here is a summary of the results, run on a macOS notebook with recent versions of gcc and clang using high optimization options. The numbers shown are the times taken in μs to process a lengthy text document:

benchmark                         gcc      clang
regex (build)                    186131    552697
regex (prebuild)                 177959    566353
use_remove_if_str_find            44802     40644
use_remove_if_find                88377    123237
use_remove_if_binary_search       54091     64065
use_remove_if_ctype               13818     12901
use_remove_if_hash                81341     58582
use_remove_if_table                9033     10203

The first two benchmarks use the regex approach posted above, while the others use Barry's std::remove_if() with different implementations of the predicate inside the lambda. To clarify the names, here is an outline of what each one does (inside a lambda, combined with erase() as needed, etc.):

regex (build): text = std::regex_replace(text, std::regex("[^" + allowed + "]"), "")
regex (prebuild): text = std::regex_replace(text, filter, "") (building the regex is outside the timing)
remove_if str find: std::remove_if(... a.find(c))
remove_if find: std::remove_if(... std::find(a.begin(), a.end(), c) == a.end())
remove_if binary_search: std::remove_if(... std::binary_search(a.begin(), a.end(), c))
remove_if ctype: std::remove_if(... isalnum(c) || c == '-' || c == '_')
remove_if hash: std::remove_if(... unordered_set.count(c))
remove_if table: std::remove_if(... table[c])

For details, have a look at the source.
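The ctype variant is the only one in the outline that does not appear as code elsewhere on this page, so here is a minimal sketch of what it might look like. The function name keep_allowed_ctype is made up, and this is not necessarily what the benchmark source does. It assumes the default "C" locale, in which std::isalnum accepts exactly 0-9, A-Z and a-z; other locales may accept more characters.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: keeps alphanumerics, '-' and '_', removes everything else.
void keep_allowed_ctype(std::string& text) {
    text.erase(
        std::remove_if(text.begin(), text.end(), [](char c) {
            // Cast first: passing a negative char value to std::isalnum is
            // undefined behaviour.
            const unsigned char uc = static_cast<unsigned char>(c);
            return !(std::isalnum(uc) || c == '-' || c == '_');
        }),
        text.end());
}
```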

Answer 3

This may be simplistic, but I would consider using a constant-time lookup table that fits in a few cache lines.

void remove_disallowed(std::string &str)
{
    static const char disallowed[] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
    };
    str.erase(std::remove_if(str.begin(), str.end(), [&](char c) {
        return disallowed[static_cast<unsigned char>(c)];
    }), str.end());
}

Answer 4

#include <array>
#include <string>
#include <limits>
#include <iostream>
#include <algorithm>
#include <unordered_set>

void keep_chars_in_set(std::string &s, const std::unordered_set<char> &chars) {
    s.erase(
        std::remove_if(s.begin(), s.end(), [&chars](const char c) {
            return !chars.count(c);
        }),
        s.end());
}

void keep_sorted_chars(std::string &s, const std::string &sorted_chars) {
    s.erase(
        std::remove_if(s.begin(), s.end(), [&sorted_chars](const char c) {
            return !std::binary_search(sorted_chars.begin(), sorted_chars.end(), c);
        }),
        s.end());
}

// 256 entries, one per possible unsigned char value (max() is 255, so add 1).
using lookup_table = std::array<bool, std::numeric_limits<unsigned char>::max() + 1>;

lookup_table make_lookup_table(const std::string &s) {
    lookup_table t = {};
    for (auto c : s) {
        // Index via unsigned char so that negative char values stay in range.
        t[static_cast<unsigned char>(c)] = true;
    }
    return t;
}

void keep_chars_in_lookup_table(std::string &s, const lookup_table &table) {
    s.erase(
        std::remove_if(s.begin(), s.end(), [&table](const char c) {
            return !table[static_cast<unsigned char>(c)];
        }),
        s.end());
}

int main() {
    using namespace std;

    string s1 = "abcdefxabc";
    string s2 = "abcdefyabc";
    string s3 = "abcdefzabc";

    const unordered_set<char> set_of_chars = {'a', 'b', 'c', 'd', 'e', 'f'};
    keep_chars_in_set(s1, set_of_chars);
    cout << s1 << endl;

    keep_sorted_chars(s2, "abcdef");
    cout << s2 << endl;

    const lookup_table &char_lookup_table = make_lookup_table("abcdef");
    keep_chars_in_lookup_table(s3, char_lookup_table);
    cout << s3 << endl;
}

Notes:

binary_search should be faster than find: O(lg N) vs O(N) per lookup, so the full solutions are O(M lg N) vs O(M * N), where M is the length of the string and N the number of allowed characters. It requires the allowed characters to be sorted.
unordered_set is not a contiguous data structure, so even though lookup is O(1) on average (O(M) for the full solution), it may not be cache friendly; you should profile.
The lookup table method should be the fastest: it is both cache friendly and has the lowest complexity (O(M)).
This draws from several other answers.

4 answers

solution1
9 2014-11-07 19:26:38

solution2
7 2014-11-07 19:52:42

solution3
6 2014-11-07 20:08:18

solution4
2 2014-11-07 19:58:14
