I am using C++11 and am wondering what the most elegant way is to process an existing C++ string so that it contains only the valid characters below. Efficiency is also a concern, but I am looking for elegance foremost.
"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-";
Thank you, Virgil.
Here's my go:
void removeDisallowed(std::string& in) {
    static const std::string allowed = "01234...";
    in.erase(
        std::remove_if(in.begin(), in.end(), [&](const char c) {
            return allowed.find(c) == std::string::npos;
        }),
        in.end());
}
If you want to make it more efficient, you could make a set:
std::unordered_set<char> allowedSet(allowed.begin(), allowed.end());
And change the check to:
return !allowedSet.count(c);
[Update] Based on a lot of good comments and answers, I'd suggest just writing a:
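template <typename F>
void erase_if(std::string& in, F func) {
    // Pass in.end() as the second argument; the one-argument erase()
    // would remove only a single character.
    in.erase(std::remove_if(in.begin(), in.end(), func), in.end());
}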
And then actually trying to run it with all the various proposed funcs to see which one works best for your use case. This won't work with Dietmar's answer, so you'll have to try that one separately, but they're probably all worth a shot.
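As a rough sketch of how such a comparison might be set up, the helper below is called with two of the predicates discussed in this thread. The names allowed_chars, keep_with_find and keep_with_ctype are made up for illustration and do not come from any of the answers; the ctype variant also assumes the default "C" locale.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Generic helper: erase every character for which func returns true.
template <typename F>
void erase_if(std::string& in, F func) {
    in.erase(std::remove_if(in.begin(), in.end(), func), in.end());
}

// Hypothetical helper returning the whitelist from the question.
inline const std::string& allowed_chars() {
    static const std::string allowed =
        "0123456789abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ_-";
    return allowed;
}

// Predicate 1: linear search in the whitelist string.
inline void keep_with_find(std::string& s) {
    // Qualified call so that std::erase_if (C++20) is never considered.
    ::erase_if(s, [](char c) {
        return allowed_chars().find(c) == std::string::npos;
    });
}

// Predicate 2: <cctype> classification (assumes the default "C" locale).
inline void keep_with_ctype(std::string& s) {
    ::erase_if(s, [](char c) {
        // std::isalnum needs a value representable as unsigned char.
        const unsigned char uc = static_cast<unsigned char>(c);
        return !(std::isalnum(uc) || c == '-' || c == '_');
    });
}
```

Both wrappers take the string by reference and filter it in place, so they can be swapped freely when timing them on the same input.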
It seems the most elegant approach would be the use of a regular expression (note the enclosing square brackets):
std::regex const filter("[^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-]");
str = std::regex_replace(str, filter, "");
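For completeness, here is a minimal self-contained sketch of the same idea wrapped in a function. The name strip_with_regex is hypothetical; the pattern is the one shown above, and the regex is kept in a function-local static so it is constructed only once.

```cpp
#include <regex>
#include <string>

// Hypothetical wrapper: returns a copy of str with every character
// outside the whitelist removed.
inline std::string strip_with_regex(const std::string& str) {
    // The leading '^' negates the class; the trailing '-' is a literal dash.
    static const std::regex filter(
        "[^0123456789abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ_-]");
    return std::regex_replace(str, filter, "");
}
```

Unlike the remove_if versions, this returns a new string rather than editing in place, so a caller would write something like str = strip_with_regex(str).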
Following the comments about performance, I cooked up a quick benchmark, which is checked in on GitHub. It compares some of the suggestions. Here is a summary of the results, run on a MacOS notebook with recent versions of gcc and clang using high optimization options. The numbers shown are the times taken in μs to process a lengthy text document:
benchmark                       gcc      clang
regex (build)                   186131   552697
regex (prebuild)                177959   566353
use_remove_if_str_find           44802    40644
use_remove_if_find               88377   123237
use_remove_if_binary_search      54091    64065
use_remove_if_ctype              13818    12901
use_remove_if_hash               81341    58582
use_remove_if_table               9033    10203
The first two benchmarks use the regex approach posted above, while the others use Barry's std::remove_if() with different ways of implementing the predicate inside the lambda. To clarify the names, here is an outline of what each one does (inside a lambda, combined with erase() as needed, etc.):
regex (build): text = std::regex_replace(text, std::regex("[^" + allowed + "]"), "")
regex (prebuild): text = std::regex_replace(text, filter, "") (building the regex is outside the timing)
use_remove_if_str_find: std::remove_if(... a.find(c))
use_remove_if_find: std::remove_if(... std::find(a.begin(), a.end(), c) == a.end())
use_remove_if_binary_search: std::remove_if(... std::binary_search(a.begin(), a.end(), c))
use_remove_if_ctype: std::remove_if(... isalnum(c) || c == '-' || c == '_')
use_remove_if_hash: std::remove_if(... unordered_set.count(c))
use_remove_if_table: std::remove_if(... table[c])
For details, have a look at the source.
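If you want to reproduce this kind of measurement yourself, a timing harness can be as small as the sketch below. This is not the benchmark code referred to above; time_us is a made-up helper, and a single untimed-warm-up-free run like this is only a rough indication.

```cpp
#include <algorithm>
#include <chrono>
#include <string>

// Hypothetical harness: runs filter(text) once and returns the elapsed
// wall-clock time in microseconds.
template <typename Filter>
long long time_us(Filter filter, std::string& text) {
    const auto start = std::chrono::steady_clock::now();
    filter(text);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(
               stop - start).count();
}

// One candidate to time: remove_if with the whitelist string's own find().
inline void use_remove_if_str_find(std::string& text) {
    static const std::string allowed =
        "0123456789abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ_-";
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](char c) {
                                  return allowed.find(c) == std::string::npos;
                              }),
               text.end());
}
```

A call such as time_us(use_remove_if_str_find, text) filters text in place and returns the duration; for meaningful numbers you would repeat the run on fresh copies of a large input and compare medians.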
This may be simplistic, but I would consider using a constant-time lookup table that fits in a few cache lines.
void remove_disallowed(std::string &str)
{
    // One entry per unsigned char value: 1 = remove, 0 = keep.
    static const char disallowed[] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
    };
    str.erase(std::remove_if(str.begin(), str.end(), [&](char c) {
        return disallowed[static_cast<unsigned char>(c)];
    }), str.end());
}
#include <array>
#include <string>
#include <limits>
#include <iostream>
#include <algorithm>
#include <unordered_set>
void keep_chars_in_set(std::string &s, const std::unordered_set<char> &chars) {
    s.erase(
        std::remove_if(s.begin(), s.end(), [&chars](const char c) {
            return !chars.count(c);
        }),
        s.end());
}
void keep_sorted_chars(std::string &s, const std::string &sorted_chars) {
    s.erase(
        std::remove_if(s.begin(), s.end(), [&sorted_chars](const char c) {
            return !std::binary_search(sorted_chars.begin(), sorted_chars.end(), c);
        }),
        s.end());
}
// One slot per possible unsigned char value, hence max() + 1 entries.
using lookup_table = std::array<bool, std::numeric_limits<unsigned char>::max() + 1>;
lookup_table make_lookup_table(const std::string &s) {
    lookup_table t = {};
    for (auto c : s) {
        // Index via unsigned char so a negative char cannot go out of bounds.
        t[static_cast<unsigned char>(c)] = true;
    }
    return t;
}
void keep_chars_in_lookup_table(std::string &s, const lookup_table &table) {
    s.erase(
        std::remove_if(s.begin(), s.end(), [&table](const char c) {
            return !table[static_cast<unsigned char>(c)];
        }),
        s.end());
}
int main() {
    using namespace std;
    string s1 = "abcdefxabc";
    string s2 = "abcdefyabc";
    string s3 = "abcdefzabc";
    const unordered_set<char> set_of_chars = {'a', 'b', 'c', 'd', 'e', 'f'};
    keep_chars_in_set(s1, set_of_chars);
    cout << s1 << endl;
    keep_sorted_chars(s2, "abcdef");
    cout << s2 << endl;
    const lookup_table &char_lookup_table = make_lookup_table("abcdef");
    keep_chars_in_lookup_table(s3, char_lookup_table);
    cout << s3 << endl;
}
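Notes:
binary_search should be faster than find: O(lg N) vs O(N) per lookup, so the full solutions are O(M lg N) vs O(M * N), where M is the length of the input string and N is the number of allowed characters. Keep in mind that binary_search requires the allowed characters to be sorted.
unordered_set is not a contiguous data structure, so even if lookup is O(1) on average (making the full solution O(M)), it may not be cache friendly, and hence you should profile.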
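One practical caveat for the binary_search variant: the whitelist as written in the question is not in ascending character order (in ASCII, '-' sorts before the digits and '_' sits between the upper-case and lower-case letters), so it has to be sorted first. A minimal sketch, where sorted_copy is a hypothetical helper name:

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns the whitelist sorted by character value,
// which std::binary_search requires.
inline std::string sorted_copy(std::string chars) {
    std::sort(chars.begin(), chars.end());
    return chars;
}

// Keeps only the characters found in sorted_chars (must already be sorted).
inline void keep_sorted_chars(std::string& s, const std::string& sorted_chars) {
    s.erase(
        std::remove_if(s.begin(), s.end(), [&sorted_chars](const char c) {
            return !std::binary_search(sorted_chars.begin(),
                                       sorted_chars.end(), c);
        }),
        s.end());
}
```

The sort is a one-time cost, so in practice you would compute the sorted whitelist once and reuse it for every string you filter.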