Most elegant and efficient way to remove disallowed characters from a C++ string?

Question

I am using C++11 and am wondering what the most elegant way is to process an existing C++ string so that it contains only the valid characters below. Efficiency is also a concern, but I am looking for elegance foremost.

"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-";

Thank you, Virgil.

Answer 1

Here's my go:

void removeDisallowed(std::string& in) {
    static const std::string allowed = "01234...";
    in.erase(
        std::remove_if(in.begin(), in.end(), [&](const char c) {
            return allowed.find(c) == std::string::npos;
        }),
        in.end());
}

If you want to make it more efficient, you could make a set:

std::unordered_set<char> allowedSet(allowed.begin(), allowed.end());

And change the check to:

return !allowedSet.count(c);
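
Putting the two changes together, a minimal sketch might look like the following. The function name removeDisallowedWithSet is made up for illustration, and the allowed string is the one from the question, since the answer elides it.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>

// Hypothetical name: same shape as removeDisallowed(), but the predicate
// looks each character up in a hash set instead of searching a string.
void removeDisallowedWithSet(std::string& in) {
    // Allowed characters, copied from the question.
    static const std::string allowed =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-";
    // Built once, on the first call.
    static const std::unordered_set<char> allowedSet(allowed.begin(), allowed.end());

    in.erase(
        std::remove_if(in.begin(), in.end(), [](const char c) {
            // Remove the character when it is not in the set.
            return allowedSet.count(c) == 0;
        }),
        in.end());
}
```

Whether the set actually beats the plain string search for a list this short is something to measure rather than assume.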

[Update] Based on a lot of good comments and answers, I'd suggest just writing a helper:

template <typename F>
void erase_if(std::string& in, F func) {
    // Pass in.end() as well: the one-iterator erase() removes a single character.
    in.erase(std::remove_if(in.begin(), in.end(), func), in.end());
}

Then actually run it with each of the proposed funcs and see which one works best for your use case. This won't work with Dietmar's answer, so you'll have to try that one separately, but they're probably all worth a shot.
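
As a rough illustration of swapping predicates, here is a sketch with two candidates. The helper is renamed erase_chars_if (a made-up name) so it cannot clash with std::erase_if from C++20, and both predicate names are hypothetical as well. The ctype variant assumes the default "C" locale, where std::isalnum matches only ASCII letters and digits.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical name for the helper: erase every character for which func is true.
template <typename F>
void erase_chars_if(std::string& in, F func) {
    in.erase(std::remove_if(in.begin(), in.end(), func), in.end());
}

// Candidate 1: linear search in the allowed string from the question.
bool not_in_allowed_string(char c) {
    static const std::string allowed =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-";
    return allowed.find(c) == std::string::npos;
}

// Candidate 2: character classification from <cctype>.
bool not_alnum_dash_underscore(char c) {
    // std::isalnum must be given a value representable as unsigned char.
    const unsigned char uc = static_cast<unsigned char>(c);
    return !(std::isalnum(uc) || c == '-' || c == '_');
}
```

Calling erase_chars_if(s, not_in_allowed_string) or erase_chars_if(s, not_alnum_dash_underscore) on the same input should give the same result for ASCII text, which makes the two easy to compare.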

Answer 2

It seems the most elegant approach would be to use a regular expression (note the enclosing square brackets):

std::regex const filter("[^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-]");
str = std::regex_replace(str, filter, "");
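
For completeness, the two lines can be wrapped into a small function. This is only a sketch: the name keep_allowed is made up, and making the regex a function-local static is one assumption about how to avoid rebuilding it on every call.

```cpp
#include <regex>
#include <string>

// Hypothetical wrapper: returns a copy of str with every character outside
// the allowed set removed.
std::string keep_allowed(const std::string& str) {
    // Constructed once, on the first call. The leading ^ negates the class,
    // and the trailing - is taken literally because it is the last character.
    static const std::regex filter(
        "[^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-]");
    return std::regex_replace(str, filter, "");
}
```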

Following the comments about performance, I cooked up a quick benchmark, which is checked in on GitHub. It compares some of the suggestions. Here is a summary of the results from a run on a macOS notebook with recent versions of gcc and clang using high optimization options. The numbers shown are the times taken, in μs, to process a lengthy text document:

benchmark                         gcc      clang
regex (build)                    186131    552697
regex (prebuild)                 177959    566353
use_remove_if_str_find            44802     40644
use_remove_if_find                88377    123237
use_remove_if_binary_search       54091     64065
use_remove_if_ctype               13818     12901
use_remove_if_hash                81341     58582
use_remove_if_table                9033     10203

The first two benchmarks use the regex approach posted above, while the others use Barry's std::remove_if() with different ways of implementing the predicate inside the lambda. To clarify the names, here is an outline of what each one does (inside a lambda, combined with erase() as needed, etc.):

regex (build): text = std::regex_replace(text, std::regex("[^" + allowed + "]"), "")
regex (prebuild): text = std::regex_replace(text, filter, "") (building the regex is outside the timing)
remove_if str find: std::remove_if(... a.find(c))
remove_if find: std::remove_if(... std::find(a.begin(), a.end(), c) == a.end())
remove_if binary_search: std::remove_if(... std::binary_search(a.begin(), a.end(), c))
remove_if ctype: std::remove_if(... isalnum(c) || c == '-' || c == '_')
remove_if hash: std::remove_if(... unordered_set.count(c))
remove_if table: std::remove_if(... table[c])

For details, have a look at the source.
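
To try a comparison like this yourself, something along these lines could serve as a starting point. It is not the harness from the linked repository: time_filter_us is a hypothetical helper, and a single run with std::chrono::steady_clock is a deliberately crude measurement.

```cpp
#include <chrono>
#include <string>

// Hypothetical helper: runs filter on a private copy of text once and
// returns the elapsed time in microseconds. If result is non-null, the
// filtered text is stored there so the output can be checked.
template <typename Filter>
long long time_filter_us(std::string text, Filter filter,
                         std::string* result = nullptr) {
    const auto start = std::chrono::steady_clock::now();
    filter(text);  // the filter is expected to modify text in place
    const auto stop = std::chrono::steady_clock::now();
    if (result != nullptr) {
        *result = text;
    }
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
        .count();
}
```

Any callable taking a std::string& fits, so each variant in the outline can be timed on the same input. Repeating the run several times and using a long document should reduce noise.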

Answer 3

This may be simplistic, but I would consider using a constant-time lookup table that fits in a few cache lines.

void remove_disallowed(std::string &str)
{
    static const char disallowed[] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
    };
    str.erase(std::remove_if(str.begin(), str.end(), [&](char c) {
        return disallowed[static_cast<unsigned char>(c)];
    }), str.end());
}

Answer 4

#include <array>
#include <string>
#include <limits>
#include <iostream>
#include <algorithm>
#include <unordered_set>

void keep_chars_in_set(std::string &s, const std::unordered_set<char> &chars) {
    s.erase(
        std::remove_if(s.begin(), s.end(), [&chars](const char c) {
            return !chars.count(c);
        }),
        s.end());
}

void keep_sorted_chars(std::string &s, const std::string &sorted_chars) {
    s.erase(
        std::remove_if(s.begin(), s.end(), [&sorted_chars](const char c) {
            return !std::binary_search(sorted_chars.begin(), sorted_chars.end(), c);
        }),
        s.end());
}

// One entry per possible unsigned char value (0..255), hence max() + 1.
using lookup_table = std::array<bool, std::numeric_limits<unsigned char>::max() + 1>;

lookup_table make_lookup_table(const std::string &s) {
    lookup_table t = {};
    for (auto c : s) {
        // Index via unsigned char so a negative char cannot go out of bounds.
        t[static_cast<unsigned char>(c)] = true;
    }
    return t;
}

void keep_chars_in_lookup_table(std::string &s, const lookup_table &table) {
    s.erase(
        std::remove_if(s.begin(), s.end(), [&table](const char c) {
            return !table[static_cast<unsigned char>(c)];
        }),
        s.end());
}

int main() {
    using namespace std;

    string s1 = "abcdefxabc";
    string s2 = "abcdefyabc";
    string s3 = "abcdefzabc";

    const unordered_set<char> set_of_chars = {'a', 'b', 'c', 'd', 'e', 'f'};
    keep_chars_in_set(s1, set_of_chars);
    cout << s1 << endl;

    keep_sorted_chars(s2, "abcdef");
    cout << s2 << endl;

    const lookup_table &char_lookup_table = make_lookup_table("abcdef");
    keep_chars_in_lookup_table(s3, char_lookup_table);
    cout << s3 << endl;
}

Notes:

binary_search should be faster than find: O(lg N) vs O(N) per lookup, so the full solutions are O(M lg N) vs O(M * N), where M is the length of the string and N the number of allowed characters.
unordered_set is not a contiguous data structure, so even if lookup is O(1) (making the full solution O(M)), it may not be cache friendly; hence you should profile.
The lookup table method should be the fastest: it is both cache friendly and has lower complexity (O(M)).
This draws from several other answers.
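
One caveat for the binary_search variant: std::binary_search only works on a sorted range, and the allowed string from the question (digits, then lowercase, then uppercase, then '_' and '-') is not in ASCII order. A sketch that sorts a copy first, under the made-up name keep_chars_sorted_first:

```cpp
#include <algorithm>
#include <string>

// Hypothetical wrapper: takes the allowed characters by value, sorts that
// copy, and then filters s with std::binary_search.
void keep_chars_sorted_first(std::string& s, std::string allowed) {
    // Sorting establishes the precondition of std::binary_search.
    std::sort(allowed.begin(), allowed.end());
    s.erase(
        std::remove_if(s.begin(), s.end(), [&allowed](const char c) {
            return !std::binary_search(allowed.begin(), allowed.end(), c);
        }),
        s.end());
}
```

If the same set is reused many times, sorting it once up front and calling keep_sorted_chars directly avoids the repeated sort.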

4 answers

Answer 1  9  2014-11-07 19:26:38
Answer 2  7  2014-11-07 19:52:42
Answer 3  6  2014-11-07 20:08:18
Answer 4  2  2014-11-07 19:58:14
