Ignoring a few different words... C++?

Question

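I'm reading in multiple documents and building an index of the words I read. However, I want to ignore some common words (a, an, the, and, is, or, are, etc.).

Is there a shortcut for doing this? Something more than just doing...

if (word == "and" || word == "is" || etc, etc) ignore word;

For example, could I somehow put them into a const string and just check against that string? Not sure... thanks!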

Answer 1

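Create a set<string> containing the words you want to exclude, and use mySet.count(word) to determine whether a word is in the set. If it is, the count will be 1; otherwise it will be 0.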

#include <iostream>
#include <set>
#include <string>
using namespace std;

int main() {
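    // Words to exclude from the index.
    const char *words[] = {"a", "an", "the"};
    // Build the set from the array range [words, words + 3).
    set<string> wordSet(words, words + 3);
    cerr << wordSet.count("the") << endl;    // prints 1: "the" is in the set
    cerr << wordSet.count("quick") << endl;  // prints 0: "quick" is not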
    return 0;
}
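
To tie this back to the indexing problem in the question, here is a minimal sketch of how such a set might be used while counting words. The function name buildIndex, the map-based index, and the whitespace-only tokenization are assumptions made for illustration; none of them come from the original post.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: counts every word in `text` that is not in `stopWords`.
// Words are split on whitespace only; case and punctuation are left untouched.
std::map<std::string, int> buildIndex(const std::string& text,
                                      const std::set<std::string>& stopWords) {
    std::map<std::string, int> index;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        if (stopWords.count(word) != 0) {
            continue;  // common word: skip it
        }
        ++index[word];  // operator[] creates the entry with 0 on first use
    }
    return index;
}
```

Calling buildIndex("the quick fox", wordSet) with the set from the example above would leave "the" out of the result and count the other two words once each.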

Answer 2

You can use an array of strings, looping over it and comparing against each string, or use a more optimized data structure such as a set or a trie.

Here is an example using a plain array:

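#include <cstring>  // strcmp

// `word` is assumed to be a C string (const char *).
const char *commonWords[] = {"and", "is" /* ... */};
int commonWordsLength = 2; // number of words in the array

for (int i = 0; i < commonWordsLength; ++i)
{
    // strcmp returns 0 when the two strings are equal
    if (!strcmp(word, commonWords[i]))
    {
        // ignore word
        break;
    }
}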

Note that this example doesn't use the C++ STL, but you should.
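
As a rough sketch of what an STL version of that loop could look like, std::find from <algorithm> can replace the hand-written comparison loop. The function name isCommonWord and the two-word list are placeholders for illustration, and the code assumes C++11 or later for std::begin and std::end.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Hypothetical helper: returns true if `word` is one of the common words.
bool isCommonWord(const std::string& word) {
    // Placeholder list; extend it with whatever words should be ignored.
    static const std::string commonWords[] = {"and", "is"};
    // std::find returns the end iterator when no element matches.
    return std::find(std::begin(commonWords), std::end(commonWords), word)
           != std::end(commonWords);
}
```

This is still a linear search, so it only tidies up the loop; for a longer word list a set is likely the better fit.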

Answer 3

If you want to maximize performance, you should create a trie...

http://en.wikipedia.org/wiki/Trie

...of stop words...

http://en.wikipedia.org/wiki/Stop_words

There is no standard C++ trie data structure, but see this question for third-party implementations...

Trie implementation
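
For readers who want to see the idea without pulling in a third-party library, here is a minimal sketch of a trie that stores stop words. The class name StopWordTrie and its interface are assumptions for illustration only, and the code is not tuned for speed; whether a trie actually beats a hash table depends on the word list and on the implementation.

```cpp
#include <map>
#include <memory>
#include <string>

// Minimal trie sketch: each node maps a character to a child node.
class StopWordTrie {
    struct Node {
        bool isWord = false;  // true if a stored word ends at this node
        std::map<char, std::unique_ptr<Node>> children;
    };
    Node root_;

public:
    // Adds `word` to the trie, creating nodes as needed.
    void insert(const std::string& word) {
        Node* node = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) {
                child.reset(new Node());
            }
            node = child.get();
        }
        node->isWord = true;
    }

    // Returns true only if `word` was inserted as a complete word.
    bool contains(const std::string& word) const {
        const Node* node = &root_;
        for (char c : word) {
            auto it = node->children.find(c);
            if (it == node->children.end()) {
                return false;  // no path for this character
            }
            node = it->second.get();
        }
        return node->isWord;
    }
};
```

After inserting "a", "an" and "the", contains("an") is true, while contains("th") is false because "th" is only a prefix of a stored word.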

If you can't be bothered with that and want to use a standard container, the best one to use is unordered_set<string>, which puts the stop words in a hash table.

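#include <string>
#include <unordered_set>
using namespace std;

// Returns true if the word should be kept (it is not a stop word).
bool filter(const string& word)
{
    static unordered_set<string> stopwords({"a", "an", "the"});
    return !stopwords.count(word);
}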
