
How to remove elements by substrings from an STL container

I have a vector of objects. The objects are term nodes that, among other fields, contain a string field holding the term string.

class TermNode {
private:
    std::wstring term;
    double weight;
    ...
public:
    ...
};

After some processing and score calculation, these objects are finally stored in a vector of TermNode pointers such as

std::vector<TermNode *> termlist;

A dump of this vector, which contains up to 400 entries, looks like this:

DEBUG: 'knowledge' term weight=13.5921
DEBUG: 'discovery' term weight=12.3437
DEBUG: 'applications' term weight=11.9476
DEBUG: 'process' term weight=11.4553
DEBUG: 'knowledge discovery' term weight=11.4509
DEBUG: 'information' term weight=10.952
DEBUG: 'techniques' term weight=10.4139
DEBUG: 'web' term weight=10.3733
...

What I am trying to do is clean up that final list by removing single terms that are also contained in phrases inside the term list. For example, looking at the list snippet above, there is the phrase 'knowledge discovery', and therefore I would like to remove the single terms 'knowledge' and 'discovery', because they are also in the list and redundant in this context. I want to keep the phrases that contain the single terms. I am also thinking about removing all strings of 3 characters or fewer, but that is just a thought for now.
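
The length-based cleanup mentioned above could be sketched with the erase-remove idiom as follows. This is only a minimal illustration: the TermNode struct here is a simplified, hypothetical stand-in for the real class (public fields instead of getters), removeShortTerms is a made-up name, and a C++11 compiler is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the TermNode class in the question.
struct TermNode {
    std::wstring term;
    double weight;
};

// Erases every pointer whose term has at most maxLen characters.
// The nodes themselves are not deleted; ownership stays with the caller.
void removeShortTerms(std::vector<TermNode*>& terms, std::size_t maxLen) {
    terms.erase(
        std::remove_if(terms.begin(), terms.end(),
                       [maxLen](const TermNode* node) {
                           assert(node != nullptr);  // null entries are not expected
                           return node->term.length() <= maxLen;
                       }),
        terms.end());
}
```

Calling removeShortTerms(termlist, 3) would drop an entry such as 'web' while keeping the relative order of the remaining terms, since std::remove_if preserves it.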

For this cleanup process I would like to write a class using remove_if / find_if (using the new C++ lambdas), and it would be nice to have that code in a compact class.

I am not really sure how to solve this. The problem is that I would first have to identify which strings to remove, probably by setting a flag as a delete marker. That would mean I would have to pre-process the list: I would have to find the single terms and the phrases that contain one of those single terms. I think that is not an easy task and would need some advanced algorithm. Using a suffix tree to identify substrings?

Another loop over the vector, and maybe a copy of the same vector, could do the cleanup. I am looking for the most time-efficient approach.

I have been playing with the ideas shown in "std::list erase incompatible iterator", which uses remove_if / find_if, and in "Erasing multiple objects from a std::vector?".

So the question is basically: is there a smart way to do this that avoids multiple loops, and how could I identify the single terms for deletion? Maybe I am missing something, but perhaps someone out there can give me a good hint.

Thanks for your thoughts!

Update

I implemented the removal of redundant single terms the way Scrubbins recommended, as follows:

/**
 * Functor that reads the term of each TermNode object, checks whether the
 * term string contains spaces (i.e. the term is a phrase), splits the phrase
 * on spaces and stores the resulting tokens in a set. Only terms with a
 * weight of at least 'skipAtWeight' are taken into account.
 */
struct findPhrasesAndSplitIntoTokens {
private:
    set<wstring> tokens;
    double skipAtWeight;

public:
    findPhrasesAndSplitIntoTokens(const double skipAtWeight)
    : skipAtWeight(skipAtWeight) {
    }

    /**
     * Implements operator()
     */
    void operator()(const TermNode * tn) {
        // --- skip all terms lower skipAtWeight
        if (tn->getWeight() < skipAtWeight)
            return;

        // --- get term
        wstring term = tn->getTerm();
        // --- iterate over term, check for spaces (if this term is a phrase)
        for (unsigned int i = 0; i < term.length(); i++) {
            // --- iswspace (from <cwctype>) is the wide-character
            // --- counterpart of isspace and is safe for wchar_t input
            if (iswspace(term.at(i))) {
                if (0) {
                    wcout << "input term=" << term << endl;
                }
                // --- simply tokenize the term by whitespace and store the
                // --- tokens in the tokens set
                // --- TODO: check if this really is UTF-8 aware, esp. for
                // --- strings containing umlauts, etc  !!
                wistringstream iss(term);
                copy(istream_iterator<wstring,
                        wchar_t, std::char_traits<wchar_t> >(iss),
                    istream_iterator<wstring,
                        wchar_t, std::char_traits<wchar_t> >(),
                    inserter(tokens, tokens.begin()));
                if (0) {
                    wcout << "size of token set=" << tokens.size() << endl;
                    for_each(tokens.begin(), tokens.end(), printSingleToken());
                }
                // --- the whole phrase has been tokenized, so stop scanning
                // --- instead of re-tokenizing it at every further space
                break;
            }
        }
    }

    /**
     * return set of extracted tokens
     */
    set<wstring> getTokens() const {
        return tokens;
    }
};

/**
 * Functor to find terms in tokens set
 */
class removeTermIfInPhraseTokensSet {
private:
    set<wstring> tokens;

public:
    removeTermIfInPhraseTokensSet(const set<wstring>& termTokens)
    : tokens(termTokens) {
    }

    /**
     * Implements operator()
     */
    bool operator()(const TermNode * tn) const {
        // --- true if the term is one of the tokens extracted from a phrase
        return tokens.find(tn->getTerm()) != tokens.end();
    }
};

...

findPhrasesAndSplitIntoTokens objPhraseTokens(6.5);
objPhraseTokens = std::for_each(
    termList.begin(), termList.end(), objPhraseTokens);
set<wstring> tokens = objPhraseTokens.getTokens();
wcout << "size of tokens set=" << tokens.size() << endl;
for_each(tokens.begin(), tokens.end(), printSingleToken());

// --- remove all extracted single tokens from the final terms list
// --- of similar search terms 
removeTermIfInPhraseTokensSet removeTermIfFound(tokens);
termList.erase(
    remove_if(
        termList.begin(), termList.end(), removeTermIfFound),
    termList.end()
);

for (vector<TermNode *>::const_iterator tl_iter = termList.begin();
      tl_iter != termList.end(); tl_iter++) {
    wcout << "DEBUG: '" << (*tl_iter)->getTerm() << "' term weight=" << (*tl_iter)->getNormalizedWeight() << endl;
    if ((*tl_iter)->getNormalizedWeight() <= 6.5) break;
}

...

I couldn't use the C++11 lambda syntax, because my Ubuntu servers currently have g++ 4.4.1 installed. Anyway, it does the job for now. The next step is to check the quality of the resulting weighted terms against other search result sets, see how I can improve it, and find a way to boost the more relevant terms in conjunction with the original query term. That might not be an easy task; I wish there were some "simple heuristics". But that might be another question once I have gotten a little further :-)

So thanks to all for this rich contribution of thoughts!

What you need to do first is iterate through the list and split all the multi-word values into single words. If you're allowing Unicode, this means you will need something akin to ICU's BreakIterators; otherwise you can go with a simple punctuation/whitespace split. Once each string is split into its constituent words, use a hash map to keep a list of all the current words. When you reach a multi-word value, you can then check whether its words have already been found. This should be the simplest way to identify duplicates.
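
A rough sketch of this idea, under a few assumptions: a simple whitespace split is enough (no ICU), std::unordered_set serves as the hash container, a C++11 compiler is available, and TermNode is reduced to a hypothetical struct with public fields. The function names are made up for illustration. It takes two passes: one to collect the words of all phrases, one to remove the single terms found among them.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified stand-in for the TermNode class in the question.
struct TermNode {
    std::wstring term;
    double weight;
};

// Pass 1: collect the individual words of every multi-word term (phrase).
std::unordered_set<std::wstring> collectPhraseWords(
        const std::vector<TermNode*>& terms) {
    std::unordered_set<std::wstring> words;
    for (const TermNode* node : terms) {
        assert(node != nullptr);  // null entries are not expected
        std::wistringstream stream(node->term);
        std::vector<std::wstring> parts;
        std::wstring word;
        while (stream >> word) {  // split on whitespace
            parts.push_back(word);
        }
        if (parts.size() > 1) {   // only phrases contribute words
            words.insert(parts.begin(), parts.end());
        }
    }
    return words;
}

// Pass 2: erase every term that is exactly one of the collected words.
// Phrases stay in the list because a whole phrase is never in the set.
void removeTermsCoveredByPhrases(std::vector<TermNode*>& terms) {
    const std::unordered_set<std::wstring> words = collectPhraseWords(terms);
    terms.erase(
        std::remove_if(terms.begin(), terms.end(),
                       [&words](const TermNode* node) {
                           return words.count(node->term) > 0;
                       }),
        terms.end());
}
```

With expected constant-time set lookups, both passes are roughly linear in the total length of all terms, so no suffix tree is needed for exact whole-word matches. The sketch only erases pointers from the vector and does not delete the nodes.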

I suggest using the "erase-remove" idiom in this way:

struct YourConditionFunctor {
    bool operator()(TermNode* term) {
        if (/* you have to remove term */) {
           delete term;
           return true;
        }
        return false;
    }
};

and then write:

termlist.erase(
    remove_if(
        termlist.begin(),
        termlist.end(), 
        YourConditionFunctor()
    ), 
    termlist.end()
);
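
One possible variation, in case the vector owns the nodes and the predicate should stay free of side effects: move the unwanted pointers to the end with std::stable_partition, delete them there, and then erase the tail. After std::remove_if the values left in the tail are unspecified, so they cannot be deleted safely, which is why this sketch uses a partition instead. The TermNode struct and the eraseAndDelete helper are hypothetical names, and C++11 is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the TermNode class in the question.
struct TermNode {
    std::wstring term;
    double weight;
};

// Deletes and erases every node for which shouldRemove returns true.
// Assumes the vector owns its nodes and that they were allocated with new.
template <typename Predicate>
void eraseAndDelete(std::vector<TermNode*>& terms, Predicate shouldRemove) {
    // Kept nodes go to the front in their original order,
    // nodes to be removed are gathered at the back.
    auto firstRemoved = std::stable_partition(
        terms.begin(), terms.end(),
        [&shouldRemove](TermNode* node) {
            assert(node != nullptr);  // null entries are not expected
            return !shouldRemove(node);
        });
    // Free the removed nodes, then drop their pointers from the vector.
    std::for_each(firstRemoved, terms.end(),
                  [](TermNode* node) { delete node; });
    terms.erase(firstRemoved, terms.end());
}
```

Storing std::unique_ptr<TermNode> in the vector would be another option on a C++11 compiler, since erasing an element would then release the node automatically.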
