
Find super-string in a set of strings

I have a list of strings, like:

cargo
cargo pants
cargo pants men buy
cargo pants men
cargo pants men melbourne buy

Here, the string that contains all the remaining strings is cargo pants men melbourne buy. I'd like to remove all the shorter strings and preserve only the longest "super string".

Note that if two queries, cargo pants and cargo shorts, both exist, they will be treated as two different queries and won't be combined.

So far, I've been doing this the brute-force way: pick a string from the set and walk through the same set, deleting all other strings that are "substrings" of the current string. Roughly:

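// Needs java.util.Iterator. An explicit iterator is used because calling
// big_set.remove(p) inside a for-each loop over big_set throws
// ConcurrentModificationException.
for (Iterator<String> it = big_set.iterator(); it.hasNext(); ) {
    String p = it.next();
    for (String q : big_set) {
        // If all words in 'p' are also in 'q', then 'p' is redundant.
        if (!p.equals(q) && has_all_words(p, q)) {
            it.remove();
            break; // the inner iterator is not used again after the removal
        }
    }
}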

Is there an intelligent algorithm to do this in less than O(n^2) time? Here, has_all_words preserves the order of words while comparing.
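The question does not show has_all_words, so the following is only a minimal sketch of one plausible reading: every word of p must occur in q in the same relative order, though not necessarily adjacently. The class name WordSubsequence, the method name hasAllWords, and the whitespace-based word splitting are assumptions made for illustration.

```java
class WordSubsequence {
    /**
     * Returns true if every word of p appears in q in the same relative
     * order (an ordered-subsequence check). Hypothetical stand-in for the
     * has_all_words helper, which the question does not show.
     */
    static boolean hasAllWords(String p, String q) {
        String[] pWords = p.trim().split("\\s+");
        String[] qWords = q.trim().split("\\s+");
        int matched = 0; // number of words of p matched so far
        for (String word : qWords) {
            if (matched < pWords.length && pWords[matched].equals(word)) {
                matched++;
            }
        }
        return matched == pWords.length;
    }
}
```

Under this reading, "cargo pants men buy" is covered by "cargo pants men melbourne buy", while "cargo shorts" is not covered by "cargo pants".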

For the curious, I have a massive list of a few billion search queries (like the ones sent to Google/Yahoo/Bing) and I'm trying to find hypernyms for these queries. There's a server that parses each query string and produces various interesting categories. I am trying to compress the query list in the hope of minimizing compute cost and bandwidth. This method surely reduces bandwidth significantly (because humans can't just think of buy cargo pants melbourne in one go), but the pre-computation cost is prohibitive. So I've been hunting for algorithms that can do this, but I haven't come across any yet.

  • I think all you want is to remove every substring that can be found in a super-string. In a case like ["foo bar", "foo baz"], you will have to store both strings.

  • If my guess is right, then yes, you can achieve it in less than O(n^2). Before starting with anything else, sort each super-string alphabetically so that no case like cargo pants / pants cargo men buy remains.

  • First, sort your strings in decreasing order of their
    lengths. Then pick up the substrings of the longest string (we are
    iterating from the first index and have sorted in reverse order) and
    start searching for them in the rest of the strings.

  • If a string is found, remove it. Once the searching and removing completes, iterate again with the next substring of the same super-string, with the last substring included.

  • In the end you will be left with only the strings that are unique (if you consider each of ["foo bar", "foo baz"] to be a unique string).
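The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: "substring" is read as an ordered subsequence of words (so that cargo pants men buy is covered by cargo pants men melbourne buy), words are separated by single spaces, and each query has few words (well under 31), since all 2^w word subsequences of a w-word query are enumerated. The names SuperStrings and keepSuperStrings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SuperStrings {
    /** Keeps only the queries that are not word subsequences of another query. */
    static Set<String> keepSuperStrings(Collection<String> queries) {
        // De-duplicate, then sort by word count, longest first.
        List<String> sorted = new ArrayList<>(new HashSet<>(queries));
        sorted.sort(Comparator.comparingInt((String s) -> s.split(" ").length).reversed());

        Set<String> alive = new HashSet<>(sorted);   // queries not yet covered
        Set<String> result = new LinkedHashSet<>();  // surviving super-strings

        for (String s : sorted) {
            if (!alive.contains(s)) {
                continue; // already covered by a longer query
            }
            result.add(s);
            String[] words = s.split(" ");
            int n = words.length; // assumption: n is small (short search queries)
            // Enumerate every proper, non-empty word subsequence via bitmasks
            // and delete it from the set with an O(1) expected hash lookup.
            for (int mask = 1; mask < (1 << n) - 1; mask++) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < n; i++) {
                    if ((mask & (1 << i)) != 0) {
                        if (sb.length() > 0) {
                            sb.append(' ');
                        }
                        sb.append(words[i]);
                    }
                }
                alive.remove(sb.toString());
            }
        }
        return result;
    }
}
```

For the example list in the question plus cargo shorts, this keeps cargo pants men melbourne buy and cargo shorts. The cost is about O(n log n) for the sort plus O(n * 2^w) hash operations for n queries of at most w words, which avoids the pairwise comparison only as long as w stays small.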

