How to boil a list down to least common strings?

I have a HashSet<string> that I'm loading vulgar words into for filtering purposes. The problem is that my list will contain "Fu" as well as the word spelled out completely. What I want to do is filter the list down so that it contains only "Fu", which would eliminate every other form of the word from the list.

In other words, I want to remove every string in the list that contains another list item as a substring.

How should I go about doing this?

I have the following, where excludedWords is the original HashSet, but it's not working completely:

HashSet<string> copy = new HashSet<string>(excludedWords);

foreach (string w in copy)
{
    foreach (string s in copy)
    {
        if (w.Contains(s) && w.Length > s.Length)
        {
            result.Remove(w);
        }
    }
}

You should compare every word in the set to every other (distinct) word in the set. You can accomplish this as follows, although I'm sure this is not the most efficient method by any means:

string[] strings = { "a", "aa", "aaa", "b", "bb", "bbb", "c", "cc", "ccc" };
List<string> results = new List<string>(strings);

foreach (string str1 in strings) {
  foreach (string str2 in strings) {
    if (str1 != str2) {
      if (str2.Contains(str1)) {
        results.Remove(str2);
      }
    }
  }
}

return results;
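The nested loop above compares every pair of words. As a rough sketch of one way to cut down the comparisons, assuming ordinal, case-sensitive matching is acceptable, the words can be sorted by length so that each candidate is checked only against the shorter words already kept. The `WordListReducer` class and `KeepMinimal` method are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordListReducer
{
    // Returns the words that do not contain any other word from the input.
    public static List<string> KeepMinimal(IEnumerable<string> words)
    {
        var kept = new List<string>();

        // Shortest first: a word can only contain words that are shorter than it,
        // so checking against the words kept so far is enough.
        foreach (string candidate in words.Distinct().OrderBy(w => w.Length))
        {
            bool containsKept = kept.Any(
                k => candidate.IndexOf(k, StringComparison.Ordinal) >= 0);

            if (!containsKept)
            {
                kept.Add(candidate);
            }
        }

        return kept;
    }
}
```

Called with the sample array above, `WordListReducer.KeepMinimal(strings)` should leave only "a", "b" and "c". The worst case is still quadratic, but long words are usually rejected after a few checks.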

Here's one way:

filter.RemoveAll(a => filter.Any(b => b != a && a.Contains(b)));

Here filter is a List<string> pre-populated with the filter strings.

Edit: I didn't see that you wanted Contains instead of StartsWith, so I made the necessary change.
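Since the question starts from a HashSet<string> rather than a List<string>, the same idea can be sketched with HashSet<T>.RemoveWhere. This is only an illustration: the `HashSetFilter` class and `RemoveSupersets` method are made-up names, and ordinal matching is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HashSetFilter
{
    // Removes, in place, every word that contains a different word from the same set.
    public static void RemoveSupersets(HashSet<string> excludedWords)
    {
        // Snapshot the set so the predicate never enumerates the collection
        // while RemoveWhere is modifying it.
        string[] snapshot = excludedWords.ToArray();

        excludedWords.RemoveWhere(w =>
            snapshot.Any(s => s != w && w.IndexOf(s, StringComparison.Ordinal) >= 0));
    }
}
```

For a set holding "Fu", "Fudge" and "Bar", the call should leave "Fu" and "Bar".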

Assuming you just want to throw away the longer values, you could use an IEqualityComparer<string> implementation to get the new set. Note that this comparer matches on prefixes (StartsWith), not on arbitrary substrings.

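// Requires: using System; using System.Collections.Generic; using System.Diagnostics;
private class ShortestSubStringComparer : IComparer<string>, IEqualityComparer<string>
{
    public int Compare(string x, string y)
    {
        if (x == null) return (y == null) ? 0 : -1;
        if (y == null) return 1;

        Debug.Assert(x != null && y != null);
        // Strings in a prefix relationship are ordered shortest first.
        if (this.Equals(x, y)) return x.Length.CompareTo(y.Length);
        return StringComparer.CurrentCulture.Compare(x, y);
    }

    public bool Equals(string x, string y)
    {
        // Handle nulls first: StartsWith throws on a null argument.
        if (x == null || y == null) return x == null && y == null;
        // Two strings count as "equal" when either one is a prefix of the other.
        // This relation is not transitive ("ab" and "ac" both match "a" but not
        // each other), so grouping can depend on enumeration order.
        return x.StartsWith(y) || y.StartsWith(x);
    }

    public int GetHashCode(string obj)
    {
        // Strings that Equals treats as equal must share a hash code, otherwise
        // GroupBy never compares them. A constant satisfies that at the cost of speed.
        return 0;
    }
}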

Your function can then use GroupBy to group the strings and select the first ordered item from each group, like so:

public HashSet<string> FindShortestSubString(HashSet<string> set)
{
    var comparer = new ShortestSubStringComparer();
    return new HashSet<string>(set.GroupBy(e => e, comparer).Select(g => g.OrderBy(e => e, comparer).First()));
}

Or Min might do the trick, in which case you don't need the IComparer<string> implementation either:

public HashSet<string> FindShortestSubString(HashSet<string> set)
{
    var comparer = new ShortestSubStringComparer();
    return new HashSet<string>(set.GroupBy(e => e, comparer).Select(g => g.Min(e => e)));
}

I would advise against this type of filtering. You may save some CPU cycles, but you'll get unintended consequences that may really confuse your users (or just make them plain mad).

For example, let's assume that this is your list of vulgar words:

foo bar foohead foolery

You want to filter all of these words out of some content. To be efficient, you remove foohead and foolery and just filter on the substring foo.

You're now going to filter innocuous words that contain foo but weren't in your original vulgar list.
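To make the false-positive risk concrete, here is a small sketch that compares a substring test with a whole-word test built from the four placeholder words above. The `FilterComparison` class and its members are hypothetical, and the use of `\b` word boundaries with case-insensitive matching is an assumption about what "whole word" should mean.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FilterComparison
{
    // Whole-word pattern: matches only the exact listed words.
    private static readonly Regex ListedWords = new Regex(
        @"\b(foo|bar|foohead|foolery)\b", RegexOptions.IgnoreCase);

    // Substring test: flags any text that contains "foo" anywhere.
    public static bool SubstringHit(string text)
    {
        return text.IndexOf("foo", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Whole-word test: flags text only when a listed word appears on its own.
    public static bool WholeWordHit(string text)
    {
        return ListedWords.IsMatch(text);
    }
}
```

With the input "I like food", `SubstringHit` flags the text while `WholeWordHit` does not. Both flag "what foolery".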

This reminds me of a recent Daily WTF (second one down):

http://thedailywtf.com/Articles/Progree-of-enail-Status.aspx

You could use regular expressions. This is in VB, but I'm sure you can convert it.

Example:

Imports System.Text.RegularExpressions

Public Class Form1

    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        ' Placeholder: substitute whatever the user has entered.
        Dim userInput As String = ""
        ' Replace every occurrence of "fu" with "**" (case-sensitive by default).
        Dim InputString As String = Regex.Replace(userInput, "fu", "**")
    End Sub

End Class
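A possible C# conversion of the same idea is sketched below. The `Censor` class and `Mask` method are illustrative names, and the RegexOptions.IgnoreCase flag is an addition that the VB version does not have, included on the assumption that "Fu" and "fu" should both be masked.

```csharp
using System.Text.RegularExpressions;

public static class Censor
{
    // Replaces every occurrence of "fu", in any casing, with "**".
    public static string Mask(string userInput)
    {
        return Regex.Replace(userInput, "fu", "**", RegexOptions.IgnoreCase);
    }
}
```

Like the substring filter discussed earlier, this also masks the "fu" inside harmless words such as "fun".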
