
Optimize string.Replace method

I have a list of 200+ words that are not allowed on a website. The string.Replace method below takes ~80ms. If I increase s < 1000 by a factor of 10.00 to s < 10,000, this delay goes to ~834ms, a 10.43x increase. I am worried about the scalability of this function, especially if the list increases in size. I was told strings are immutable and text.Replace() is creating 200 new strings in memory. Is there something similar to a StringBuilder for this?

List<string> FilteredWords = new List<string>();
FilteredWords.Add("RED");
FilteredWords.Add("GREEN");
FilteredWords.Add("BLACK");
for (int i = 1; i < 200; i++)
{ FilteredWords.Add("STRING " + i.ToString()); }

string text = "";

//simulate a large dynamically generated html page
for (int s = 1; s < 1000; s++)
{ text += @"Lorem ipsum dolor sit amet, minim BLACK cetero cu nam.
            No vix platonem sententiae, pro wisi congue graecis id, GREEN assum interesset in vix.
            Eum tamquam RED pertinacia ex."; }

// This is the function I seek to optimize
foreach (string s in FilteredWords)
{ text = text.Replace(s, "[REMOVED]"); }

If you expect most of the text to be relatively clean, then scanning the whole text first for matching words could be a better approach. You can also normalize the words at the same time to catch some standard substitutions.

I.e. scan the string by matching individual words (e.g. with a regular expression like "\\w+"), then look up each detected word (potentially its normalized value) in a dictionary of words to replace.

You can either scan first to get a list of "words to replace" and then replace the individual words later, or scan and build the resulting string at the same time (using StringBuilder or StreamWriter, obviously not String.Concat / +).

Note: Unicode provides a large number of good look-alike characters to use, so don't expect your effort to be very successful. E.g. try to find "cool" in the following text: "you are сооl".

Sample code (relying on Regex.Replace for tokenization and building the string, and on HashSet for matches):

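// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
var toFind = new HashSet<string>(FilteredWords);

// Note: \w+ matches single words only, so multi-word entries such as "STRING 1" never match.
text = new Regex(@"\w+")
   .Replace(text, m => toFind.Contains(m.Value) ? "[REMOVED]" : m.Value);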
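The "normalize at the same time" idea could look roughly like the sketch below. It treats case-insensitive comparison as the only normalization step, which is an assumption; the class name WordFilter and the method Redact are made up for illustration and are not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordFilter
{
    // Tokenizer: one match per run of word characters.
    private static readonly Regex Token = new Regex(@"\w+", RegexOptions.Compiled);

    // Hypothetical helper: replaces every token that appears in `banned`,
    // ignoring case. Only whole tokens are compared, so "Redress" is kept
    // even when "RED" is banned.
    public static string Redact(string text, IEnumerable<string> banned)
    {
        var toFind = new HashSet<string>(banned, StringComparer.OrdinalIgnoreCase);
        return Token.Replace(text, m => toFind.Contains(m.Value) ? "[REMOVED]" : m.Value);
    }
}
```

In a real page filter you would probably build the HashSet once and reuse it, rather than rebuilding it on every call as this sketch does.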

Use StringBuilder.Replace and try to do it as a batch operation. That is to say, you should try to create the StringBuilder only once, as it has some overhead. It won't necessarily be a lot faster, but it will be much more memory efficient.
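A minimal sketch of that batch approach, assuming the filtered words arrive as a plain sequence of strings (BatchReplacer and Redact are illustrative names, not from the question):

```csharp
using System.Collections.Generic;
using System.Text;

public static class BatchReplacer
{
    // Hypothetical helper: one StringBuilder for the whole batch,
    // mutated in place by each Replace call.
    public static string Redact(string text, IEnumerable<string> banned)
    {
        var sb = new StringBuilder(text);   // created once
        foreach (string word in banned)
        {
            sb.Replace(word, "[REMOVED]");  // no intermediate string per word
        }
        return sb.ToString();               // single final allocation
    }
}
```

This still makes one pass over the buffer per filtered word, so the running time keeps growing with the size of the list; what it avoids is allocating a new full-size string for every word.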

You should also probably do this sanitization only once instead of every time the data is requested. If you're reading the data from a database, consider sanitizing it once when the data is inserted, so there is less work to do when reading it and displaying it on the page.

There may be a better way, but this is how I would go about solving the problem.

You will need to create a tree structure that contains your dictionary of words to be replaced. The class may be something like:

public class Node 
{
    public Dictionary<char, Node> Children;
    public bool IsWord;
}

Using a dictionary for the Children may not be the best choice, but it makes for the easiest example here. Also, you will need a constructor to initialize the Children field. The IsWord field is used to deal with the possibility that a redacted "word" may be the prefix of another redacted "word", for example if you want to remove both "red" and "redress".

You will build the tree from each character in each of the replacement words. For example:

public void AddWord ( string word ) 
{
    // NOTE: this assumes word is non-null and contains at least one character...

    Node currentNode = Root;

    for (int iIndex = 0; iIndex < word.Length; iIndex++)
    {
        if (currentNode.Children.ContainsKey(word[iIndex]))
        {
            currentNode = currentNode.Children[word[iIndex]];
            continue;
        }

        Node newNode = new Node();
        currentNode.Children.Add(word[iIndex], newNode);
        currentNode = newNode;
    }

    // finished, mark the last node as being a complete word..
    currentNode.IsWord = true;
}

You'll need to deal with case sensitivity somewhere in there. Also, you only need to build the tree once; afterwards you can use it from any number of threads without worrying about locking, because you will only be reading from it. (Basically, I'm saying: store it in a static somewhere.)

Now, when you are ready to remove words from your string, you will need to do the following:

  • Create a StringBuilder instance to store the result.
  • Parse through your source string, looking for the start and stop of a "word". How you define "word" will matter. For simplicity I would suggest starting with Char.IsWhiteSpace as defining word separators.
  • Once you have determined that a range of characters is a "word", start from the root of the tree and locate the child node associated with the first character in the "word".
  • If you do not find a child node, the entire word is added to the StringBuilder.
  • If you find a child node, continue with the next character, matching against the Children of the current node, until you run out of either characters or nodes. If you run out of nodes first, the word is not in the tree and is added to the StringBuilder.
  • If you reach the end of the "word", check the last node's IsWord field. If it is true, the word is excluded; do not add it to the StringBuilder. If IsWord is false, the word is not replaced and you add it to the StringBuilder.
  • Repeat until you have exhausted the input string.

You will also need to add the word separators to the StringBuilder; hopefully that will be obvious as you parse the input string. If you are careful to use only the start and stop indices within the input string, you should be able to parse the entire string without creating any garbage strings.

When all of this is done, use StringBuilder.ToString() to get your final result.
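Put together, the steps above might look something like the following sketch. It makes a few assumptions that the description leaves open: case is handled by upper-casing each character, whitespace is the only word separator, and a matched word is replaced with "[REMOVED]" (as in the question) rather than dropped. TrieFilter and its members are illustrative names.

```csharp
using System.Collections.Generic;
using System.Text;

public sealed class TrieFilter
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node root = new Node();

    // Build the tree once; afterwards it is only read.
    public TrieFilter(IEnumerable<string> words)
    {
        foreach (string word in words)
        {
            Node current = root;
            foreach (char c in word)
            {
                char key = char.ToUpperInvariant(c);   // assumed case handling
                if (!current.Children.TryGetValue(key, out Node next))
                {
                    next = new Node();
                    current.Children.Add(key, next);
                }
                current = next;
            }
            current.IsWord = true;
        }
    }

    // True when text[start..end) spells a complete filtered word.
    private bool IsFiltered(string text, int start, int end)
    {
        Node current = root;
        for (int i = start; i < end; i++)
        {
            if (!current.Children.TryGetValue(char.ToUpperInvariant(text[i]), out current))
                return false;                          // ran out of nodes
        }
        return current.IsWord;                         // ran out of characters
    }

    public string Redact(string text, string replacement = "[REMOVED]")
    {
        var sb = new StringBuilder(text.Length);
        int i = 0;
        while (i < text.Length)
        {
            int start = i;
            if (char.IsWhiteSpace(text[i]))
            {
                // Copy the run of separators unchanged.
                while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
                sb.Append(text, start, i - start);
            }
            else
            {
                // Find the end of the word, then look it up by index range only.
                while (i < text.Length && !char.IsWhiteSpace(text[i])) i++;
                if (IsFiltered(text, start, i))
                    sb.Append(replacement);
                else
                    sb.Append(text, start, i - start);
            }
        }
        return sb.ToString();
    }
}
```

With whitespace as the only separator, a word followed by punctuation ("red,") is treated as a different word and kept, and multi-word entries such as "STRING 1" can never match. A different definition of "word" would change the two inner loops but not the tree lookup.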

You may also need to consider Unicode surrogate code points, but you can probably get away without worrying about them.

Beware, I typed this code here directly, so syntax errors, typos and other accidental misdirections are probably included.

The real regular expression solution would be:

var filteredWord = new Regex(@"\b(?:" + string.Join("|", FilteredWords.Select(Regex.Escape)) + @")\b", RegexOptions.Compiled);
text = filteredWord.Replace(text, "[REMOVED]");

I don't know whether this is faster (but note that it also only replaces whole words).
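One way to find out is to time both variants on the same input. The sketch below is only a rough harness using Stopwatch, not a rigorous benchmark; ReplaceTiming and its methods are hypothetical names, and the generated text is a shortened stand-in for the page in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class ReplaceTiming
{
    // Simulated page, built with StringBuilder instead of += concatenation.
    public static string BuildText(int repeats)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < repeats; i++)
            sb.Append("Lorem ipsum BLACK cetero, GREEN assum interesset, tamquam RED pertinacia. ");
        return sb.ToString();
    }

    // The loop from the question: one full-string Replace per word.
    public static string WithStringReplace(string text, IEnumerable<string> words)
    {
        foreach (string w in words) text = text.Replace(w, "[REMOVED]");
        return text;
    }

    // The single alternation regex from the answer above (whole words only).
    public static string WithRegex(string text, IEnumerable<string> words)
    {
        var re = new Regex(@"\b(?:" + string.Join("|", words.Select(Regex.Escape)) + @")\b");
        return re.Replace(text, "[REMOVED]");
    }

    // Wall-clock time of a single run of `action`.
    public static TimeSpan Time(Func<string> action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

For example, `ReplaceTiming.Time(() => ReplaceTiming.WithRegex(text, FilteredWords))` returns the elapsed time for one run. Run each variant several times and discard the first measurement, since JIT compilation and regex construction would otherwise skew the numbers.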
