
Why is my Regex for removing special characters adding more words to my text?

I encountered the problem when I tried to run my regex function on my text, which can be found here.

With an HttpRequest I fetch the text from the link above. Then I run my regex to clean up the text before counting the occurrences of a certain word.

After cleaning up the text I split the string by whitespace into a string array and noticed a huge difference in the number of elements.

Does anyone know why this happens? Counting occurrences of the word " the " in the raw data gives 6806 hits, which is the correct answer.

With my regex I get 8073 hits.

The regex I'm using is in the sandbox together with the text, and in the code below.

```
// Application storage.
var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);

// Cleaning up a bit
var words = CleanByRegex(rawSource);

string[] arr = words.Split(" ", StringSplitOptions.RemoveEmptyEntries);

string CleanByRegex(string rawSource)
{
    Regex r = RemoveSpecialChars();
    return r.Replace(rawSource, " ");
}

//  arr {string[220980]} - with regex
//  arr {string[157594]} - without regex

foreach (var word in arr)
{
    // some logic
}
```


```
partial class Program
{
    [GeneratedRegex("(?:[^a-zA-Z0-9]|(?<=['\\\"]\\s))", RegexOptions.IgnoreCase | RegexOptions.Compiled, "en-SE")]
    private static partial Regex RemoveSpecialChars();
}
```



I have tried debugging it, and I suspect that I'm adding trailing whitespace, but I don't know how to handle it.

I have tried adding a whitespace-removing regex that replaces runs of multiple spaces with a single space.

The regex would look something like [ ]{2,}:

```
partial class Program
{
    [GeneratedRegex("[ ]{2,}", RegexOptions.Compiled)]
    private static partial Regex RemoveWhiteSpaceTrails();
}
```

It would be helpful if you described what you're trying to clean up. However, your specific question is answerable: from the sandbox I see that you're removing newlines and punctuation. This can definitely lead to occurrences of "the " that weren't there before:

```
The quick brown fox jumps over the
lazy dog
//the+newline does not match

//after regex:
The quick brown fox jumps over the lazy dog
//now there's one more *the+space*
```

If you change your search to something less common, for example Seward, then you should see the same results before and after the regex.
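The effect described above can be reproduced with a minimal sketch. The sample string and the `CountLiteral` helper are hypothetical, and the pattern is only the first alternative of the question's regex, so treat this as an illustration under those assumptions rather than the asker's exact setup:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample text; the line break after "the" mirrors the example above.
string raw = "The quick brown fox jumps over the\nlazy dog";

// Same character class as the first alternative in the question's pattern:
// every non-alphanumeric character, including the newline, becomes a space.
string cleaned = Regex.Replace(raw, "[^a-zA-Z0-9]", " ");

// Hypothetical helper: counts non-overlapping, case-insensitive
// occurrences of a literal needle.
int CountLiteral(string text, string needle)
{
    int count = 0;
    int index = 0;
    while ((index = text.IndexOf(needle, index, StringComparison.OrdinalIgnoreCase)) >= 0)
    {
        count++;
        index += needle.Length;
    }
    return count;
}

int before = CountLiteral(raw, "the ");
int after = CountLiteral(cleaned, "the ");
Console.WriteLine($"before: {before}, after: {after}"); // prints "before: 1, after: 2"
```

The text itself contains the same words before and after cleaning; only the literal search for "the " followed by a space finds one more hit once the newline has been turned into a space.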

My belief that the regex created more text when I replaced matches with string.Empty or " " was wrong. I just created more matches.

I believed it because I thought that searching in Chrome via Ctrl+F would give me every occurrence of a word, and this isn't necessarily true.

To check whether Chrome's search really gives the correct answer, I ran my code against a subset of the Lorem Ipsum text instead.

The short answer is NO.

If I search for " the ", I won't get occurrences of "the" followed by Environment.NewLine, as @simmetric showed.

Another scenario is sentences that begin with the word "The ". Since I am interested in the words in the text, I used the regex \w+ to get the words. It returns a MatchCollection (usable as IList<Match>), which I then loop through to add each value to my dictionary.

Code demonstration:

```
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Case-insensitive word counts, as in the question.
var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);

var rawSource = "Some text";
var words = CleanByRegex(rawSource);

// Returns every run of word characters as a match.
IList<Match> CleanByRegex(string text)
{
    return Regex.Matches(text, "\\w+");
}

foreach (var word in words)
{
    if (word.Value.Length >= 1) // always true for \w+ matches; raise the bound to skip short words
    {
        if (dictionary.ContainsKey(word.Value)) // already in the dictionary
            dictionary[word.Value] = dictionary[word.Value] + 1; // increment the count
        else
            dictionary[word.Value] = 1; // first occurrence
    }
}
```
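To make the difference between the two counting approaches concrete, here is a small sketch that runs both on the same input. The sample sentence is made up for illustration; it places "the" at the start of a sentence, before a line break, before a comma, and between two spaces:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample text chosen to cover the cases discussed above.
string sample = "The cat saw the\ndog. Was it the, or not the one?";

// Tokenize on word characters and count case-insensitively.
var counts = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);
foreach (Match m in Regex.Matches(sample, @"\w+"))
{
    counts.TryGetValue(m.Value, out long current); // current stays 0 if the key is missing
    counts[m.Value] = current + 1;
}

// A literal search for " the " only sees occurrences with a space on both sides.
int literalHits = Regex.Matches(sample, Regex.Escape(" the ")).Count;

Console.WriteLine($"token count for 'the': {counts["the"]}"); // 4
Console.WriteLine($"literal ' the ' hits: {literalHits}");    // 1
```

Note that in .NET `\w` also matches digits and underscores, so tokens such as `2024` or `foo_bar` are counted as words by this approach.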
