Why is my Regex for removing special characters adding more words to my text?

I ran into this problem when I tried to run my regex function on my text, which can be found here.

With an HttpRequest I fetch the text from the link above. Then I run my regex to clean up the text before counting the occurrences of a certain word.

After cleaning up the text, I split the string on spaces into a string array and noticed a huge difference in the number of elements.

Does anyone know why this happens? In the raw data, the word " the " gets 6806 hits, which is the correct answer.

With my regex I get 8073 hits.

The regex I'm using is in the sandbox together with the text, and also in the code below.

// Application storage.
var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);

// Cleaning up a bit: every match of the regex is replaced with a single space.
var words = CleanByRegex(rawSource);

string[] arr = words.Split(" ", StringSplitOptions.RemoveEmptyEntries);

string CleanByRegex(string rawSource)
{
    Regex r = RemoveSpecialChars();
    return r.Replace(rawSource, " ");
}

//  arr {string[220980]} - with regex
//  arr {string[157594]} - without regex

foreach (var word in arr)
{
    // some logic
}

partial class Program
{
    [GeneratedRegex("(?:[^a-zA-Z0-9]|(?<=['\\\"]\\s))", RegexOptions.IgnoreCase | RegexOptions.Compiled, "en-SE")]
    private static partial Regex RemoveSpecialChars();
}

I have tried debugging it, and I suspect that I'm adding trailing whitespace, but I don't know how to handle it.

I have tried adding a whitespace-removing regex that replaces runs of multiple spaces with a single space.

The regex would look something like [ ]{2,}

partial class Program
{
    [GeneratedRegex("[ ]{2,}", RegexOptions.Compiled)]
    private static partial Regex RemoveWhiteSpaceTrails();
}

It would be helpful if you described what you're trying to clean up. However, your specific question is answerable: from the sandbox I see that you're removing newlines and punctuation. This can definitely create occurrences of " the " (with a space on both sides) that weren't there before:

The quick brown fox jumps over the
lazy dog
//the+newline does not match

//after regex:
The quick brown fox jumps over the lazy dog
//now there's one more *the+space*

If you change your search to something less common, for example Seward, then you should see the same results before and after the regex.
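The effect described above can be reproduced on a small scale. The following is a minimal sketch, not the asker's actual code: the sample text, the simplified pattern [^a-zA-Z0-9], and the helper CountOccurrences are assumptions made for illustration. It counts the literal " the " and the number of space-separated tokens before and after the replacement.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample text: "the" is followed by a newline once and by a space once.
string raw = "The quick brown fox jumps over the\nlazy dog. Then the dog woke up.";

// Simplified version of the question's regex: every non-alphanumeric
// character (including the newline and the periods) becomes a space.
string cleaned = Regex.Replace(raw, "[^a-zA-Z0-9]", " ");

int before = CountOccurrences(raw, " the ");
int after = CountOccurrences(cleaned, " the ");

// Split(" ") only splits on spaces, so "the\nlazy" is a single token in the raw text.
int tokensBefore = raw.Split(" ", StringSplitOptions.RemoveEmptyEntries).Length;
int tokensAfter = cleaned.Split(" ", StringSplitOptions.RemoveEmptyEntries).Length;

Console.WriteLine($"\" the \" before: {before}, after: {after}");
Console.WriteLine($"tokens before: {tokensBefore}, after: {tokensAfter}");

// Counts non-overlapping occurrences of needle in text (ordinal, case-sensitive).
static int CountOccurrences(string text, string needle)
{
    int count = 0;
    int index = 0;
    while ((index = text.IndexOf(needle, index, StringComparison.Ordinal)) >= 0)
    {
        count++;
        index += needle.Length;
    }
    return count;
}
```

On this sample the " the " count goes from 1 to 2 and the token count from 13 to 14. No text was added: the newline that used to hide one occurrence (and glue two words into one token) is now a space.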

I believed that the regex created more text when I replaced matches with string.Empty or " ". That is not true: I just created more matches.

I believed it because I thought that searching in Chrome with Ctrl+F would give me every occurrence of a word, and this isn't necessarily true.

I tried my code again, this time on a subset of the Lorem Ipsum text, because I wanted to check whether the Chrome search really gives the correct answer.

The short answer is no.

If I search for " the ", I won't get the occurrences of "the" + Environment.NewLine, as @simmetric showed.

Another scenario is sentences that begin with the word "The ". Since I am interested in the words in the text, I used the regex \w+ to get the words. It returns a MatchCollection (which implements IList<Match>) that I then loop through to add each value to my dictionary.

Code demonstration

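using System.Text.RegularExpressions;

// Same case-insensitive dictionary as in the question.
var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);
var rawSource = "Some text";
var words = CleanByRegex(rawSource);

IList<Match> CleanByRegex(string rawSource)
{
    // \w+ matches each run of word characters, so every match is one word.
    IList<Match> r = Regex.Matches(rawSource, "\\w+");
    return r;
}

foreach (var word in words)
{
    if (word.Value.Length >= 1) // always true for \w+ matches; raise the threshold to skip short words
    {
        if (dictionary.ContainsKey(word.Value)) // if it's in the dictionary
            dictionary[word.Value] = dictionary[word.Value] + 1; // increment the count
        else
            dictionary[word.Value] = 1; // put it in the dictionary with a count of 1
    }
}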
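To check the count for a single word without depending on the surrounding whitespace, one option (not part of the original post) is a word-boundary pattern such as \bthe\b combined with RegexOptions.IgnoreCase. The sketch below uses a hypothetical sample string and compares that pattern with a literal search for " the ".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: "the" appears at the start of a sentence, before a newline,
// as a plain word, and inside other words ("Then", "other").
string sample = "The fox jumps over the\nlazy dog. Then the other dog barked.";

// Literal search for " the ": needs a space on both sides and is case-sensitive here.
int literalHits = Regex.Matches(sample, Regex.Escape(" the ")).Count;

// Word-boundary search: matches "the" as a whole word, in any casing,
// regardless of whether a space, newline, or punctuation surrounds it.
int wordHits = Regex.Matches(sample, @"\bthe\b", RegexOptions.IgnoreCase).Count;

Console.WriteLine($"literal: {literalHits}, whole word: {wordHits}");
```

On this sample the literal search finds 1 occurrence and the word-boundary search finds 3, which is the same kind of gap as the one between the two counts in the question.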

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please credit this site's URL or the original address.