Removing words from text with separators in front (using Regex)

I need to remove certain words from a text together with the separators that follow them. I have already managed to remove the words, but I don't know how to remove the separators at the same time. Any suggestions?
At the moment I have:

static void Main(string[] args)
{
    Program p = new Program();
    string text = "";
    text = p.ReadText("Duomenys.txt", text);
    string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
    char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
    p.DeleteWordsFromText(text, wordsToDelete, separators);
}

public string ReadText(string file, string text)
{
    text = File.ReadAllText(file);
    return text;
}

public void DeleteWordsFromText(string text, string[] wordsToDelete, char[] separators)
{
    Console.WriteLine(text);
    for (int i = 0; i < wordsToDelete.Length; i++)
    {
        text = Regex.Replace(text, wordsToDelete[i], String.Empty);
    }
    Console.WriteLine("-------------------------------------------");
    Console.WriteLine(text);
}

The result should be:

how are you?
I am  good.

What I get:

, how are you?
, I am . good.

Duomenys.txt:

Hello, how are you? 
Thanks, I am kinda. good. 

You can build the regex as follows:

var regex = new Regex(@"\b(" 
    + string.Join("|", wordsToDelete.Select(Regex.Escape)) + ")(" 
    + string.Join("|", separators.Select(c => Regex.Escape(new string(c, 1)))) + ")?");

Explanation:

  • the \b at the start matches a word boundary, so that a word embedded in a longer one, such as "XYZThanks", is left alone
  • the next part builds a regex construct matching any of the wordsToDelete
  • the last part builds a regex construct matching any of the separators; the trailing "?" is there because you said you want to remove the word even if no separator follows
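To see what this pattern does on the sample input, here is a minimal sketch that wraps it in a helper. The class name OptionalSeparatorDemo and the methods Build and Clean are hypothetical names introduced for illustration; the pattern itself is the one built above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class OptionalSeparatorDemo
{
    // Builds \b(word1|word2|...)(sep1|sep2|...)? from the two lists.
    public static Regex Build(string[] wordsToDelete, char[] separators)
    {
        string words = string.Join("|", wordsToDelete.Select(Regex.Escape));
        string seps = string.Join("|", separators.Select(c => Regex.Escape(c.ToString())));
        return new Regex(@"\b(" + words + ")(" + seps + ")?");
    }

    // Removes each listed word plus at most ONE separator directly after it.
    public static string Clean(string text)
    {
        string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
        char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
        return Build(wordsToDelete, separators).Replace(text, string.Empty);
    }
}
```

Because the separator group is optional and not repeated, only one separator is consumed per word: OptionalSeparatorDemo.Clean("Hello, how are you?") returns " how are you?" with the leading space still in place. If a whole run of separators should go, the separator group needs a quantifier such as * instead of ?.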

You may build a regex like

\b(?:Hello|Thanks|kinda)\b[ .,!?:;()    ]*

where \b(?:Hello|Thanks|kinda)\b matches any of the words to delete as whole words, and [ .,!?:;() ]* matches any of your separators 0 or more times after the word.

The C# solution:

char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
string SepPattern = new String(separators).Replace(@"\", @"\\").Replace("^", @"\^").Replace("-", @"\-").Replace("]", @"\]");
var pattern = $@"\b(?:{string.Join("|", wordsToDelete.Select(Regex.Escape))})\b[{SepPattern}]*";
// => \b(?:Hello|Thanks|kinda)\b[ .,!?:;()  ]*
Regex rx = new Regex(pattern, RegexOptions.Compiled);
// RegexOptions.IgnoreCase can be added to the above flags for case insensitive matching: RegexOptions.IgnoreCase | RegexOptions.Compiled
DeleteWordsFromText("Hello, how are you?", rx);
DeleteWordsFromText("Thanks, I am kinda. good.", rx);

Here is the DeleteWordsFromText method:

public static void DeleteWordsFromText(string text, Regex p)
{
    Console.WriteLine($"---- {text} ----");
    Console.WriteLine(p.Replace(text, ""));
}

Output:

---- Hello, how are you? ----
how are you?
---- Thanks, I am kinda. good. ----
I am good.

Notes:

  • string SepPattern = new String(separators).Replace(@"\", @"\\").Replace("^", @"\^").Replace("-", @"\-").Replace("]", @"\]"); - this is the separator pattern that will be used inside a character class. Since only the ^, -, \ and ] chars require escaping inside a character class, only these chars are escaped.
  • $@"\b(?:{string.Join("|", wordsToDelete.Select(Regex.Escape))})\b" - this builds the alternation from the words to delete and only matches them as whole words.
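The escaping step only matters once the separator list contains one of those four characters, which the sample list does not. A minimal sketch, assuming a separator set that does include them; the class CharClassEscaping and its methods are hypothetical names, while the Replace chain is the one from the note above:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class CharClassEscaping
{
    // Escapes only the chars that are special inside [...].
    // The backslash is handled first so later escapes are not doubled.
    public static string EscapeForCharClass(char[] separators)
    {
        return new string(separators)
            .Replace(@"\", @"\\")
            .Replace("^", @"\^")
            .Replace("-", @"\-")
            .Replace("]", @"\]");
    }

    // Removes each word (whole-word match) and the run of separators after it.
    // Assumes separators is non-empty; an empty list would yield the invalid class [].
    public static string Clean(string text, string[] wordsToDelete, char[] separators)
    {
        string words = string.Join("|", wordsToDelete.Select(Regex.Escape));
        string pattern = $@"\b(?:{words})\b[{EscapeForCharClass(separators)}]*";
        return Regex.Replace(text, pattern, string.Empty);
    }
}
```

With separators { ' ', '-', ']', '^', '\\' } the character class becomes [ \-\]\^\\], and CharClassEscaping.Clean("Hello-]^ world", new[] { "Hello" }, separators) returns "world".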

Pattern details

  • \b - word boundary
  • (?: - start of a non-capturing group:
    • Hello - the word Hello
    • | - or
    • Thanks - the word Thanks
    • | - or
    • kinda - the word kinda
  • ) - end of the group
  • \b - word boundary
  • [ .,!?:;() ]* - any 0+ chars from the character class.
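The effect of the two word boundaries, and of the optional RegexOptions.IgnoreCase flag mentioned in the solution, can be checked with a short sketch. The class WholeWordDemo and its Clean method are hypothetical names; the pattern is the one detailed above, with the tab written as \t.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class WholeWordDemo
{
    private const string Pattern = @"\b(?:Hello|Thanks|kinda)\b[ .,!?:;()\t]*";

    // Removes the listed words as whole words, plus any separators after them.
    public static string Clean(string text, bool ignoreCase)
    {
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        return Regex.Replace(text, Pattern, string.Empty, options);
    }
}
```

WholeWordDemo.Clean("XYZThanks, hello, world", true) returns "XYZThanks, world": "XYZThanks" survives because there is no word boundary between Z and T, while "hello" is removed only because case is ignored. With ignoreCase set to false the same input comes back unchanged.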

See the regex demo.

I would not use Regex. Three months from now you will no longer understand the Regex, and fixing bugs will be hard.

I would use simple loops. Everyone will understand them:

public void DeleteWordsFromText(string text, string[] wordsToDelete, char[] separators)
{
    Console.WriteLine(text);
    foreach (string word in wordsToDelete)
    {
        foreach(char separator in separators)
        {
            text = text.Replace(word + separator, String.Empty);
        }
    }
    Console.WriteLine("-------------------------------------------");
    Console.WriteLine(text);
}
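Note that the nested loops above remove a word only when it is directly followed by a separator, and then only that one separator. If a loop-based approach should also consume whole runs of separators and match whole words only, one possible variant is a single scan over the text. This is a sketch under stated assumptions, not the answerer's code: the class LoopWordRemover is a hypothetical name, and a "word" is assumed to be a run of letters or digits.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical regex-free variant for illustration; not part of the original answer.
public static class LoopWordRemover
{
    public static string Clean(string text, string[] wordsToDelete, char[] separators)
    {
        var words = new HashSet<string>(wordsToDelete);
        var seps = new HashSet<char>(separators);
        var result = new StringBuilder();
        int i = 0;
        while (i < text.Length)
        {
            if (!char.IsLetterOrDigit(text[i]))
            {
                result.Append(text[i]); // copy non-word characters unchanged
                i++;
                continue;
            }
            // Assumption: a word is a maximal run of letters or digits.
            int start = i;
            while (i < text.Length && char.IsLetterOrDigit(text[i])) i++;
            string word = text.Substring(start, i - start);
            if (words.Contains(word))
            {
                // Drop the word and every separator directly after it.
                while (i < text.Length && seps.Contains(text[i])) i++;
            }
            else
            {
                result.Append(word);
            }
        }
        return result.ToString();
    }
}
```

With the word and separator lists from the question, LoopWordRemover.Clean("Thanks, I am kinda. good.", wordsToDelete, separators) returns "I am good.".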
