Remove words from text together with the separators next to them (using Regex)

Question

I need to remove words from a text together with the separators next to them. I have already removed the words, but I don't know how to remove the separators at the same time. Any suggestions?
At the moment I have:

static void Main(string[] args)
{
    Program p = new Program();
    string text = "";
    text = p.ReadText("Duomenys.txt", text);
    string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
    char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
    p.DeleteWordsFromText(text, wordsToDelete, separators);
}

public string ReadText(string file, string text)
{
    text = File.ReadAllText(file);
    return text;
}

public void DeleteWordsFromText(string text, string[] wordsToDelete, char[] separators)
{
    Console.WriteLine(text);
    for (int i = 0; i < wordsToDelete.Length; i++)
    {
        text = Regex.Replace(text, wordsToDelete[i], String.Empty);
    }
    Console.WriteLine("-------------------------------------------");
    Console.WriteLine(text);
}

The results should be:

how are you?
I am  good.

I have:

, how are you?
, I am . good.

Duomenys.txt:

Hello, how are you? 
Thanks, I am kinda. good.

Answer 1

You can build the regex as follows:

var regex = new Regex(@"\b(" 
    + string.Join("|", wordsToDelete.Select(Regex.Escape)) + ")(" 
    + string.Join("|", separators.Select(c => Regex.Escape(new string(c, 1)))) + ")?");

Explanation:

the \b at the start matches a word boundary, so the word is not removed from inside a longer token such as "XYZThanks"
the next part builds an alternation matching any of the wordsToDelete
the last part builds an alternation matching any of the separators; the trailing "?" makes the separator optional, because you said you also want to remove the word when no separator follows
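
To see how the three parts fit together, here is a minimal sketch that wraps the regex above in a method and returns the result. The class name SeparatorAwareRemover and the method name RemoveWords are illustrative names, not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SeparatorAwareRemover
{
    // Hypothetical helper: removes each listed word plus at most ONE separator after it.
    public static string RemoveWords(string text, string[] wordsToDelete, char[] separators)
    {
        // Alternation of the escaped words, e.g. Hello|Thanks|kinda
        string words = string.Join("|", wordsToDelete.Select(Regex.Escape));
        // Alternation of the escaped separators, each as a one-character string
        string seps = string.Join("|", separators.Select(c => Regex.Escape(c.ToString())));

        var regex = new Regex(@"\b(" + words + ")(" + seps + ")?");
        return regex.Replace(text, string.Empty);
    }
}
```

Called with the arrays from the question, RemoveWords("Hello, how are you?", wordsToDelete, separators) returns " how are you?". The optional group consumes only the comma, so the space that followed it stays in place.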

Answer 2

You may build a regex like

\b(?:Hello|Thanks|kinda)\b[ .,!?:;()    ]*

where \b(?:Hello|Thanks|kinda)\b matches any of the words to delete as whole words, and [ .,!?:;() ]* matches your separators zero or more times after the word.

The C# solution:

char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
string SepPattern = new String(separators).Replace(@"\", @"\\").Replace("^", @"\^").Replace("-", @"\-").Replace("]", @"\]");
var pattern = $@"\b(?:{string.Join("|", wordsToDelete.Select(Regex.Escape))})\b[{SepPattern}]*";
// => \b(?:Hello|Thanks|kinda)\b[ .,!?:;()  ]*
Regex rx = new Regex(pattern, RegexOptions.Compiled);
// RegexOptions.IgnoreCase can be added to the above flags for case insensitive matching: RegexOptions.IgnoreCase | RegexOptions.Compiled
DeleteWordsFromText("Hello, how are you?", rx);
DeleteWordsFromText("Thanks, I am kinda. good.", rx);

Here is the DeleteWordsFromText method:

public static void DeleteWordsFromText(string text, Regex p)
{
    Console.WriteLine($"---- {text} ----");
    Console.WriteLine(p.Replace(text, ""));
}

Output:

---- Hello, how are you? ----
how are you?
---- Thanks, I am kinda. good. ----
I am good.

Notes:

string SepPattern = new String(separators).Replace(@"\", @"\\").Replace("^", @"\^").Replace("-", @"\-").Replace("]", @"\]"); - this is the separator pattern that goes inside a character class. Since only the ^, -, \ and ] chars require escaping inside a character class, only these chars are escaped.
$@"\b(?:{string.Join("|", wordsToDelete.Select(Regex.Escape))})\b" - this builds the alternation from the words to delete and only matches them as whole words.
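
The escaping step only matters once the separator list contains one of those four characters. The sketch below shows this with a hypothetical separator set that includes - and ]; the names CharClassDemo, EscapeForCharClass and RemoveWords are assumptions made for illustration, not part of the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Escapes only the characters that are special inside [...]: \ ^ - ]
    // The backslash is handled first so later escapes are not doubled.
    public static string EscapeForCharClass(char[] separators) =>
        new string(separators)
            .Replace(@"\", @"\\")
            .Replace("^", @"\^")
            .Replace("-", @"\-")
            .Replace("]", @"\]");

    // Removes each whole word together with any run of separators that follows it.
    public static string RemoveWords(string text, string[] wordsToDelete, char[] separators)
    {
        string words = string.Join("|", wordsToDelete.Select(Regex.Escape));
        string pattern = @"\b(?:" + words + @")\b[" + EscapeForCharClass(separators) + "]*";
        return Regex.Replace(text, pattern, string.Empty);
    }
}
```

For example, with separators { ' ', '-', ']' } the call RemoveWords("Hello-] world", new[] { "Hello" }, separators) returns "world". Without the escaping, the unescaped ] would close the character class early and change the meaning of the pattern.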

Pattern details

\b - word boundary
(?: - start of a non-capturing group:
- Hello - the word Hello
- | - or
- Thanks - the word Thanks
- | - or
- kinda - the word kinda
) - end of the group
\b - word boundary
[ .,!?:;() ]* - zero or more characters from the character class.

See the regex demo.

Answer 3

I would not use Regex. Three months from now you will no longer understand the regex, and fixing bugs will be hard.

I would use simple loops. Everyone will understand them:

public void DeleteWordsFromText(string text, string[] wordsToDelete, char[] separators)
{
    Console.WriteLine(text);
    foreach (string word in wordsToDelete)
    {
        foreach(char separator in separators)
        {
            text = text.Replace(word + separator, String.Empty);
        }
    }
    Console.WriteLine("-------------------------------------------");
    Console.WriteLine(text);
}
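
To make the behaviour of the loop version easy to check, here is a small sketch of the same idea that returns the text instead of printing it. The class name LoopRemover and the method name RemoveWords are hypothetical.

```csharp
using System;

public static class LoopRemover
{
    // Same nested loops as above, but returning the result.
    // string.Replace is a plain substring replacement: it only removes a word
    // when one of the separators directly follows it, and it is not limited
    // to whole words.
    public static string RemoveWords(string text, string[] wordsToDelete, char[] separators)
    {
        foreach (string word in wordsToDelete)
        {
            foreach (char separator in separators)
            {
                text = text.Replace(word + separator, string.Empty);
            }
        }
        return text;
    }
}
```

With the arrays from the question, "Thanks, I am kinda. good." becomes " I am  good.": one separator is removed with each word, and the spaces around it remain. A word at the very end of the text, with no separator after it, is left untouched.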

Answer 1 (score 2, posted 2018-11-17 12:30:04)

Answer 2 (score 2, accepted, posted 2018-11-17 13:10:33)

Answer 3 (score 1, posted 2018-11-17 12:37:25)
