C# - Remove characters with regex

I have a text file, and I need to remove some stray delimiters that appear inside quoted fields. The text file looks like this:

string text = @"1|'Nguyen Van| A'|'Nguyen Van A'|39
                2|'Nguyen Van B'|'Nguyen| Van B'|39";
string result = @"1|'Nguyen Van A'|'Nguyen Van A'|39
                  2|'Nguyen Van B'|'Nguyen Van B'|39";

I want to remove the "|" character inside the strings 'Nguyen Van| A' and 'Nguyen| Van B'.

So I think the best way is to do a Regex replace? Can anyone help me with this regex?

Thanks

The regex should be:

(?<=^[^']*'([^']*'[^']*')*[^']*)\|

It has to be used with RegexOptions.Multiline, so:

var rx = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|", RegexOptions.Multiline);

string text = @"1|'Nguyen Van| A'|'Nguyen Van A'|39
2|'Nguyen Van B'|'Nguyen| Van B'|39";

string replaced = rx.Replace(text, string.Empty);

Example: http://ideone.com/PTdsg5

I strongly advise against using it, though. To see why, try to comprehend the regular expression. If you can comprehend it, then you can use it :-)
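For anyone who does want to work through the pattern, here is the same regex spread over several lines with RegexOptions.IgnorePatternWhitespace so that each part can carry a comment. This is only a reading aid, a sketch: the class name AnnotatedPipeRegex and the helper Clean are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class, used only to present the annotated pattern.
public static class AnnotatedPipeRegex
{
    // Same pattern as above. IgnorePatternWhitespace lets the pattern span
    // several lines and allows # comments inside it.
    public static readonly Regex Rx = new Regex(@"
        (?<=                     # lookbehind: the text before the pipe must match this
            ^ [^']* '            # line start, then everything up to and including the first quote
            ( [^']* ' [^']* ' )* # then any number of further quote pairs
            [^']*                # then only non-quote characters up to the pipe
        )                        # net effect: an odd number of quotes precedes the pipe
        \|                       # the pipe itself is the only thing that gets matched
        ",
        RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

    // Removes every pipe that sits inside a quoted field.
    public static string Clean(string text) => Rx.Replace(text, string.Empty);
}
```

Read this way, the lookbehind is just a parity check: one quote, plus zero or more pairs of quotes, means the pipe sits inside an open quoted field. Calling AnnotatedPipeRegex.Clean on the two sample lines should return them with only the inner pipes removed.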

I would write a simple state machine that counts the ' characters and drops each | it meets while the count is odd.

You mentioned that the multiline regex is taking too long and asked about the state-machine approach. Here is some code that uses a function to perform the operation (the function could probably use a little cleaning, but it shows the idea and runs faster than the regex). In my testing, using the regex without the Multiline option, I could process 1,000,000 lines (in memory, not writing to a file) in about 34 seconds. Using the state-machine approach, it took about 4 seconds.

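// Requires: using System.Collections.Generic; using System.IO; using System.Linq;
string RemoveInternalPipe(string line)
{
    // Number of single quotes seen so far; an odd count means we are inside a quoted field.
    int count = 0;
    var temp = new List<char>(line.Length);
    foreach (var c in line)
    {
        if (c == '\'')
        {
            ++count;
        }
        // Skip a pipe that appears inside a quoted field; keep every other character.
        if (c == '|' && count % 2 != 0) continue;
        temp.Add(c);
    }
    return new string(temp.ToArray());
}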

File.WriteAllLines(@"yourOutputFile",
    File.ReadLines(@"yourInputFile").Select(x => RemoveInternalPipe(x)));

To compare the performance against the Regex version (without the multiline option), you could run this code:

var regex = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|");
File.WriteAllLines(@"yourOutputFile",
    File.ReadLines(@"yourInputFile").Select(x => regex.Replace(x, string.Empty)));
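If you would rather time the two approaches in memory, without file I/O, a rough harness along these lines might help. It is a sketch under assumptions: the class PipeCleanupComparison, its method names, and the generated sample lines are all invented here, and the state-machine variant uses a StringBuilder and a boolean flag instead of the List<char> and counter shown above. It uses Stopwatch for timing and first checks that both implementations agree.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical comparison harness; all names here are illustrative.
public static class PipeCleanupComparison
{
    // Regex from the answer above, applied per line (no Multiline option).
    static readonly Regex Rx = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|");

    public static string WithRegex(string line) => Rx.Replace(line, string.Empty);

    // State-machine variant: drop '|' while we are inside a quoted field.
    public static string WithStateMachine(string line)
    {
        bool insideQuotes = false;
        var sb = new StringBuilder(line.Length);
        foreach (char c in line)
        {
            if (c == '\'') insideQuotes = !insideQuotes;
            if (c == '|' && insideQuotes) continue;
            sb.Append(c);
        }
        return sb.ToString();
    }

    // Runs one implementation over all lines and returns the elapsed milliseconds.
    public static long TimeMs(Func<string, string> clean, string[] lines)
    {
        var sw = Stopwatch.StartNew();
        foreach (var line in lines) clean(line);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Builds sample lines in memory, checks that both versions agree, prints timings.
    public static void Run(int lineCount)
    {
        string[] lines = Enumerable.Range(1, lineCount)
            .Select(i => i + "|'Nguyen Van| A'|'Nguyen| Van B'|39")
            .ToArray();

        // Sanity check first: both approaches must produce identical output.
        bool same = lines.Select(l => WithRegex(l))
            .SequenceEqual(lines.Select(l => WithStateMachine(l)));
        if (!same) throw new InvalidOperationException("Implementations disagree");

        Console.WriteLine($"regex:         {TimeMs(WithRegex, lines)} ms");
        Console.WriteLine($"state machine: {TimeMs(WithStateMachine, lines)} ms");
    }
}
```

Calling PipeCleanupComparison.Run(1000000) would mirror the line count mentioned earlier; the actual timings depend on the machine and runtime, so treat any numbers as indicative only.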
