C# - Remove characters with regex
I have a text file, and I need to remove some stray delimiters from it. The text file looks like this:
string text = @"1|'Nguyen Van| A'|'Nguyen Van A'|39
2|'Nguyen Van B'|'Nguyen| Van B'|39";
string result = @"1|'Nguyen Van A'|'Nguyen Van A'|39
2|'Nguyen Van B'|'Nguyen Van B'|39";
I want to remove the "|" character inside the quoted strings "Nguyen Van| A" and "Nguyen| Van B".
So I think the best way is to do a Regex replace? Can anyone help me with this regex?
Thanks
The regex should be:
(?<=^[^']*'([^']*'[^']*')*[^']*)\|
to be used with the Multiline option, so:
var rx = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|", RegexOptions.Multiline);
string text = @"1|'Nguyen Van| A'|'Nguyen Van A'|39
2|'Nguyen Van B'|'Nguyen| Van B'|39";
string replaced = rx.Replace(text, string.Empty);
Example: http://ideone.com/PTdsg5
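For readers who want to follow what the pattern does, here is a sketch of the same expression spread over several lines with RegexOptions.IgnorePatternWhitespace, which lets whitespace and # comments appear inside the pattern. The class name QuotedPipePattern and the helper Clean are made up for illustration; the pattern itself is unchanged.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper; only the pattern comes from the answer above.
public static class QuotedPipePattern
{
    public static readonly Regex Rx = new Regex(@"
        (?<=                  # look behind the pipe: the text before it must be
            ^[^']*'           #   start of line, unquoted text, one opening quote
            ([^']*'[^']*')*   #   then zero or more closing/opening quote pairs
            [^']*             #   then text with no further quote
        )                     # so an odd number of quotes precedes the pipe
        \|                    # the pipe to remove
        ", RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

    // Removes every pipe that has an odd number of quotes before it on its line.
    public static string Clean(string text) => Rx.Replace(text, string.Empty);
}
```

Calling QuotedPipePattern.Clean(text) should behave like rx.Replace(text, string.Empty) above. As with the original pattern, it assumes quotes are balanced on each line and never escaped inside a field.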
I strongly suggest against using it. To explain why: try to comprehend the regular expression. If you can comprehend it, then you can use it :-)
I would write a simple state machine that counts the ' characters and removes each | it meets while the count of ' is odd.
You mentioned that the multiline regex is taking too long and asked about the state machine approach. So here is some code using a function to perform the operation (note, the function could probably use a little cleaning, but it shows the idea and works faster than the regex). In my testing, using the regex without multiline, I could process 1,000,000 lines (in memory, not writing to a file) in about 34 seconds. Using the state-machine approach it was about 4 seconds.
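// Needs System.Collections.Generic here, plus System.IO and System.Linq for the file code below.
string RemoveInternalPipe(string line)
{
    int count = 0; // number of ' characters seen so far on this line
    var temp = new List<char>(line.Length);
    foreach (var c in line)
    {
        if (c == '\'')
        {
            ++count;
        }
        // An odd count means we are inside a quoted field: drop the pipe.
        if (c == '|' && count % 2 != 0) continue;
        temp.Add(c);
    }
    return new string(temp.ToArray());
}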
File.WriteAllLines(@"yourOutputFile",
File.ReadLines(@"yourInputFile").Select(x => RemoveInternalPipe(x)));
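Since the function is described as needing a little cleaning, here is one possible tidier variant, as a sketch only: it tracks an inside-quotes flag instead of a counter and builds the result with a StringBuilder. The names PipeCleaner and RemoveQuotedPipes are invented for this example, and it makes the same assumption as the original, namely that a ' always opens or closes a field and is never escaped.

```csharp
using System.Text;

// Hypothetical helper; not part of the original answer.
public static class PipeCleaner
{
    public static string RemoveQuotedPipes(string line)
    {
        var sb = new StringBuilder(line.Length);
        bool insideQuotes = false; // the single bit of state the machine needs

        foreach (char c in line)
        {
            if (c == '\'')
            {
                insideQuotes = !insideQuotes; // every quote toggles the state
            }
            else if (c == '|' && insideQuotes)
            {
                continue; // skip pipes that sit inside a quoted field
            }
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

It can be dropped into the file-processing code in place of RemoveInternalPipe, for example Select(PipeCleaner.RemoveQuotedPipes). No timing was measured for this variant.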
To compare the performance against the Regex version (without the multiline option), you could run this code:
var regex = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|");
File.WriteAllLines(@"yourOutputFile",
    File.ReadLines(@"yourInputFile").Select(x => regex.Replace(x, string.Empty)));
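The answer does not show how the in-memory timings were taken. One way to measure such a comparison, sketched here with a Stopwatch, is to run each transform over a list of lines already held in memory. TransformTimer and its Time method are hypothetical names, and the numbers you get will depend on your machine and data.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical timing helper; not the harness used for the figures quoted above.
public static class TransformTimer
{
    // Applies transform to every line and reports the elapsed time.
    // The total output length is returned so the work cannot be skipped.
    public static (TimeSpan Elapsed, long OutputChars) Time(
        Func<string, string> transform, IReadOnlyList<string> lines)
    {
        long outputChars = 0;
        var stopwatch = Stopwatch.StartNew();
        foreach (var line in lines)
        {
            outputChars += transform(line).Length;
        }
        stopwatch.Stop();
        return (stopwatch.Elapsed, outputChars);
    }
}
```

With the lines loaded into a list, TransformTimer.Time(RemoveInternalPipe, lines) and TransformTimer.Time(x => regex.Replace(x, string.Empty), lines) would give the two figures to compare; the two OutputChars values should match if both approaches remove the same pipes.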