
C# - Remove characters with regex

I have a text file, and I need to remove some stray delimiters from it. The text looks like this (text is the input, result is the output I want):

string text = @"1|'Nguyen Van| A'|'Nguyen Van A'|39
                2|'Nguyen Van B'|'Nguyen| Van B'|39";
string result = @"1|'Nguyen Van A'|'Nguyen Van A'|39
                  2|'Nguyen Van B'|'Nguyen Van B'|39";

I want to remove the "|" characters that appear inside the quoted strings 'Nguyen Van| A' and 'Nguyen| Van B'.

Is a Regex replace the best way to do this? Can anyone help me with the regex?

Thanks

The regex should be:

(?<=^[^']*'([^']*'[^']*')*[^']*)\|

It has to be used with RegexOptions.Multiline, so:

var rx = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|", RegexOptions.Multiline);

string text = @"1|'Nguyen Van| A'|'Nguyen Van A'|39
2|'Nguyen Van B'|'Nguyen| Van B'|39";

string replaced = rx.Replace(text, string.Empty);

Example: http://ideone.com/PTdsg5

I strongly advise against using it, though. To see why, try to read the regular expression. If you can understand it, then go ahead and use it :-)
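For anyone who wants to take up that challenge, here is a sketch of the same pattern spread over several lines with RegexOptions.IgnorePatternWhitespace, so that each piece can carry a comment. It relies on .NET's support for variable-length lookbehind: a | is matched only when an odd number of ' characters precede it on the line. The variable names (commented, sample, cleaned) are only illustrative, and the sketch assumes the quotes on every line are balanced.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as above, written in "extended" form so it can be annotated.
var commented = new Regex(@"
    (?<=                  # lookbehind: the text from the line start up to the pipe must be
        ^[^']*'           #   non-quote characters, then one quote, so the count is odd
        ([^']*'[^']*')*   #   then zero or more further pairs of quotes, keeping it odd
        [^']*             #   then non-quote characters right up to the pipe
    )
    \|                    # the pipe itself is the only text that is matched and removed
", RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

// The two sample lines from the question, joined with a newline.
string sample = "1|'Nguyen Van| A'|'Nguyen Van A'|39\n2|'Nguyen Van B'|'Nguyen| Van B'|39";

string cleaned = commented.Replace(sample, string.Empty);
Console.WriteLine(cleaned);
```

On the sample input this prints the two lines with only the pipes inside the quotes removed; the field-separating pipes are left alone.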

Instead, I would write a simple state machine that counts the ' characters and drops a | whenever the count so far is odd (that is, whenever the | sits inside a quoted field).

You mentioned that the multiline regex is taking too long and asked about the state-machine approach, so here is some code that uses a function to perform the operation. The function could probably use a little cleanup, but it shows the idea and runs faster than the regex. In my testing, I could process 1,000,000 lines (in memory, not writing to a file) in about 34 seconds with the regex (without the Multiline option, since each line is processed on its own), versus about 4 seconds with the state-machine approach.

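// Removes every '|' that appears between an opening and a closing single quote.
// Requires: using System.Collections.Generic;
string RemoveInternalPipe(string line)
{
    int count = 0; // number of single quotes seen so far
    var temp = new List<char>(line.Length);
    foreach (var c in line)
    {
        if (c == '\'')
        {
            ++count;
        }
        // An odd count means we are inside a quoted field, so skip the pipe.
        if (c == '|' && count % 2 != 0) continue;
        temp.Add(c);
    }
    return new string(temp.ToArray());
}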

File.WriteAllLines(@"yourOutputFile",
    File.ReadLines(@"yourInputFile").Select(x => RemoveInternalPipe(x)));

To compare the performance against the Regex version (without the multiline option), you could run this code:

var regex = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|");
File.WriteAllLines(@"yourOutputFile",
    File.ReadLines(@"yourInputFile").Select(x => regex.Replace(x, string.Empty)));
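To time the two approaches in memory rather than through files, something along these lines could be used. This is only a sketch: the test data (the two sample lines from the question, repeated) and the lineCount value are assumptions, and the timings will depend on the machine and runtime, so no expected figures are given. RemoveInternalPipe is repeated at the bottom so the snippet is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical test data: the two sample lines, alternated lineCount times.
const int lineCount = 100_000;
string[] samples =
{
    "1|'Nguyen Van| A'|'Nguyen Van A'|39",
    "2|'Nguyen Van B'|'Nguyen| Van B'|39",
};
List<string> lines = Enumerable.Range(0, lineCount).Select(i => samples[i % 2]).ToList();

// Regex version, applied line by line (no Multiline option needed).
var regex = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|");
var sw = Stopwatch.StartNew();
List<string> viaRegex = lines.Select(x => regex.Replace(x, string.Empty)).ToList();
sw.Stop();
Console.WriteLine($"Regex:         {sw.ElapsedMilliseconds} ms");

// State-machine version.
sw.Restart();
List<string> viaStateMachine = lines.Select(RemoveInternalPipe).ToList();
sw.Stop();
Console.WriteLine($"State machine: {sw.ElapsedMilliseconds} ms");

// Sanity check: both approaches should produce the same lines.
Console.WriteLine($"Same output:   {viaRegex.SequenceEqual(viaStateMachine)}");

// Drops every '|' seen while the number of single quotes so far is odd.
static string RemoveInternalPipe(string line)
{
    int count = 0;
    var temp = new List<char>(line.Length);
    foreach (var c in line)
    {
        if (c == '\'') ++count;
        if (c == '|' && count % 2 != 0) continue;
        temp.Add(c);
    }
    return new string(temp.ToArray());
}
```

The ToList() calls force the lazy Select to run inside the timed region; without them the stopwatch would only measure building the query.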


 