
Split string with regex using a release character and separators

I need to parse an EDI file in which the separators are +, : and ', and the escape (release) character is ?. First the data is split into segments:

var data = "NAD+UC+ABC2378::92++XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 71+Duzce+Seferihisar / IZMIR++35460+TR";

var segments = data.Split('\'');

Then each segment is split into data elements by +, and each data element is split into component data elements by :.

var dataElements = segments[0].Split('+');

The sample string above is not parsed correctly because of the release character. I have special code that deals with this, but I think it should all be doable using

Regex.Split(data, separator);

I am not familiar with regexes and have not found a way to do this. The best I have come up with so far is

string[] lines = Regex.Split(data, @"[^?]\+");

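which also drops the character before each + sign, because [^?] consumes that character as part of the separator, and leaves a + behind wherever two separators are adjacent: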

NA
U
ABC2378::9
+XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 7
Duzc
Seferihisar / IZMI
+3546
TR
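
To see where the missing characters go, it helps to list what the pattern actually matches. The following is only a minimal sketch; the class and method names are made up for illustration, and the input is a shortened piece of the sample string:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConsumedCharDemo
{
    // Returns the text of every match so the consumed characters are visible.
    public static string[] MatchedSeparators(string input)
    {
        var matches = Regex.Matches(input, @"[^?]\+");
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }

    public static void Main()
    {
        // Each match is two characters long: the character before '+' and the '+' itself.
        // Prints "D+", "C+" and "2+"; the second '+' of "++" is never matched.
        foreach (var m in MatchedSeparators("NAD+UC+ABC2378::92++XYZ"))
        {
            Console.WriteLine(m);
        }
    }
}
```

Since Regex.Split removes the whole match, the D, C and 2 disappear along with the + signs.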

The correct result should be (each ++ in the data produces an empty element, marked here as (empty)):

NAD
UC
ABC2378::92
(empty)
XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 71
Duzce
Seferihisar / IZMIR
(empty)
35460
TR

So the question is: is this doable with Regex.Split, and what should the separator regex look like?

I can see that you want to split on plus signs + only if they are not preceded (escaped) by a question mark ?. This can be done with the following expression:

(?<!\?)\+

This matches a single + sign only if it is not immediately preceded by a question mark ?.
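
As a rough illustration of how this expression could be used with Regex.Split, here is a small sketch. The class and method names are hypothetical, and the input is a shortened version of the sample from the question:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReleaseAwareSplit
{
    // Splits on '+' unless the character immediately before it is '?'.
    public static string[] SplitElements(string segment)
    {
        return Regex.Split(segment, @"(?<!\?)\+");
    }

    public static void Main()
    {
        var data = "NAD+UC+ABC2378::92++XYZ Corp.:Tel ?: ?+90 555+Duzce";
        // Prints [NAD], [UC], [ABC2378::92], [], [XYZ Corp.:Tel ?: ?+90 555], [Duzce]
        foreach (var element in SplitElements(data))
        {
            Console.WriteLine("[" + element + "]");
        }
    }
}
```

The release characters stay in the output; removing them would be a separate step after splitting.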

Edit: The problem with the previous expression is that it doesn't handle cases like ??+, ???+ or ????+; in other words, it doesn't handle cases where ? is used to escape itself.

We can solve this by noticing that if an odd number of ? characters precedes a +, the last one is escaping the +, so we must not split there; if an even number of ? characters precedes a +, they cancel each other out in pairs and leave the + unescaped, so we should split on it.

From this observation, we need an expression that matches a + only if it is preceded by an even number (including zero) of question marks ?. Here it is:

(?<!(^|[^?])(\?\?)*\?)\+
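
The odd/even rule can be checked with a few short inputs. This sketch uses the same expression, except that the groups are written as non-capturing (?:...) groups; that is a choice made here, not part of the answer, so that Regex.Split has no capture groups whose text it could add to the result. The class and method names are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EvenReleaseSplit
{
    // Splits on '+' only when it is preceded by an even number (including zero) of '?'.
    // The variable-length lookbehind relies on the .NET regex engine.
    public static string[] SplitElements(string segment)
    {
        return Regex.Split(segment, @"(?<!(?:^|[^?])(?:\?\?)*\?)\+");
    }

    public static void Main()
    {
        Console.WriteLine(SplitElements("A+B").Length);    // 2: no '?', so split
        Console.WriteLine(SplitElements("A?+B").Length);   // 1: '+' is escaped
        Console.WriteLine(SplitElements("A??+B").Length);  // 2: "??" is an escaped '?', '+' is a separator
        Console.WriteLine(SplitElements("A???+B").Length); // 1: escaped '?' followed by escaped '+'
    }
}
```
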

string[] lines = Regex.Split(data, @"\+");

Would this meet the requirement?

Here is the edit to avoid splitting on a '+' that is escaped by '?':

string[] lines = Regex.Split(data, @"(?<!\?)[\+]+"); 

The '+' at the end of the pattern matches multiple consecutive occurrences of the separator '+' as a single separator. If you want empty entries for consecutive separators instead, drop it:

string[] lines = Regex.Split(data, @"(?<!\?)[\+]"); 
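
The difference between the two patterns shows up on a ++ sequence like the one in the question's data. A small sketch, with hypothetical class and method names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConsecutiveSeparators
{
    // A run of '+' signs whose first '+' is not preceded by '?' counts as one separator.
    public static string[] SplitCollapsing(string segment)
    {
        return Regex.Split(segment, @"(?<!\?)[\+]+");
    }

    // Every '+' not preceded by '?' is its own separator, so empty elements are kept.
    public static string[] SplitKeepingEmpty(string segment)
    {
        return Regex.Split(segment, @"(?<!\?)[\+]");
    }

    public static void Main()
    {
        var sample = "ABC2378::92++XYZ Corp.";
        Console.WriteLine(SplitCollapsing(sample).Length);   // 2
        Console.WriteLine(SplitKeepingEmpty(sample).Length); // 3
    }
}
```

The expected result in the question has an empty element between ABC2378::92 and XYZ Corp., which corresponds to the second variant.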
