Split string with regex using a release character and separators

I need to parse an EDI file in which the separators are the + , : and ' signs and the escape (release) character is ? . You first split the data into segments:

var data = "NAD+UC+ABC2378::92++XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 71+Duzce+Seferihisar / IZMIR++35460+TR";

var segments = data.Split('\'');

Then each segment is split into segment data elements by + , and each segment data element is split into component data elements by : .

var dataElements = segments[0].Split('+');

The sample string above is not parsed correctly because of the release character. I have special code that deals with this, but I think it should all be doable using

Regex.Split(data, separator);

I am not familiar with regexes and have not found a way to do this so far. The best I have come up with is

string[] lines = Regex.Split(data, @"[^?]\+");

which consumes the character before each + sign, because [^?] is part of the match:

NA
U
ABC2378::9
+XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 7
Duzc
Seferihisar / IZMI
+3546
TR

The correct result should be:

NAD
UC
ABC2378::92

XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 71
Duzce
Seferihisar / IZMIR

35460
TR

So the question is: is this doable with Regex.Split, and what should the regex separator look like?

I can see that you want to split around plus signs + only if they are not preceded (escaped) by a question mark ? . This can be done using the following:

(?<!\?)\+

This matches a + sign only if it is not immediately preceded by a question mark ? . The lookbehind is zero-width, so the character before the + is not consumed.
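
As a quick check, the lookbehind pattern can be run against the sample data from the question. This is a minimal sketch, assuming .NET 6 or later (it uses top-level statements); the variable names are illustrative and not from the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample segment from the question.
string data = "NAD+UC+ABC2378::92++XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 71+Duzce+Seferihisar / IZMIR++35460+TR";

// Split on every + that is not immediately preceded by the release character ?.
// The lookbehind is zero-width, so no character before the + is consumed.
string[] elements = Regex.Split(data, @"(?<!\?)\+");

// Brackets make the empty data elements visible.
foreach (string element in elements)
{
    Console.WriteLine($"[{element}]");
}
```

The two ++ runs in the data produce empty strings in the array, which correspond to empty data elements. The release characters themselves stay in the text (for example ?+90), so unescaping would be a separate step after splitting.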

Edit: The problem with the previous expression is that it doesn't handle situations like ??+ , ???+ or ????+ ; in other words, it doesn't handle situations where ? characters are used to escape themselves.

We can solve this problem by noticing that if an odd number of ? characters precede a + , the last one is definitely escaping the + , so we must not split. If an even number of ? characters precede a + , they cancel each other out and leave the + unescaped, so we should split around it.
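
The parity rule can be illustrated without a regex. The sketch below uses a hypothetical helper, IsEscaped, that counts the run of ? characters immediately before a position; it is only an illustration of the reasoning (assuming .NET 6 or later for top-level statements), not code from the original answer.

```csharp
using System;

// Hypothetical helper: returns true when the character at `index` is escaped,
// i.e. preceded by an odd number of consecutive release characters (?).
static bool IsEscaped(string text, int index)
{
    int count = 0;
    for (int i = index - 1; i >= 0 && text[i] == '?'; i--)
    {
        count++;
    }
    return count % 2 == 1;
}

Console.WriteLine(IsEscaped("a?+b", 2));   // one ? before the +: escaped
Console.WriteLine(IsEscaped("a??+b", 3));  // two ?s before the +: not escaped
Console.WriteLine(IsEscaped("a???+b", 4)); // three ?s before the +: escaped
```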

From the previous observation, we need an expression that matches a + only if it is preceded by an even number (possibly zero) of question marks ? , and here it is:

(?<!(^|[^?])(\?\?)*\?)\+
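
A small sketch of this expression in use with Regex.Split, assuming .NET 6 or later. The groups are written as non-capturing (?:...) here as a precaution, since Regex.Split adds the text of matched capture groups to its result; the sample string is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Same idea as the expression above, with non-capturing groups (?:...)
// so that Regex.Split has no capture groups whose text it could add to the result.
string pattern = @"(?<!(?:^|[^?])(?:\?\?)*\?)\+";

// "A?+B" : one ? before the +  -> escaped, no split
// "C??+" : two ?s before the + -> the ?s escape each other, split
string sample = "A?+B+C??+D";

string[] parts = Regex.Split(sample, pattern);

foreach (string part in parts)
{
    Console.WriteLine(part);
}
```

This relies on the .NET regex engine accepting variable-length lookbehind; engines that require fixed-width lookbehind would need a different approach.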

string[] lines = Regex.Split(data, @"\+");

Would this meet the requirement?

Here is the edit that handles a '?' escaping the '+':

string[] lines = Regex.Split(data, @"(?<!\?)[\+]+"); 

The '+' at the end matches multiple consecutive occurrences of the separator '+' as a single separator. If you want empty entries between consecutive separators instead, drop it:

string[] lines = Regex.Split(data, @"(?<!\?)[\+]"); 
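
The difference between the two patterns shows up on consecutive separators. A minimal comparison, assuming .NET 6 or later, with a shortened sample taken from the question's data:

```csharp
using System;
using System.Text.RegularExpressions;

string sample = "ABC2378::92++XYZ";

// With the trailing quantifier, "++" is treated as one separator.
string[] collapsed = Regex.Split(sample, @"(?<!\?)[\+]+");

// Without it, each + is a separator, so "++" yields an empty entry.
string[] preserved = Regex.Split(sample, @"(?<!\?)[\+]");

Console.WriteLine(collapsed.Length);
Console.WriteLine(preserved.Length);
```

The expected result in the question contains an empty entry after ABC2378::92, which suggests the second form (without the quantifier) is the one that matches what was asked for.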
