
Splitting string in java: lookbehind with specified length

I want to split a string after the letter "K" or "L", except when either is followed by the letter "P". In addition, I do not want to split at a location if doing so would produce a substring shorter than 4 characters. For example:

- Input:
AYLAKPHKKDIV

- Expected Output
AYLAKPHK
KDIV

So far, I have managed to split the string after the letter "K" or "L" except when either is followed by the letter "P". My regular expression is (?<=[K|R])(?!P) .

My result:
AYLAKPHK
K
DIV

However, I don't know how to ignore split locations that would produce a substring shorter than 4 characters.

My Demo

I hope not to split if the substring length less than 4

In other words, you want to have

  1. the previous match (split) separated from the current match by at least 4 characters, so ABCKABKKABCD would split into ABCK|ABKK|ABCD but not into ABCK|ABK|...

  2. at least 4 characters after the current split, since splitting ABCKAB into ABCK|AB would leave AB at the end, whose length is less than 4.

To achieve the first condition you can use \\G, which represents the place of the previous match (or the start of the string if there were no matches yet). So the first condition can look like (?<=\\G.{4,}) (WARNING: look-behind usually expects an obvious maximum length for the subregex it handles, but for some reason .{4,} works here, which may be a bug or a feature added in Java 10, which I am using now. In case it complains, you can use some very big number that is larger than the maximum number of characters you expect between two splits, like .{4,10000000} ).

The second condition is simpler, since it is just (?=.{4}) .

BTW you don't want | in [K|R], since there it represents a literal | rather than the OR operator: by default every character in a character class is already an alternative. So [K|R] represents K OR | OR R . Use [KR] instead.
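The difference between [K|R] and [KR] is easy to see on a string that contains a literal pipe. The following is a minimal sketch; the class name, the helper splitAfter, and the sample text "A|BKC" are made up for illustration and are not part of the question.

```java
import java.util.Arrays;

public class CharClassPipeDemo {
    // Hypothetical helper: splits after every character matched by the given character class.
    static String[] splitAfter(String charClass, String text) {
        return text.split("(?<=" + charClass + ")");
    }

    public static void main(String[] args) {
        String text = "A|BKC"; // made-up sample containing a literal pipe

        // [K|R] also treats '|' as a member of the class, so it splits after the pipe too.
        System.out.println(Arrays.toString(splitAfter("[K|R]", text))); // [A|, BK, C]

        // [KR] splits only after K or R.
        System.out.println(Arrays.toString(splitAfter("[KR]", text)));  // [A|BK, C]
    }
}
```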

DEMO:

String text = "AYLAKPHKKKKKKDIVK123KAB";
String regex = "(?<=[KR])(?!P)(?<=\\G.{4,})(?=.{4})";
for (String s : text.split(regex)){
    System.out.println("'"+s+"'");
}

Output:

'AYLAKPHK'
'KKKK'
'KDIVK'
'123KAB'
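For a version that should also compile on Java releases that reject an unbounded quantifier inside look-behind, here is a sketch using the bounded fallback mentioned above. The upper limit of 1000 is an assumption about the longest chunk you expect, and the class and method names are made up for illustration. It is run on the question's input and on the two examples from the numbered list.

```java
import java.util.Arrays;
import java.util.List;

public class SplitWithMinLength {
    // Bounded stand-in for ".{4,}"; 1000 is an assumed upper limit on the chunk length.
    private static final String REGEX = "(?<=[KR])(?!P)(?<=\\G.{4,1000})(?=.{4})";

    // Hypothetical helper: splits after K or R (not followed by P),
    // keeping every chunk at least 4 characters long.
    static List<String> splitChunks(String text) {
        return Arrays.asList(text.split(REGEX));
    }

    public static void main(String[] args) {
        System.out.println(splitChunks("AYLAKPHKKDIV")); // [AYLAKPHK, KDIV]
        System.out.println(splitChunks("ABCKABKKABCD")); // [ABCK, ABKK, ABCD]
        System.out.println(splitChunks("ABCKAB"));       // [ABCKAB]
    }
}
```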

You could use a Matcher to match each substring, rather than split, if possible. You might find the logic a bit easier to follow when you can consume characters rather than having to identify a particular position. Match three or more characters followed by a K or R that is not followed by P with .{3,}?[KR](?!P) , ensure that it's followed by at least 4 characters with (?=.{4}) , OR, if the whole pattern above fails, match the rest of the string with .+$ :

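import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "AYLAKPHKKDIV";
List<String> arr = new ArrayList<>();
// Each find() consumes one chunk; the .+$ alternative picks up whatever is left
Matcher m = Pattern.compile(".{3,}?[KR](?!P)(?=.{4})|.+$").matcher(s);
while (m.find()) {
  arr.add(m.group());
}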
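Wrapped into a small runnable class, the Matcher approach can be checked against both sample inputs used on this page. This is only a sketch; the class name MatchChunks and the method chunks are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchChunks {
    // Lazily take 3+ characters, then a K or R not followed by P, with at least
    // 4 characters remaining; otherwise consume the rest of the string.
    private static final Pattern CHUNK =
            Pattern.compile(".{3,}?[KR](?!P)(?=.{4})|.+$");

    // Hypothetical helper: collects every chunk matched by CHUNK, in order.
    static List<String> chunks(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = CHUNK.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(chunks("AYLAKPHKKDIV"));            // [AYLAKPHK, KDIV]
        System.out.println(chunks("AYLAKPHKKKKKKDIVK123KAB")); // [AYLAKPHK, KKKK, KDIVK, 123KAB]
    }
}
```

On these two inputs the result is the same as with the split-based regex.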
