
Split regex with multi char delimiters

I'm struggling to find the correct way to split a string in Java on delimiters made of multiple characters (e.g. '. [1a]' or '.(2b)').

Here's a test case:

String str1 = "This is test 1  .  This is test 2  [2 b]. This is test 3 (3). This is test 4.[4a] This is a test 5 . This is test 6 . (6,six)";

Pattern regex = Pattern.compile("\\.\\s{0,}\\[.*\\]\\s{0,}|\\.\\s{0,}\\(.*\\)\\s{0,}|\\.\\s{0}");

System.out.println(Arrays.toString(regex.split(str1)));

The output that I'm aiming for is the following (spaces in the beginning or end of each sub-string are fine, the important thing is to keep the delimiter):

[This is test 1 . , This is test 2 [2 b]. , This is test 3 (3). , This is test 4.[4a] , This is a test 5 . , This is test 6 . (6,six)]

However, this is the output I'm getting:

[This is test 1 , This is test 2 [2 b], This is test 3 (3), This is test 4, This is a test 5 , This is test 6 ]

I also tried dropping the "\\s", a different notation for spaces like Pattern.compile("\\s+\\[.?\\]\\s+\\.|\\s+\\(.?\\)\\s+\\.|\\.\\s+"), and experimented with lookarounds like Pattern.compile("(?<=[.[*]\\s+])|(?=[.(*)]\\s+)|\\."), but none of these helped :|

This might be a bit tricky. Focus on what the wanted groups have in common: each group ends where the next one begins, and the next one always starts with a word character (\w), so use that to detect a new group.

Take advantage of this by replacing that character with itself preceded by a newline (replacement \n$1). Each group then appears on its own line, which is fairly easy to extract. The wanted regex (see Regex101) is:

 (?<!\w )(\w)(?=\w{2,})
  • Mind the single space at the very start of the regex!

This would produce an output as:

This is test 1  . 
This is test 2  [2 b].
This is test 3 (3).
This is test 4.[4a]
This is a test 5 .
This is test 6 . (6,six)

In Java, the code would be using the methods replaceAll and split (thanks @jmng for the improvement):

String str1 = "This is test 1  .  This is test 2  [2 b]. This is test 3 (3). This is test 4.[4a] This is a test 5 . This is test 6 . (6,six)";

Pattern reg1 = Pattern.compile(" (?<!\\w )(\\w)(?=\\w{2,})");              // Preparation
Pattern regNewline = Pattern.compile("\n");                                // Split
String[] array = regNewline.split(reg1.matcher(str1).replaceAll("\n$1"));  // Apply


Arrays.stream(array).forEach(System.out::println);                         // Test it
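For anyone who wants to run this as-is, here is a minimal self-contained sketch of the same replace-then-split idea. The class name NewlineSplitSketch and the method name splitGroups are hypothetical, chosen only for illustration, and it assumes the input never contains a newline of its own.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class; the two patterns are the ones from the answer above.
class NewlineSplitSketch {
    // A space not preceded by a word character, followed by a word of 3+ characters.
    private static final Pattern GROUP_START = Pattern.compile(" (?<!\\w )(\\w)(?=\\w{2,})");
    private static final Pattern NEWLINE = Pattern.compile("\n");

    // Marks the start of each group with a newline, then splits on that newline.
    // Assumption: the input itself contains no newline characters.
    static String[] splitGroups(String input) {
        String marked = GROUP_START.matcher(input).replaceAll("\n$1");
        return NEWLINE.split(marked);
    }

    public static void main(String[] args) {
        String str1 = "This is test 1  .  This is test 2  [2 b]. This is test 3 (3). "
                + "This is test 4.[4a] This is a test 5 . This is test 6 . (6,six)";
        for (String part : splitGroups(str1)) {
            System.out.println("[" + part + "]"); // brackets make stray spaces visible
        }
    }
}
```

Note that this relies on every new group starting with a word of at least three characters, which holds for the test string but may not hold for other data.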

Another possibility, if spaces at the beginning or end of each sub-string are acceptable and you want to use split, is an alternation inside a positive lookbehind that checks for each of your delimiter forms.

In Java, a lookbehind needs a determinable minimum and maximum length, so use a bounded quantifier rather than + inside it; for your example data a maximum of 10 would do.

(?<=\[[^]]{1,10}]\.|\.\[[^]]{1,10}]|\([^)]{1,10}\)\.| \. (?!\([^)]+\)))

In Java (as a string literal, with the backslashes doubled):

(?<=\\[[^]]{1,10}]\\.|\\.\\[[^]]{1,10}]|\\([^)]{1,10}\\)\\.| \\. (?!\\([^)]+\\)))

Explanation

  • (?<= Positive lookbehind, asserting that what is directly to the left is
    • \[[^]]{1,10}]\. An opening square bracket, 1 to 10 characters that are not a closing bracket (negated character class), a closing bracket, then a dot
    • | Or
    • \.\[[^]]{1,10}] A dot, then an opening square bracket, 1 to 10 characters that are not a closing bracket, and a closing bracket
    • | Or
    • \([^)]{1,10}\)\. An opening parenthesis, 1 to 10 characters that are not a closing parenthesis, a closing parenthesis, then a dot
    • | Or
    • \. (?!\([^)]+\)) preceded by a single space: a space, a dot and a space, provided that what follows is not a parenthesized group
  • ) Close positive lookbehind

Java demo
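A self-contained sketch of the lookbehind split described above, in case the demo link is unavailable. The class name LookbehindSplitSketch and the method name splitKeepingDelimiters are hypothetical; the bound of 10 is the value assumed in the answer for the example data and would need adjusting for longer bracketed tags.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical wrapper class around the lookbehind pattern from the answer above.
class LookbehindSplitSketch {
    // Zero-width match: nothing is consumed, so every delimiter stays in the output.
    private static final Pattern BOUNDARY = Pattern.compile(
            "(?<=\\[[^]]{1,10}]\\."       // "[...]." is directly to the left
            + "|\\.\\[[^]]{1,10}]"         // or ".[...]"
            + "|\\([^)]{1,10}\\)\\."       // or "(...)."
            + "| \\. (?!\\([^)]+\\)))");   // or " . " not followed by "(...)"

    static String[] splitKeepingDelimiters(String input) {
        return BOUNDARY.split(input);
    }

    public static void main(String[] args) {
        String str1 = "This is test 1  .  This is test 2  [2 b]. This is test 3 (3). "
                + "This is test 4.[4a] This is a test 5 . This is test 6 . (6,six)";
        // Sub-strings may carry leading or trailing spaces; trim them if needed.
        System.out.println(Arrays.toString(splitKeepingDelimiters(str1)));
    }
}
```

Because the split points are zero-width, the pieces keep their surrounding spaces; calling trim() on each element is one way to tidy them up.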
