
Split a string in Java using a regex by combining look-ahead and look-behind

I want to split a string in Java using a regular expression, matching with both a look-ahead and a look-behind so that no part of the string is lost.

For example:

"test <img border=\"0\" src=\"test\" />hi<img border=\\\"0\\\" src=\\\"test\\\" /> test3"

Given the string above, the expected output is:

test 
<img border=\"0\" src=\"test\" />
hi
<img border=\"0\" src=\"test\" /> 
 test3"

Here is what I have tried:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestParse {

    private static final String IMG_S_LookBehind = "(?<=\\>)";
    private static final String IMG_S_LookAHead = "(?=<img .*?\\>)";

    static String test = "test <img border=\"0\" src=\"test\" />hi<img border=\\\"0\\\" src=\\\"test\\\" /> test3";

    static Pattern newPattern(String tag) {
        return Pattern.compile(String.format("(<%s\\s*([^>]*)>)(.*)(</%s>)", tag, tag));
    }

    public static void main(String[] args) {
//      Pattern re = newPattern("b");
//      Matcher m = re.matcher(test);
//      
//      if (m.matches()) {
//          for (int i = 0; i <= m.groupCount(); i++) {
//              System.out.printf("[%d]: [%s]\n", i, m.group(i));
//          }
//      }
        String[] split = test.split(IMG_S_LookAHead);
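        // Print each piece on its own line; println(split) would only print the array reference.
        for (String part : split) {
            System.out.println(part);
        }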
    }
}

OUTPUT:

 test 
    <img border=\"0\" src=\"test\" />hi
    <img border=\"0\" src=\"test\" /> test3"

I tried a look-behind too, but it did not give me the expected output either. Any pointers would be appreciated.
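
For what it's worth, the split from the title can probably be made to work by joining the two constants with |, so that the string is cut both before each <img ...> and after each >. This is only a sketch: the class and method names are made up for illustration, it assumes every > in the input closes a tag, and it relies on Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading element.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class name; only the two regex constants come from the attempt above.
class SplitBothWays {

    private static final String IMG_S_LookBehind = "(?<=\\>)";
    private static final String IMG_S_LookAHead = "(?=<img .*?\\>)";

    // Splits before every "<img ...>" and after every ">".
    // Both alternatives are zero-width, so no characters are consumed or lost.
    static List<String> splitAroundTags(String input) {
        return Arrays.asList(input.split(IMG_S_LookAHead + "|" + IMG_S_LookBehind));
    }

    public static void main(String[] args) {
        String test = "test <img border=\"0\" src=\"test\" />hi<img border=\\\"0\\\" src=\\\"test\\\" /> test3";
        for (String part : splitAroundTags(test)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

The brackets in the output are only there to make leading and trailing spaces visible. A > that appears in ordinary text would also act as a split point, which is the main weakness of this approach.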

I wouldn't approach this via a regex split, because it is difficult to express the boundaries between tags and non-tag text as split points. Instead, I would match either a tag or a run of text that is not a tag. Here is a working sample script:

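import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "test <img border=\"0\" src=\"test\" />hi<img border=\\\"0\\\" src=\\\"test\\\" /> test3";
// Either a complete tag, or a run of characters in which no tag begins.
String pattern = "<[^>]+>|((?!<[^>]+>).)*";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(input);
while (m.find()) {
    // Because of the trailing *, the final find() matches the empty string at the
    // end of the input, which prints one blank line after the last piece.
    System.out.println(m.group(0));
}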

This prints:

test 
<img border="0" src="test" />
hi
<img border=\"0\" src=\"test\" />
 test3
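
A self-contained variant of the script above, in case it is more useful to collect the pieces than to print them. The class and method names (TagTokenizer, tokenize) are hypothetical, and the quantifier is changed from * to + as an assumption on my part, so that find() never returns an empty token at the end of the input. The input is assumed to be on a single line, since . does not match line terminators by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are not from the original script.
class TagTokenizer {

    // A complete tag, or one or more characters at which no tag begins.
    // "+" rather than "*" so that an empty match is never produced.
    private static final Pattern TOKEN =
            Pattern.compile("<[^>]+>|((?!<[^>]+>).)+");

    // Returns the tags and the text between them, in order of appearance.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "test <img border=\"0\" src=\"test\" />hi<img border=\\\"0\\\" src=\\\"test\\\" /> test3";
        for (String token : tokenize(input)) {
            // Brackets make leading and trailing spaces visible.
            System.out.println("[" + token + "]");
        }
    }
}
```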

Perhaps one portion of the regex needs to be explained:

((?!<[^>]+>).)*

This matches any run of characters up to, but not including, the point where a complete tag (<...>) begins. The technique is known as a "tempered dot": it behaves like .* except that a negative look-ahead is checked before each character is consumed, so the match stops before it runs into a tag.
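
To see the effect of the look-ahead, here is a small comparison of a plain .* and the tempered version, both anchored at the start of the input. The helper prefixMatch and the sample strings are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class TemperedDotDemo {

    // Returns what `regex` matches starting at index 0 of `input`, or null if it does not match there.
    static String prefixMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "hi<img src=\"x\" /> tail";

        // Plain greedy dot: runs straight through the tag.
        System.out.println(prefixMatch(".*", input));                // prints: hi<img src="x" /> tail

        // Tempered dot: the look-ahead fails at '<', so the match stops before the tag.
        System.out.println(prefixMatch("((?!<[^>]+>).)*", input));   // prints: hi

        // A lone '<' with no closing '>' is not a tag, so it is consumed as ordinary text.
        System.out.println(prefixMatch("((?!<[^>]+>).)*", "1 < 2")); // prints: 1 < 2
    }
}
```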
