
Regex to partition multiline string

Consider a multiline string consisting of N lines, like the following:

Line 1 text
Line 2 text
Line 3 text
...
Line n-1 text
Line n text
anchor=value
Line n+2 text
Line n+3 text
Line n+4 text
...
Line N text

The anchor key does not appear inside any of the lines and there may be spaces before the anchor, as well as around the = sign that follows it.

I need a regex that partitions the above string into 3 groups:

  1. Line 1 to Line n (inclusive)
  2. Anchor line (partition point)
  3. Line n+2 to Line N (inclusive)

The closest I have got to the solution is

(?s)^(?:(?!anchor\s*=\s*).)+?\r|\nanchor\s*=\s*([^\r\n]+)(?:\r|\n)(.*)

but the above regex includes the entire text in the first matching group and populates the remaining 2 groups as expected.

An additional requirement is that the regex has to be as fast as possible, since it will be applied to large amounts of data. Note also that processing via a single regex is the only option in this use case.

Any ideas?

What about this regex?

(?s)^(.*?)(anchor\\s*\\=\\s*[^\\r\\n]+)(.*?)

Or, anchored at the end of the string, so that the lazy third group has to extend to the end instead of stopping early:

(?s)^(.*?)(anchor\\s*\\=\\s*[^\\r\\n]+)(.*?)$
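For reference, here is a minimal sketch of how the three groups could be read out in Java. It assumes the key is literally "anchor" and that a match is looked up with find(). The class name AnchorPartition and the method partition are made up for illustration, and the pattern is a variant of the one above rather than the original: it adds the m flag so the anchor has to start a line (optionally after spaces or tabs), and it makes the last group greedy.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class AnchorPartition {
    // (?m): ^ matches at line starts; (?s): . also matches line terminators.
    // Group 1: everything before the anchor line (including its trailing line break).
    // Group 2: the anchor line itself, without leading whitespace.
    // Group 3: everything after the anchor line.
    private static final Pattern PARTITION = Pattern.compile(
        "(?ms)\\A(.*?)^[ \\t]*(anchor[ \\t]*=[^\\r\\n]*)\\R?(.*)\\z");

    // Returns {before, anchorLine, after}, or null when no anchor line is found.
    static String[] partition(String text) {
        Matcher m = PARTITION.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String text = "Line 1 text\nLine 2 text\n  anchor = value\nLine 4 text\nLine 5 text";
        String[] parts = partition(text);
        System.out.println("before: [" + parts[0] + "]");
        System.out.println("anchor: [" + parts[1] + "]");
        System.out.println("after:  [" + parts[2] + "]");
    }
}
```

Compiling the pattern once and reusing it avoids paying the compilation cost on every input, which matters if it is applied to a lot of data.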

If you need speed on huge strings, regex is not the way to go. The entire string has to be in memory before a regex can tokenize it. My recommendation would be to use a Reader / InputStream instead.
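A rough sketch of what the Reader-based approach might look like, assuming the key is literally "anchor" and that it is acceptable to normalize line breaks to \n. The names AnchorStreamSplitter, isAnchorLine, split and splitToStrings are hypothetical. Lines are passed on to one of two Writers as they are read, so the whole input never has to sit in memory at once.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper, not part of the original answer.
class AnchorStreamSplitter {
    // True for lines such as "anchor=value" or "  anchor = value".
    static boolean isAnchorLine(String line) {
        String t = line.trim();
        return t.startsWith("anchor")
            && t.substring("anchor".length()).trim().startsWith("=");
    }

    // Copies lines before the first anchor line to 'before' and lines after it
    // to 'after'. Returns the trimmed anchor line, or null if there is none.
    // readLine() drops line terminators, so each line is written back with "\n".
    static String split(Reader source, Writer before, Writer after) throws IOException {
        BufferedReader in = new BufferedReader(source);
        String anchorLine = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (anchorLine == null && isAnchorLine(line)) {
                anchorLine = line.trim();
            } else {
                Writer target = (anchorLine == null) ? before : after;
                target.write(line);
                target.write('\n');
            }
        }
        return anchorLine;
    }

    // Convenience wrapper for in-memory text: returns {before, anchorLine, after}.
    static String[] splitToStrings(String text) {
        StringWriter before = new StringWriter();
        StringWriter after = new StringWriter();
        try {
            String anchorLine = split(new StringReader(text), before, after);
            return new String[] { before.toString(), anchorLine, after.toString() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] parts = splitToStrings("Line 1 text\n  anchor = value\nLine 3 text\n");
        System.out.println("anchor: [" + parts[1] + "]");
    }
}
```

In a real pipeline the two Writers would typically wrap files or sockets; the StringWriter wrapper is only there to make the sketch easy to try.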

Well, you could first get the anchor, then split on it:

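// Extract the anchor line: the whole input is matched and replaced by group 1.
String anchor = str.replaceAll("(?ms).*?(anchor\\s*=.*?)$.*", "$1");
// split() returns an array: the text before and the text after the anchor.
String[] lineParts = str.split("\\Q" + anchor + "\\E");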

The "m" flag makes ^ and $ match at the start and end of each line; the "s" flag lets . match line terminators as well.
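To see what the "m" flag changes here, a small sketch that runs the same extraction with and without it. The class and method names (FlagDemo, extractAnchor, extractWithoutM, splitOnAnchor) are made up for illustration, and Pattern.quote is used as an equivalent of wrapping the anchor in \Q...\E.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class FlagDemo {
    // With (?m), $ matches at the end of the anchor line,
    // so group 1 stops at the end of that line.
    static String extractAnchor(String str) {
        return str.replaceAll("(?ms).*?(anchor\\s*=.*?)$.*", "$1");
    }

    // Without (?m), $ only matches at the end of the input,
    // so group 1 runs on to the end of the string.
    static String extractWithoutM(String str) {
        return str.replaceAll("(?s).*?(anchor\\s*=.*?)$.*", "$1");
    }

    // Splits into at most two parts: text before and text after the anchor line.
    // Assumes the input actually contains an anchor line.
    static String[] splitOnAnchor(String str) {
        return str.split(Pattern.quote(extractAnchor(str)), 2);
    }

    public static void main(String[] args) {
        String str = "a\nanchor=value\nb";
        System.out.println("[" + extractAnchor(str) + "]");
        System.out.println("[" + extractWithoutM(str) + "]");
    }
}
```

Note that any spaces before the anchor stay at the end of the first part, because the capturing group starts at the key itself.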


 