
Java Regex: Match text between two strings with boundary conditions

I want to match text between two Strings, but the starting String has strict boundary conditions.

Sample input:

start
From: h
From:b
 xyz
Subject: 
end

I need to match between From: and Subject:.

If I use (From:.*).*(Subject:) with dotall, it produces

From: h
From:b
 xyz
Subject:

but I need only

From:b
 xyz
Subject:

because the starting string has strict boundary conditions. This is necessary because the starting String could be anywhere in the document, in which case the above regex would match a large block of text rather than just a few lines.
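
To make the over-matching concrete, here is a minimal sketch that runs the pattern above in DOTALL mode against the sample input. The class name GreedyFromDemo and the helper firstMatch are hypothetical names used only for illustration, and the sample is assumed to use plain \n line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyFromDemo {
    // Sample input from the question, assuming "\n" line breaks.
    static final String INPUT = "start\nFrom: h\nFrom:b\n xyz\nSubject: \nend";

    // Returns the first match of the question's pattern in DOTALL mode, or null if none.
    static String firstMatch(String text) {
        Matcher m = Pattern.compile("(From:.*).*(Subject:)", Pattern.DOTALL).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The match starts at the first "From:" in the text, not the one nearest "Subject:".
        System.out.println(firstMatch(INPUT));
    }
}
```

The regex engine reports the leftmost match, so the result begins at "From: h" even though a later "From:" sits closer to "Subject:".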

%%%%%%%%%%%% Problem redefined %%%%%%%%%%%%%%

I have text in which I need to match:

From:<any text>
To:<any text>
Subject:<any text>

The catch is that all three components can be on one line, or separated by one newline, or by two newlines. There is text before and after the desired match that could contain From:<any text>, which is why I need strict boundaries.

Try this out:

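import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "start From: h From:b xyz Subject: end";
// Caution: the unbounded * inside the lookbehind may be rejected with a
// PatternSyntaxException on some Java versions.
Matcher matcher = Pattern.compile("(?<=^((?!From:).)*(From: [A-Za-z0-9] ))(.+?)(Subject:)").matcher(input);
if (matcher.find())
{
    System.out.println(matcher.group());
}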

Output: From:b xyz Subject:


Explanation of regex ( (?<=^((?!From:).)*(From: [A-Za-z0-9] ))(.+?)(Subject:) ):

  • (?<= start looking behind
  • ^ the start of the string
  • ((?!From:).) if looking ahead and you can't see "From:" then match any character
  • * matches the previous statement zero or more times
  • (From: [A-Za-z0-9] ) matches the first "From:" and its contents (here a single alphanumeric character with a space on either side)
  • ) stop looking behind
  • (.+?) matches the string we are looking for
  • (Subject:) matches the subject field

Instead of using .* in DOTALL mode, I suggest you match one line at a time, after asserting that the line doesn't start with From: .

"(?m)^From:.*[\r\n]+(?:(?!From:).*[\r\n]+)*Subject:.*$"

That's the minimum implementation. Depending on how your text is structured, it could still match too much or too slowly (especially in cases where no match is possible). Here's a more robust version:

"(?m)^(?>From:.*[\r\n]+)(?>(?!From:|Subject:).*[\r\n]+)*+Subject:.*$"

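As a rough check of the two line-at-a-time patterns above, the following sketch runs both against the question's sample input. The class name LineWiseFromDemo, the constants MINIMAL and ROBUST, and the helper firstMatch are hypothetical names for illustration, and the input is assumed to use plain \n line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineWiseFromDemo {
    // Sample input from the question, assuming "\n" line breaks.
    static final String INPUT = "start\nFrom: h\nFrom:b\n xyz\nSubject: \nend";

    // Minimal version: skip whole lines that do not start with "From:".
    static final Pattern MINIMAL = Pattern.compile(
        "(?m)^From:.*[\r\n]+(?:(?!From:).*[\r\n]+)*Subject:.*$");

    // Robust version: atomic groups and a possessive quantifier prevent backtracking.
    static final Pattern ROBUST = Pattern.compile(
        "(?m)^(?>From:.*[\r\n]+)(?>(?!From:|Subject:).*[\r\n]+)*+Subject:.*$");

    // Returns the first match of the given pattern, or null if none.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(MINIMAL, INPUT));
        System.out.println(firstMatch(ROBUST, INPUT));
    }
}
```

On this input both patterns start the match at the From:b line, because the From: h line is immediately followed by another From: line. Since the patterns end in .*$, the match includes the rest of the Subject: line, here a single trailing space.
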
Use the DOTALL modifier (?s) and a negative lookahead:

(?s)From:((?!From:).)*?Subject: @ regex101

NOTE: the regex101 fiddle contains the live regex and test data.
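
For Java specifically, the same pattern can be tried with a sketch like the one below. The class name TemperedFromDemo and the helper firstMatch are hypothetical, and the sample input from the question is assumed to use plain \n line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemperedFromDemo {
    // Sample input from the question, assuming "\n" line breaks.
    static final String INPUT = "start\nFrom: h\nFrom:b\n xyz\nSubject: \nend";

    // Each character between "From:" and "Subject:" is consumed only if
    // another "From:" does not start at that position.
    static final Pattern PATTERN = Pattern.compile("(?s)From:((?!From:).)*?Subject:");

    // Returns the first match, or null if none.
    static String firstMatch(String text) {
        Matcher m = PATTERN.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(INPUT));
    }
}
```

The attempt that starts at "From: h" fails as soon as it reaches the second "From:", so the engine moves on and the match begins at "From:b".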

Simply:

From\:\w*(?!From\:\w*)\n*\w*\n*Subject:\w*

Demo: https://regex101.com/r/mX9kC7/3
