Regular Expression with Multiple Capture Groups

Question

I've been working on a regular expression to pick apart a bunch of text files that I need to parse into a database. My files are in the following format:

Lorem ipsum dolor&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sit amet, consectetur adipiscing elit.

Fusce lacinia sollicitudin lectus id eleifend. Phasellus.

massa sapien, scelerisque in tincidunt et, porttitor eget ante.
In iaculis justo vel quam rhoncus volutpat. Curabitur eros est,
ultrices in elementum eget, venenatis eget mauris. Sed sollicitudin,
nibh sed varius aliquet, neque odio porttitor risus, at sollicitudin

lectus neque sit amet diam.
Aliquam condimentum sapien eu
tellus condimentum suscipit.
Pellentesque in accumsan nunc.

I'm trying to come up with the following capture groups:

Lorem ipsum dolor
sit amet, consectetur adipiscing elit.
Fusce lacinia sollicitudin lectus id eleifend. Phasellus.
massa sapien, scelerisque in tincidunt et, porttitor eget ante.
In iaculis justo vel quam rhoncus volutpat. Curabitur eros est, ultrices in elementum eget, venenatis eget mauris. Sed sollicitudin, nibh sed varius aliquet, neque odio porttitor risus, at sollicitudin

Notes: Everything after the multiline paragraph can be ignored. All of the groups can include letters, numbers, spaces and punctuation. I'm going to be doing some additional post-processing on the text using PHP.

My last attempt at capturing the first two parts, which came closer than my others but still didn't work as intended, was:

^((?:[a-zA-Z0-9!-~](?: (?! ))?)+?)(?: {2,})((?:[a-zA-Z0-9!-~](?: (?! ))?)+?)

I thought that this would start at the beginning of the file, capture everything up to the point where it encountered multiple spaces, then grab the rest of the line.

Answer 1

Try this:

$pattern='~\A(.+?) {2,}(.+?)\R{2,}(.+?)\R{2,}(.+?)(?:\R{2,}|\Z)~s';

preg_match($pattern, $subject, $match);

See it in action on ideone.com

I'm assuming all those &nbsp;'s in your sample text represent regular spaces, and that you only used them so we could see that there was more than one space. If you had been using SO's code formatting from the beginning, that wouldn't have been necessary. I mean the indentation style of formatting; in text formatted with backticks, whitespace still gets collapsed.

I'm also assuming you're reading the whole file into memory, not processing it line-by-line. The regex is pretty straightforward. Starting at the beginning of the text ( \A ), it reluctantly matches and captures everything it sees ( (.+?) , in single-line mode) until it sees two or more consecutive spaces ( {2,} ).
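To make the single-line mode and the reluctant quantifier a bit more concrete, here's a minimal sketch. The inputs are made up for illustration and are not from your files; only the first part of the pattern is used.

```php
<?php
// Hypothetical two-line input: the double space only occurs on the second line.
$subject = "a\nb  c";

// Without the s modifier, . cannot cross the newline, so nothing matches from \A.
$withoutS = preg_match('~\A(.+?) {2,}~', $subject, $m1);

// With the s modifier, . also matches newlines, so group 1 spans both lines.
$withS = preg_match('~\A(.+?) {2,}~s', $subject, $m2);

// Reluctant vs. greedy on a hypothetical input with two runs of spaces.
preg_match('~\A(.+?) {2,}~s', 'x  y  z', $lazy);   // stops at the first run
preg_match('~\A(.+) {2,}~s', 'x  y  z', $greedy);  // backs off from the end to the last run

var_dump($withoutS);   // 0, no match
var_dump($m2[1]);      // "a", a newline, then "b"
var_dump($lazy[1]);    // "x"
var_dump($greedy[1]);  // "x  y"
```

That's why the pattern uses (.+?) rather than (.+): the reluctant version stops at the first run of spaces instead of the last one.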

After that, it reluctantly matches and captures until it sees two or more newlines in a row ( (.+?)\R{2,} ). Then it does the same thing twice more to capture the second and third paragraphs. The final (?:\R{2,}|\Z) is there in case there's no more text after the third paragraph.
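Here's a self-contained sketch of the whole thing, in case you can't reach the online demo. I'm assuming real spaces and "\n" line endings, the sample is a shortened version of yours, and the final preg_replace() is only one guess at the kind of post-processing you might want.

```php
<?php
// Shortened sample based on the question (assumed: real spaces, "\n" line endings).
$subject = "Lorem ipsum dolor         sit amet, consectetur adipiscing elit.\n"
         . "\n"
         . "Fusce lacinia sollicitudin lectus id eleifend. Phasellus.\n"
         . "\n"
         . "massa sapien, scelerisque in tincidunt et, porttitor eget ante.\n"
         . "In iaculis justo vel quam rhoncus volutpat. Curabitur eros est,\n"
         . "ultrices in elementum eget, venenatis eget mauris.\n"
         . "\n"
         . "lectus neque sit amet diam.\n";

$pattern = '~\A(.+?) {2,}(.+?)\R{2,}(.+?)\R{2,}(.+?)(?:\R{2,}|\Z)~s';

if (preg_match($pattern, $subject, $match)) {
    // $match[0] is the whole match; the captures start at index 1.
    for ($i = 1; $i <= 4; $i++) {
        printf("group %d: %s\n", $i, $match[$i]);
    }

    // Group 4 still contains the single line breaks of the third paragraph;
    // one possible post-processing step is to flatten them into spaces.
    $flat = preg_replace('~\R~', ' ', $match[4]);
    printf("flattened: %s\n", $flat);
} else {
    echo "no match\n";
}
```

With this input, the third paragraph (the "massa sapien" line together with the lines that follow it) ends up in group 4 as one block, so splitting it further is left to your post-processing.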

\R , if you're not familiar with it, is a shorthand for any kind of line separator: \n , \r , \r\n and a few other, less common ones. It's supported by Perl, PHP (PCRE), Ruby 1.9+ (Oniguruma) and a few other flavors, but not (so far) by JavaScript, Python, Java or .NET.
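A quick sketch of what \R buys you, using a made-up string that mixes three line-ending styles. The explicit alternation in the second call is my own suggestion for flavors that lack \R, and it only covers the three common separators, not the rarer ones.

```php
<?php
// Hypothetical input mixing Unix, Windows and old Mac line endings.
$text = "one\ntwo\r\nthree\rfour";

// \R treats "\r\n" as a single separator, so no empty element
// appears between "two" and "three".
$lines = preg_split('~\R~', $text);

// Rough equivalent for flavors without \R; "\r\n" has to come first
// so that it isn't split into two separate breaks.
$portable = preg_split('~\r\n|\n|\r~', $text);

print_r($lines);     // one, two, three, four
print_r($portable);  // same four elements
```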

1 answer

solution1
1 ACCEPTED 2011-05-08 06:23:09
