
Regex to match everything from a newline up to a parenthesis along with a search term

We're trying to parse information that was output by a DOS-based accounting program from the 90s, so that we can convert it and upload it to a newer system. It's mostly information pertaining to each accounting entry, and it's output with random tabs, line breaks, etc., like this:

#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)

#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;
L:28)

#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
#Bank: Citibank (R:2432;
L:28
)

#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
        #Bank: Citibank (R:2432;
    L:28
)

However, what's clear is that the information for each entry starts on a new line and ends with a ).

How can a regex be written that starts at a line containing a search term and matches all the way up to the closing )?

For example, in the data above we're looking for the string Dr using preg_match_all('/^.*\b(?:Dr)\b.*$/m', $dos, $matches), and it matches as follows:

Array
(
    [0] => Array
        (
            [0] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;
            [1] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
            [2] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
        )

)

You can see from the second result in the array that it has omitted #Bank: Citibank (R:2432; L:28) since that part is on a separate line, even though it still belongs to the entry on the line above it.
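A minimal sketch of why the match stops early, using one entry from the data above: without the s flag, . never matches a newline, and with the m flag $ matches at the end of every line. The variable names are only for illustration.

```php
<?php
// One entry whose closing parenthesis sits on a later line (taken from the data above).
$dos = "#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997\n"
     . "#Bank: Citibank (R:2432;\n"
     . "L:28\n"
     . ")";

// Original pattern: "." cannot cross "\n", and with the m flag "$" matches
// at the end of each line, so the match ends on the first line.
preg_match_all('/^.*\bDr\b.*$/m', $dos, $matches);

print_r($matches[0]);
```

Only the first line of the entry is returned; the #Bank part on the following lines is never reached.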

How can the regex we're using be modified to match up to the next ), regardless of whether it's on the same line, the next line, or even a few lines below? The result should be:

Array
(
    [0] => Array
        (
            [0] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
            [1] => #Ch. No. 759263 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
            [2] => #Ch. No. 395159 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
        )

)

You could use a negated character class such as [^()] to match any char except a parenthesis; unlike the dot, it also matches a newline.

After the match you can replace each run of whitespace chars with a single space.

^.*\bDr\b[^()]*\([^()]+\)

That will match

  • ^ Start of a line when the m (multiline) flag is set, as in the example below; without that flag, ^ matches only at the start of the string
  • .*\bDr\b Match 0+ times any char except a newline and then match Dr between word boundaries (or match #Dr\b if it always starts with #)
  • [^()]* Match 0+ times any char except a parenthesis, newlines included
  • \( Match (
  • [^()]+ Match 1+ times any char except a parenthesis (so there has to be at least one char between the parentheses)
  • \) Match )
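The breakdown above can also be kept next to the pattern itself. This is a sketch that assumes the x (extended) flag is acceptable: it makes PCRE ignore whitespace in the pattern and treat # as the start of a comment, which is safe here because the pattern contains no literal # or spaces. The sample string is a shortened copy of the data from the question.

```php
<?php
// Same pattern, annotated. Comments inside the pattern must not contain
// the "/" delimiter or a single quote, since both would end it early.
$re = '/
    ^            # start of a line (because of the m flag)
    .*\bDr\b     # rest of the line up to the whole word Dr
    [^()]*       # anything except parentheses, newlines included
    \(           # the opening parenthesis
    [^()]+       # content between the parentheses, newlines included
    \)           # the closing parenthesis that ends the entry
/mx';

$str = "#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)\n\n"
     . "#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997\n"
     . "        #Bank: Citibank (R:2432;\n"
     . "    L:28\n"
     . ")";

preg_match_all($re, $str, $matches);

// Flatten the multi-line entry into a single line.
$clean = preg_replace('/\s+/', ' ', $matches[0][0]);
echo $clean, PHP_EOL;
```

The #Rt entry is skipped because its line does not contain the word Dr, so only the second entry is matched.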


For example

$re = '/^.*\bDr\b[^()]*\([^()]+\)/m';
$str = '#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)

#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;
L:28)

#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
#Bank: Citibank (R:2432;
L:28
)

#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
        #Bank: Citibank (R:2432;
    L:28
)';

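// Collect every entry whose first line contains the word Dr.
preg_match_all($re, $str, $matches);
// Flatten each entry by collapsing runs of whitespace (newlines, tabs) into one space.
$result = array_map(function($x) {
    return preg_replace('/\s+/', ' ', $x);
}, $matches[0]);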
print_r($result);

Output

Array
(
    [0] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28)
    [1] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28 )
    [2] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28 )
)
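Since the search term may vary, the same pattern could be wrapped in a small helper. This is only a sketch: findEntries is a hypothetical name, it assumes the term starts and ends with a word character (otherwise the \b boundaries will not behave as intended), and it additionally removes the space left before ) so the result is closer to the output requested in the question.

```php
<?php
/**
 * Hypothetical helper: return every entry whose first line contains $term
 * as a whole word, flattened to a single line.
 */
function findEntries(string $text, string $term): array
{
    // preg_quote() escapes regex metacharacters in the term, plus the "/" delimiter.
    $re = '/^.*\b' . preg_quote($term, '/') . '\b[^()]*\([^()]+\)/m';
    preg_match_all($re, $text, $matches);

    return array_map(function (string $entry): string {
        // Collapse runs of whitespace, then drop the space left before ")".
        $entry = preg_replace('/\s+/', ' ', $entry);
        return preg_replace('/\s+\)/', ')', $entry);
    }, $matches[0]);
}

$dos = "#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)\n\n"
     . "#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997\n"
     . "#Bank: Citibank (R:2432;\n"
     . "L:28\n"
     . ")";

print_r(findEntries($dos, 'Dr'));
print_r(findEntries($dos, 'Rt'));
```

Calling it with 'Dr' returns the second entry on one line, and calling it with 'Rt' returns the first entry unchanged.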

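Following @CBroe's comment, I came up with this:

/(#[^\)\n]*(?:#Dr).*\)\n*)/gsU

  • #[^\)\n]* -> starts at a # and matches only characters other than ) and \n (new line), so the search for the term cannot run past the end of the line or past a closing parenthesis.

  • (?:#Dr) -> the search string in a non-capturing group.

  • .*\)\n* -> continues, across new lines if necessary, until it meets the next ), followed by optional \n (new line) characters.

  • gsU -> the flags used: g: global search (regex tester notation; in PHP, preg_match_all already finds all matches and g is not a valid modifier), s: the dot also matches new lines, U: ungreedy quantifiers.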
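A sketch of how this pattern might be used from PHP, assuming the g flag is simply dropped and preg_match_all is used instead. The sample data is a shortened copy of the data from the question.

```php
<?php
// s lets "." cross newlines, U makes every quantifier lazy by default.
$re = '/(#[^\)\n]*(?:#Dr).*\)\n*)/sU';

$dos = "#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)\n\n"
     . "#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;\n"
     . "L:28)\n";

preg_match_all($re, $dos, $matches);

// $matches[1] holds the captured entries; flatten each one to a single line.
$entries = array_map(function ($m) {
    return preg_replace('/\s+/', ' ', trim($m));
}, $matches[1]);

print_r($entries);
```

Note that this version looks for the literal text #Dr rather than Dr between word boundaries, so it relies on the term always being prefixed with #.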

