We're trying to parse data exported from a DOS-based accounting program from the 90s so that we can convert it and upload it to a newer system. It is mostly information about each accounting entry, and it is output with random tabs, line breaks, etc., like this:
#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;
L:28)
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
#Bank: Citibank (R:2432;
L:28
)
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
#Bank: Citibank (R:2432;
L:28
)
However, what is clear is that each entry starts on a new line and ends with a ).
How can a regex be written that starts at the line containing a given term and matches all the way up to the next )?
For example, in the data above we are looking for the string Dr using preg_match_all('/^.*\b(?:Dr)\b.*$/m', $dos, $matches), and it matches as follows:
Array
(
[0] => Array
(
[0] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;
[1] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
[2] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
)
)
You can see from the second result in the array that it has omitted #Bank: Citibank (R:2432; L:28) because that part is on a separate line, even though it still belongs to the entry that starts on the line above.
How can the regex be modified to match up to the next ), regardless of whether it is on the same line, the next line, or several lines below? The result should be:
Array
(
[0] => Array
(
[0] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
[1] => #Ch. No. 759263 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
[2] => #Ch. No. 395159 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
)
)
You could use a negated character class [^()] to match any char except a parenthesis; it also matches a newline.
After the match you can replace each run of whitespace chars with a single space.
^.*\bDr\b[^()]*\([^()]+\)
That will match
^ : Start of the string (or of each line when the /m flag is used, as in the example below)
.*\bDr\b : Match 0+ times any char except a newline, then match Dr between word boundaries (or match #Dr\b if it always starts with #)
[^()]* : Match 0+ times any char except a parenthesis
\( : Match (
[^()]+ : Match 1+ times any char except a parenthesis (if there has to be at least one char other than ( or ) in between)
\) : Match )
For example
$re = '/^.*\bDr\b[^()]*\([^()]+\)/m';
$str = '#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;
L:28)
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
#Bank: Citibank (R:2432;
L:28
)
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
#Bank: Citibank (R:2432;
L:28
)';
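// Collect every match; $matches[0] holds the full matches.
preg_match_all($re, $str, $matches);
// Collapse runs of whitespace (tabs, line breaks) into a single space.
$result = array_map(function($x) {
    return preg_replace('/\s+/', ' ', $x);
}, $matches[0]);
print_r($result);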
Output
Array
(
[0] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28)
[1] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28 )
[2] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28 )
)
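The output above still differs slightly from the result asked for in the question: a space remains after the ; and before the closing ). Below is a minimal sketch of one way to tidy that up, assuming whitespace inside the parentheses is never significant. The function name find_entries and its parameters are hypothetical, introduced only for illustration; preg_quote is used so that the search term is treated literally.

```php
<?php
// Hypothetical helper (not from the original answer): returns every entry
// whose first line contains $term as a whole word, flattened to one line.
function find_entries(string $text, string $term): array
{
    // Same pattern as above, with the search term escaped for literal use.
    $re = '/^.*\b' . preg_quote($term, '/') . '\b[^()]*\([^()]+\)/m';
    preg_match_all($re, $text, $matches);

    return array_map(function ($entry) {
        // Collapse every run of whitespace (tabs, line breaks) to one space.
        $entry = preg_replace('/\s+/', ' ', $entry);
        // Assumption: whitespace inside the parentheses carries no meaning,
        // so it is removed entirely from the parenthesised part.
        return preg_replace_callback('/\([^()]*\)/', function ($m) {
            return preg_replace('/\s+/', '', $m[0]);
        }, $entry);
    }, $matches[0]);
}

$dos = "#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)\n"
     . "#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997\n"
     . "#Bank: Citibank (R:2432;\n"
     . "L:28\n"
     . ")\n";

print_r(find_entries($dos, 'Dr'));
```

Because the term is wrapped in \b on both sides, this sketch only suits search terms that begin and end with a word character, such as Dr or Rt.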
Following @CBroe's comment, I came up with this:
/(#[^\)\n]*(?:#Dr).*\)\n*)/gsU
#[^\)\n]* -> starts with # and keeps the search from running past a ) or a \n (new line) before the search string is found.
(?:#Dr) -> the search string in a non-capturing group.
.*\)\n* -> continues until it meets a ), optionally followed by \n (new lines).
gsU -> flags used: g: global search, s: the dot also matches new lines, U: ungreedy quantifiers. PHP has no g modifier; preg_match_all performs the global search instead.
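As a rough sketch of how this pattern might be used from PHP: the g flag is dropped because PHP's PCRE functions do not accept it and preg_match_all already returns every match. The variable names and the shortened sample data are illustrative only.

```php
<?php
// Same pattern as above minus "g"; "s" lets the dot match line breaks and
// "U" makes every quantifier ungreedy, so .* stops at the first ")".
$re = '/(#[^\)\n]*(?:#Dr).*\)\n*)/sU';

$dos = "#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)\n"
     . "#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997\n"
     . "#Bank: Citibank (R:2432;\n"
     . "L:28\n"
     . ")\n";

preg_match_all($re, $dos, $matches);

// Flatten each multi-line match to a single line.
$entries = array_map(function ($m) {
    return preg_replace('/\s+/', ' ', $m);
}, $matches[0]);

print_r($entries);
```

The #Rt entry on the first line is skipped because [^\)\n]* cannot run past its closing ) or the line break to reach a later #Dr.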