Regex to match everything from newline up to parenthesis along with a search term
We're trying to parse information that's been output from a DOS-based accounting program from the 90s, so we can convert and upload it to a newer system. It's mostly information pertaining to each accounting entry, and it's output with random tabs, line breaks etc., like this:
#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;
L:28)
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
#Bank: Citibank (R:2432;
L:28
)
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
#Bank: Citibank (R:2432;
L:28
)
However, what's clear is that the information for each entry starts on a new line and ends with a )
How can we write a regex that starts at the line containing a search term and matches all the way up to the closing )?
For example, in the data above, we're looking for the string Dr using preg_match_all('/^.*\b(?:Dr)\b.*$/m', $dos, $matches), and it matches as follows:
Array
(
[0] => Array
(
[0] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;
[1] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
[2] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
)
)
You can see from the second result in the array that it has omitted #Bank: Citibank (R:2432; L:28) since it's on a separate line, but that data is still part of the entry on the line above it.
How can the regex we're using be modified to match up to the next ) regardless of whether it's on the same line, the next line, or even a few lines below? So the result will be:
Array
(
[0] => Array
(
[0] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
[1] => #Ch. No. 759263 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
[2] => #Ch. No. 395159 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
)
)
You could use a negated character class [^()] to match any char except a parenthesis; unlike the dot, it also matches a newline. After the match, you can replace each run of whitespace chars with a single space.
^.*\bDr\b[^()]*\([^()]+\)
That will match:
^ -> Start of string (or of each line when the m flag is set)
.*\bDr\b -> Match 0+ times any char except a newline, then match Dr between word boundaries (or match #Dr\b if it always starts with #)
[^()]* -> Match 0+ times any char except parentheses
\( -> Match (
[^()]+ -> Match 1+ times any char except parentheses (if there has to be at least one char between ( and ))
\) -> Match )
For example:
$re = '/^.*\bDr\b[^()]*\([^()]+\)/m';
$str = '#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;
L:28)
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
#Bank: Citibank (R:2432;
L:28
)
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
#Bank: Citibank (R:2432;
L:28
)';
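// Collect every entry that contains Dr, up to its closing parenthesis.
preg_match_all($re, $str, $matches);
// Collapse the tabs and line breaks inside each match into single spaces.
$result = array_map(function ($x) {
    return preg_replace('/\s+/', ' ', $x);
}, $matches[0]);
print_r($result);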
Output
Array
(
[0] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28)
[1] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28 )
[2] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28 )
)
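If the search term varies, the same pattern can be wrapped in a small helper. This is only a sketch: the function name find_entries and its parameters are hypothetical (they are not from the question), and it assumes that the term starts and ends with a word character and that every entry ends with a single, non-nested (...) group. It escapes the term with preg_quote and also strips the whitespace left inside the parentheses, so the result is closer to the format asked for in the question.

```php
<?php
// Hypothetical helper: find_entries() is an illustrative name, not from the original post.
function find_entries(string $dos, string $term): array
{
    // Escape the term so characters such as "." are taken literally.
    $quoted = preg_quote($term, '/');
    // A line containing the term, then everything up to and including
    // the next (...) group, even when it continues on later lines.
    $re = '/^.*\b' . $quoted . '\b[^()]*\([^()]+\)/m';
    preg_match_all($re, $dos, $matches);

    return array_map(function ($entry) {
        // Remove all whitespace inside the parentheses.
        $entry = preg_replace_callback('/\([^()]+\)/', function ($m) {
            return preg_replace('/\s+/', '', $m[0]);
        }, $entry);
        // Collapse the remaining tabs and line breaks into single spaces.
        return trim(preg_replace('/\s+/', ' ', $entry));
    }, $matches[0]);
}

// Shortened sample in the same shape as the data in the question.
$dos = "#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)\n"
     . "#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997\n"
     . "\t#Bank: Citibank (R:2432;\n"
     . "L:28\n"
     . ")";

print_r(find_entries($dos, 'Dr'));
```

With this sample, only the Dr entry is returned, with its bank details joined onto one line as (R:2432;L:28). If an entry could be missing its parentheses, the match would run on into the next entry, so that case would need separate handling.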
Based on @CBroe's comment, I came up with this:
/(#[^\)\n]*(?:#Dr).*\)\n*)/gsU
#[^\)\n]* -> starts with # and then matches any characters other than ) or \n (newline), so the search for the term cannot run past the end of an entry or of the line.
(?:#Dr) -> the search string in a non-capturing group.
.*\)\n* -> continue until a ) is met; with the s flag the . also matches \n (newline), so the ) can be several lines further down. Any newlines after it are optional.
gsU -> flags used: g: global search, s: . matches newlines, U: ungreedy quantifiers. PHP's preg functions do not accept g; call preg_match_all to get all matches.
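A minimal sketch of how this pattern might be run from PHP, where it is passed to preg_match_all with only the s and U modifiers. The variable names ($dos, $entries) and the shortened sample data are illustrative assumptions, and the search term is assumed to always be written as #Dr.

```php
<?php
// Shortened sample in the same shape as the data in the question.
$dos = "#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)\n"
     . "#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997\n"
     . "#Bank: Citibank (R:2432;\n"
     . "L:28\n"
     . ")\n";

// s: "." also matches newlines; U: quantifiers are lazy by default.
$re = '/(#[^\)\n]*(?:#Dr).*\)\n*)/sU';
preg_match_all($re, $dos, $matches);

// Flatten the line breaks inside each match into single spaces.
$entries = array_map(function ($entry) {
    return preg_replace('/\s+/', ' ', $entry);
}, $matches[0]);

print_r($entries);
```

Because U makes the trailing \n* lazy as well, each match stops right at the closing parenthesis. The first line of the sample is skipped: it contains no #Dr before its ), and [^\)\n]* cannot cross that ) or the line break.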