
Regex to match everything from a newline up to a parenthesis along with a search term

We're trying to parse information that was output by a DOS-based accounting package from the 90s, so that we can convert it and upload it to a newer system. It's mostly information pertaining to each accounting entry, and it's output with random tabs, line breaks, etc., like this:

#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)

#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;
L:28)

#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
#Bank: Citibank (R:2432;
L:28
)

#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
        #Bank: Citibank (R:2432;
    L:28
)

However, what's clear is that the information for each entry starts on a new line and ends with a )

How can we write a regex that looks for a term in that line and then matches all the way up to the closing )?

For example, in the data above we're looking for the string Dr using preg_match_all('/^.*\b(?:Dr)\b.*$/m', $dos, $matches), and it matches as follows:

Array
(
    [0] => Array
        (
            [0] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;
            [1] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
            [2] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
        )

)

You can see from the second result in the array that #Bank: Citibank (R:2432; L:28) has been omitted because it's on a separate line, but that data is still part of the entry that starts on the line above it.

How can the regex we're using be modified to match up to the next ), regardless of whether it's on the same line, the next line, or even a few lines further down? The result should be:

Array
(
    [0] => Array
        (
            [0] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
            [1] => #Ch. No. 759263 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
            [2] => #Ch. No. 395159 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
        )

)

You could use a negated character class, [^()], to match any character except a parenthesis. Unlike the dot, it also matches a newline.

After matching, you can replace each run of whitespace characters with a single space.

^.*\bDr\b[^()]*\([^()]+\)

That will match

  • ^ Start of a line (the pattern is used with the m flag)
  • .*\bDr\b Match 0+ times any char except a newline, then match Dr between word boundaries (or match #Dr\b if it always starts with #)
  • [^()]* Match 0+ times any char except a parenthesis, including newlines
  • \( Match (
  • [^()]+ Match 1+ times any char except a parenthesis (so there has to be at least one character between ( and ))
  • \) Match )
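
To make the difference between the dot and the negated class concrete, here is a minimal sketch that runs both patterns against a single entry taken from the question's data. The variable names are only illustrative, and the first pattern is a simplified stand-in for a dot-only approach rather than the asker's exact regex.

```php
<?php
// One entry from the sample data, spread over several lines.
$entry = "#Dr. Date 10-09-1997\n#Bank: Citibank (R:2432;\nL:28\n)";

// The dot does not match a newline, so it can never reach a ")"
// that sits on a later line.
$dotOnly = preg_match('/^.*\bDr\b.*\)/m', $entry);

// [^()] matches any character except a parenthesis, newlines included,
// so the match can run on to the closing ")" on a following line.
$negated = preg_match('/^.*\bDr\b[^()]*\([^()]+\)/m', $entry, $m);

var_dump($dotOnly);           // int(0)
var_dump($negated);           // int(1)
var_dump($m[0] === $entry);   // bool(true)
```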

Regex demo | PHP demo

For example:

$re = '/^.*\bDr\b[^()]*\([^()]+\)/m';
$str = '#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)

#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;
L:28)

#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
#Bank: Citibank (R:2432;
L:28
)

#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997
        #Bank: Citibank (R:2432;
    L:28
)';

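// preg_match_all() collects every full match in $matches[0].
preg_match_all($re, $str, $matches);

// Collapse each run of whitespace (tabs, spaces, newlines) into one space.
$result = array_map(function ($x) {
    return preg_replace('/\s+/', ' ', $x);
}, $matches[0]);
print_r($result);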

Output

Array
(
    [0] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28)
    [1] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28 )
    [2] => #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28 )
)
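
The output above still has a space before the closing parenthesis in some entries, whereas the question asks for (R:2432;L:28). One possible way to close that gap is sketched below. normalize_entry is a hypothetical helper, not part of the original answer, and it assumes that whitespace inside the parentheses is never meaningful.

```php
<?php
// Hypothetical helper: tidy one matched entry.
// Assumption: whitespace inside the parenthesised field can be dropped.
function normalize_entry(string $entry): string
{
    // Collapse runs of whitespace (tabs, newlines, spaces) into one space.
    $entry = preg_replace('/\s+/', ' ', $entry);

    // Strip all remaining whitespace inside each (...) group.
    return preg_replace_callback('/\([^()]*\)/', function (array $m): string {
        return preg_replace('/\s+/', '', $m[0]);
    }, $entry);
}

$raw = "#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997\n"
     . "        #Bank: Citibank (R:2432;\n    L:28\n)";

$clean = normalize_entry($raw);
echo $clean, PHP_EOL;
// #Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432;L:28)
```

It could be passed to array_map in place of the anonymous function used above.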

Based on @CBroe's comment, I came up with this:

/(#[^\)\n]*(?:#Dr).*\)\n*)/gsU

  • #[^\)\n]* -> starts at a # and matches on without crossing a ) or a \n (new line).

  • (?:#Dr) -> the search string, in a non-capturing group.

  • .*\)\n* -> continues, across new lines if necessary, up to the next ), followed by any trailing \n (new line) characters.

  • gsU -> the flags used: g: global search, s: the dot also matches new lines, U: ungreedy quantifiers. Note that g is regex-tester notation and not a PHP modifier; in PHP, preg_match_all() performs the global search.
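
As a rough sketch of how this pattern might be used from PHP, the snippet below drops the g flag and relies on preg_match_all() instead. The $dos variable name is borrowed from the question, and the input is a shortened version of its sample data.

```php
<?php
// Shortened sample: one entry without #Dr, one multi-line entry with it.
$dos = "#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)\n\n"
     . "#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997\n"
     . "#Bank: Citibank (R:2432;\nL:28\n)";

// s: the dot also matches newlines; U: quantifiers are ungreedy by default.
// There is no g modifier in PHP; preg_match_all() finds every match.
$count = preg_match_all('/(#[^\)\n]*(?:#Dr).*\)\n*)/sU', $dos, $matches);

// Collapse whitespace so each entry ends up on a single line.
$entries = array_map(function ($x) {
    return preg_replace('/\s+/', ' ', $x);
}, $matches[0]);

print_r($entries);
```

Only the second entry is returned, because the first one reaches its ) before any #Dr is found.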

Demo
