
Regex: back reference, negative lookahead, atomic group

I want to match a single or double quote mark, followed by any number of characters that are not the quote just matched, followed by that same quote character:

"--'__'--"

This should match on the double quotes at each end. However, I want the match to be possessive, in that any characters that have already been tested should not take part in any later match:

"--'__'--

This should not match, because the double quote at the beginning is never followed by a closing double quote. I have come up with:

(?P<q>['"])(?>((?!(?P=q)).)*)(?P=q)

But this still matches my second example string above, on the single quotes in the middle. I don't understand why the atomic group doesn't prevent this. I have not been able to achieve it with any other arrangement of atomic grouping either.
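The behaviour can be reproduced with a short PHP sketch (the array keys and variable names are illustrative only). A likely reason for the result: an atomic group only discards backtracking positions inside one match attempt. Once the attempt that starts at the opening double quote fails, the engine moves the start position forward and tries again, and the attempt that starts at the first single quote succeeds.

```php
<?php
// Pattern from the question: a quote, an atomic run of characters that are
// not that quote, then the same quote again.
$re = '/(?P<q>[\'"])(?>((?!(?P=q)).)*)(?P=q)/';

// The two example strings from the question.
$subjects = array(
    'balanced'   => '"--\'__\'--"',
    'unbalanced' => '"--\'__\'--',
);

$found = array();
foreach ($subjects as $label => $subject) {
    // preg_match() returns 1 on a match; $m[0] then holds the matched text.
    $found[$label] = preg_match($re, $subject, $m) ? $m[0] : null;
    $shown = ($found[$label] === null) ? '(no match)' : $found[$label];
    printf("%s => %s\n", $label, $shown);
}
```

The balanced string is matched as a whole, while the unbalanced one still yields the inner single-quoted part '__', because nothing in the pattern ties the match to the first quote in the string.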

Also, if it is at all possible to match only the characters between the quotes, while still asserting that the quotes are present, that would be excellent. Because lookbehind assertions must be fixed width, I can't use a back reference to assert that the captured single or double quote occurs before the negative lookahead.

Assuming there will be only one valid quoted substring per line, this may be a good starting point:

<?php // test.php Rev:20120105_1800
// Return array of valid quoted substrings, one per line.
function getArrayOfOnePerLineValidQuotedSubstrings($text) {
    $re = '%
        # Match line w/1 valid "single" or "double" substring.
        # Newlines are excluded from each class so a match stays on one line.
        ^                   # Anchor to start of line.
        [^\'"\n]*           # Everything up to first quote.
        (?|                 # Branch reset group $1: Contents.
          "([^"\n]*)"       # Either $1.1 Double quoted,
        | \'([^\'\n]*)\'    # or $1.2 Single quoted contents.
        )                   # End $1: branch reset group.
        [^\'"\n]*           # Everything after quoted sub-string.
        $                   # Anchor to end of line.
        %xm';
    if (preg_match_all($re, $text, $matches)) {
        return $matches[1];
    }
    return array();
}
// Fetch test data from file.
$data = file_get_contents('testdata.txt');
// Get array of valid quoted substrings, one per line.
$output = getArrayOfOnePerLineValidQuotedSubstrings($data);
// Display results.
$count = count($output);
printf("%d matches found.\n", $count);
for ($i = 0; $i < $count; ++$i) {
    printf("  match[%d] = {%s}\n", $i + 1, $output[$i]);
}
?>

This regex matches each line that contains one valid quoted substring and skips lines that have an invalid one (e.g. "--'__'--, which has an unbalanced double-quoted substring) or no quoted substring at all. For lines that match, the contents of the valid quoted substring are captured in group $1. The function returns an array of the matched substrings.
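To see this on the strings from the question without a data file, here is a minimal sketch. The pattern is a compact form of the one above; its classes also exclude \n so that a match cannot span lines, which is an assumption about the intended behaviour. The sample lines are made up for illustration.

```php
<?php
// Compact form of the per-line pattern (no x-mode comments).
// Assumption: \n is excluded from each class to keep a match on one line.
$re = '%^[^\'"\n]*(?|"([^"\n]*)"|\'([^\'\n]*)\')[^\'"\n]*$%m';

// Hypothetical test data, one candidate per line.
$text = implode("\n", array(
    '"--\'__\'--"',         // balanced double quotes with single quotes inside
    '"--\'__\'--',          // unbalanced double quote
    'no quotes here',       // no quoted substring at all
    "say 'hello' please",   // one single-quoted substring
));

preg_match_all($re, $text, $matches);
$contents = $matches[1];    // group 1 holds the contents, thanks to (?| ... )
print_r($contents);
```

Only the first and last lines qualify, so the result holds --'__'-- and hello. The unbalanced line is rejected because the leading ^ and the quote-free prefix force the match to begin at the first quote on the line.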

If your data may contain more than one quoted substring per line, or if the quoted substrings or the text between them may contain escaped quotes, then a more complex solution can be formulated.
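One possible shape for such a solution, offered as a sketch and not as the definitive approach: scan a single line with \G so that each match must begin exactly where the previous one ended, and treat a backslash plus the next character as an escaped pair. The constant QUOTED_RE and the function quotedSubstringsInLine are hypothetical names introduced here, and backslash escaping is an assumption about the data.

```php
<?php
// Hypothetical pattern: nowdoc syntax avoids PHP-level escaping of quotes.
const QUOTED_RE = <<<'REGEX'
~
  \G                       # Resume exactly where the previous match ended.
  (?: [^'"\\] | \\. )*+    # Unquoted text: plain characters or escaped pairs.
  (?|                      # Branch reset: contents always land in group 1.
    " ( (?: [^"\\] | \\. )*+ ) "
  | ' ( (?: [^'\\] | \\. )*+ ) '
  )
~x
REGEX;

// Return the contents of every quoted substring in one line (no newlines
// expected). Scanning stops at the first quote that is never closed.
function quotedSubstringsInLine($line) {
    preg_match_all(QUOTED_RE, $line, $m);
    return $m[1];
}

print_r(quotedSubstringsInLine('a "b" c \'d\' e'));   // two quoted substrings
print_r(quotedSubstringsInLine('"--\'__\'--'));       // unbalanced double quote
```

Because \G cannot match at a position the scan has not reached, the engine cannot skip past an unclosed quote and pick up a later one, so the unbalanced example yields no matches. The possessive quantifiers (*+) keep already-consumed characters from being reconsidered within an attempt.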
