
Why does this regular expression not match the first result in PHP?

Here is my regular expression:

❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱

Here is the test text (there is an online demo in JavaScript, where it works fine):

Nulla imperdiet ❰❮6❯⦓“Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse gravida consectetur mauris, eget ornare velit consequat vitae.⦔❱❰❮7❯⦓Morbi in quam id nulla facilisis vestibulum sit amet ornare est. Duis dolor erat, porttitor at eleifend congue, lacinia vitae est. Phasellus ac sem ut velit fermentum porta at sit amet neque.⦔❱❰❮8❯⦓Etiam in congue turpis. Cras volutpat est mauris. Nulla imperdiet libero vitae metus semper, sit amet dictum lectus placerat. Aenean at venenatis libero.⦔❱❰❮9-10❯⦓Aenean luctus at nibh eget scelerisque. Phasellus vel consequat dui, eu euismod lacus. Nam id tellus tincidunt, tristique quam eu, cursus nulla. Suspendisse ac nibh lacinia, tempus enim quis, elementum nulla. .⦔❱ eu euismod.

But it does not work in PHP. That is, it does not retrieve the first match, i.e., the one from ❰❮6❯⦓“ to vitae.⦔❱. Intriguingly, if I remove the Unicode double quote character (“), it works fine, but with the quote in place the first match is missed. Why is this, and how can it be avoided?


Explanation of the regex: I want to match the content between ⦓ and ⦔, but only when that pair is the only content between ❰ and ❱, apart from the leading digit block in ❮ and ❯.

Example of a match:

❰❮6❯⦓Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse gravida consectetur mauris, eget ornare velit consequat vitae.⦔❱

Example of a non-match:

❰❮6❯⦓Lorem ipsum dolor sit amet, consectetur adipiscing elit.⦔ Suspendisse gravida consectetur mauris, eget ornare velit consequat vitae.❱
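To make the intent concrete, here is a minimal sketch that runs the pattern against shortened versions of the two examples. The variable names and the shortened strings are only for illustration, and it assumes the file is saved as UTF-8. Neither string contains the Unicode quote, so both behave as intended.

```php
<?php
// The pattern exactly as written in the question (no modifiers).
$pattern = '#❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱#';

// Shortened forms of the two examples; the content is plain ASCII.
$shouldMatch    = '❰❮6❯⦓Lorem ipsum dolor sit amet.⦔❱';
$shouldNotMatch = '❰❮6❯⦓Lorem ipsum dolor sit amet.⦔ Suspendisse gravida.❱';

// preg_match() returns 1 when the pattern matches and 0 when it does not.
$first  = preg_match($pattern, $shouldMatch);
$second = preg_match($pattern, $shouldNotMatch);

var_dump($first);  // int(1): ⦔ is immediately followed by ❱
var_dump($second); // int(0): text sits between ⦔ and ❱
```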


My PHP code:

<?php
$subject = "Nulla imperdiet ❰❮6❯⦓“Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse gravida consectetur mauris,
         eget ornare velit consequat vitae.⦔❱❰❮7❯⦓Morbi in quam id nulla facilisis vestibulum sit amet ornare est. Duis dolor erat, 
        porttitor at eleifend congue, lacinia vitae est. Phasellus ac sem ut velit fermentum porta at sit amet neque.⦔❱❰❮8❯⦓Etiam in congue turpis. 
        Cras volutpat est mauris. Nulla imperdiet libero vitae metus semper, sit amet dictum lectus placerat. Aenean at venenatis libero.⦔❱❰❮9-10❯⦓Aenean 
        luctus at nibh eget scelerisque. Phasellus vel consequat dui, eu euismod lacus. Nam id tellus tincidunt, tristique quam eu,
        cursus nulla. Suspendisse ac nibh lacinia, tempus enim quis, elementum nulla. .⦔❱ eu euismod.";


$pattern = '#❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱#';
preg_match_all($pattern, $subject, $matches);
echo '<pre>';
print_r($matches);
echo '</pre>';    
?>

Output:

Array
(
    [0] => Array
        (
            [0] => ❰❮7❯⦓Morbi in quam id nulla facilisis vestibulum sit amet ornare est. Duis dolor erat, 
        porttitor at eleifend congue, lacinia vitae est. Phasellus ac sem ut velit fermentum porta at sit amet neque.⦔❱
            [1] => ❰❮8❯⦓Etiam in congue turpis. 
        Cras volutpat est mauris. Nulla imperdiet libero vitae metus semper, sit amet dictum lectus placerat. Aenean at venenatis libero.⦔❱
            [2] => ❰❮9-10❯⦓Aenean 
        luctus at nibh eget scelerisque. Phasellus vel consequat dui, eu euismod lacus. Nam id tellus tincidunt, tristique quam eu,
        cursus nulla. Suspendisse ac nibh lacinia, tempus enim quis, elementum nulla. .⦔❱
        )

    [1] => Array
        (
            [0] => ❮7❯
            [1] => ❮8❯
            [2] => ❮9-10❯
        )

    [2] => Array
        (
            [0] => Morbi in quam id nulla facilisis vestibulum sit amet ornare est. Duis dolor erat, 
        porttitor at eleifend congue, lacinia vitae est. Phasellus ac sem ut velit fermentum porta at sit amet neque.
            [1] => Etiam in congue turpis. 
        Cras volutpat est mauris. Nulla imperdiet libero vitae metus semper, sit amet dictum lectus placerat. Aenean at venenatis libero.
            [2] => Aenean 
        luctus at nibh eget scelerisque. Phasellus vel consequat dui, eu euismod lacus. Nam id tellus tincidunt, tristique quam eu,
        cursus nulla. Suspendisse ac nibh lacinia, tempus enim quis, elementum nulla. .
        )

)

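You're matching Unicode characters, but you haven't included the Unicode modifier (u), so PCRE treats both the pattern and the subject as sequences of bytes rather than UTF-8 characters. In byte mode, [^⦔] excludes each of the three bytes of ⦔ (E2 A6 94) individually. The quote “ is encoded as E2 80 9C, so it starts with one of the excluded bytes; the class cannot consume it, and the first match fails. The other matches succeed because their content is plain ASCII.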
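The byte-level view can be checked directly. The following is a small sketch, assuming the file is saved as UTF-8; the variable names are made up for the demonstration.

```php
<?php
$closing = '⦔'; // U+2994, the character the class [^⦔] is meant to exclude
$quote   = '“'; // U+201C, the character that breaks the first match

$closingHex = bin2hex($closing);
$quoteHex   = bin2hex($quote);
echo $closingHex, "\n"; // e2a694
echo $quoteHex, "\n";   // e2809c (same leading byte e2)

// Without u: the class is a set of three bytes, and 0xE2 is one of them,
// so the first byte of the quote is rejected.
$byteMode = preg_match('#^[^⦔]+$#', $quote);

// With u: the class excludes exactly one code point, so the quote is accepted.
$utf8Mode = preg_match('#^[^⦔]+$#u', $quote);

var_dump($byteMode); // int(0)
var_dump($utf8Mode); // int(1)
```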

From the manual:

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
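One consequence of the validity check is worth guarding against once u is in use: a subject that is not valid UTF-8 no longer simply fails to match. The sketch below assumes a reasonably recent PHP, where preg_match() returns false in that case and preg_last_error() reports PREG_BAD_UTF8_ERROR. The subject string is invented for the demonstration.

```php
<?php
$pattern = '#⦓([^⦔]*)⦔#u';

// "\xFF" can never occur in valid UTF-8, so this subject is invalid.
$invalidSubject = "⦓abc\xFF⦔";

$result = preg_match($pattern, $invalidSubject, $m);
$error  = preg_last_error();

if ($result === false && $error === PREG_BAD_UTF8_ERROR) {
    // Distinguish "no match" (0) from "could not run" (false).
    echo "Subject is not valid UTF-8\n";
}
```

Comparing the return value with === false matters here, because a plain truthiness check would treat the error the same as an ordinary non-match.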

To fix your problem, simply append u to your regex:

$pattern = '#❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱#u';
// Add the unicode modifier            ^
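As a quick check, here is a sketch that runs both versions of the pattern against a shortened subject. The shortened text and the variable names are my own, and the file is assumed to be saved as UTF-8.

```php
<?php
// Shortened subject: the first block starts with the Unicode quote.
$subject = 'Nulla imperdiet ❰❮6❯⦓“Lorem ipsum.⦔❱❰❮7❯⦓Morbi in quam.⦔❱'
         . '❰❮9-10❯⦓Aenean luctus.⦔❱ eu euismod.';

$withoutU = '#❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱#';
$withU    = '#❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱#u';

preg_match_all($withoutU, $subject, $byteMatches);
preg_match_all($withU, $subject, $utf8Matches);

// Group 1 holds the ❮...❯ markers of every block that matched.
print_r($byteMatches[1]); // ❮7❯ and ❮9-10❯ only; the ❮6❯ block is skipped
print_r($utf8Matches[1]); // ❮6❯, ❮7❯ and ❮9-10❯
```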
