
Regular expression with preg_match_all

I ran into a problem when using regular expressions:

php> $html = "<html><head><body><h1>hello world</h1><img src=\"data:rawIMGdata\" /><p/><img src=\"sdfsdf.jpg\" title=\"pic1\" /><p/><div class=\"myclass\"><img src=\"data:imageData\" /></div><img alt=\"bla\" src=\"bla.jpg\" title=\"bla\" /></body></html>";
php> $pat = '/<img.*src="(data:.*)"/m';
php> preg_match_all($pat, $html, $matching);
php> var_dump($matching);
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(169) "<img src="data:rawIMGdata" /><p/><img src="sdfsdf.jpg" title="pic1" /><p/><div class="myclass"><img src="data:imageData" /></div><img alt="bla" src="bla.jpg" title="bla""
  }
  [1]=>
  array(1) {
    [0]=>
    string(63) "data:imageData" /></div><img alt="bla" src="bla.jpg" title="bla"
  }  
}

My expected output would be just "data:imageData" in the second array, and moreover there should be two matches (the other one being "data:rawIMGdata").

Did I define my regex the wrong way?

Regards, Broncko

You might want to consider using DOMDocument for parsing HTML, although if this example is as complex as it is going to get then you can probably get away with regex. DOMDocument will always be more robust though.

Try this:

/<img.*?src="(data:[^"]*)"/m

The ? makes the * non-greedy, so it takes the minimum match (by default it grabs as much as it can).

And rather than match anything, you can match anything that isn't a " with [^"].

The .* before was being greedy and matching up to the " in another element.
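To make the effect concrete, here is a minimal sketch that runs the pattern above against a shortened copy of the markup from the question. The shortened `$html` string is an assumption made for readability; the pattern itself is unchanged.

```php
<?php
// Sample markup from the question, shortened to the parts that matter.
$html = '<h1>hello world</h1><img src="data:rawIMGdata" /><p/>'
      . '<img src="sdfsdf.jpg" title="pic1" /><p/>'
      . '<div class="myclass"><img src="data:imageData" /></div>'
      . '<img alt="bla" src="bla.jpg" title="bla" />';

// Lazy .*? takes as little as possible before src=", and [^"]* cannot
// run past the closing quote of the attribute value.
$pat = '/<img.*?src="(data:[^"]*)"/m';

preg_match_all($pat, $html, $matching);

// $matching[1] holds the captured src values: both data: URIs, nothing else.
print_r($matching[1]);
```

Note that the full matches in `$matching[0]` can still span several tags, because `.*?` is allowed to cross a `>`; only the capture group is guaranteed to be clean here.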

You're basically telling PCRE to grab too much information. Regular expression quantifiers match as much as possible by default, which is why you're getting so much extra stuff in your matches. Firstly, switch to the non-greedy variants for matching the initial whitespace and/or the contents of the element. Secondly, introduce a proper delimiter to match the end of the attribute's contents. Here's the pattern you ought to be using:

$pat = '/<img.*?src="(data:[^"]*)"/m';
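Both changes are needed, which a quick comparison can show. The following is only a sketch: the array keys (`greedy`, `lazy_prefix`, `lazy_bounded`) are made-up labels, and the `$html` string is a shortened version of the one in the question.

```php
<?php
$html = '<img src="data:rawIMGdata" /><img src="sdfsdf.jpg" title="pic1" />'
      . '<div class="myclass"><img src="data:imageData" /></div>'
      . '<img alt="bla" src="bla.jpg" title="bla" />';

$patterns = [
    // Original pattern: both quantifiers are greedy.
    'greedy'       => '/<img.*src="(data:.*)"/m',
    // Only the first quantifier made lazy; the capture still overruns.
    'lazy_prefix'  => '/<img.*?src="(data:.*)"/m',
    // Lazy prefix plus a capture that stops at the closing quote.
    'lazy_bounded' => '/<img.*?src="(data:[^"]*)"/m',
];

$counts = [];
foreach ($patterns as $label => $pat) {
    // preg_match_all returns the number of full pattern matches.
    $counts[$label] = preg_match_all($pat, $html, $m);
    echo $label, ': ', $counts[$label], " match(es)\n";
    print_r($m[1]);
}
```

Only the last variant finds both data: URIs; the first two each return a single match whose capture runs on to the last double quote in the string.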

If you are trying to parse valid (or almost valid) HTML, you may try using tools made for parsing XML, like DOM, which lets you browse through the document quite effectively.

RegExp will definitely do the job, but once you swap ' for " or the HTML changes from <img src=""> to <img class="" src=""> you may have an issue.
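If you do stay with a regex, it can be hardened somewhat against those two changes. This is a sketch rather than a complete solution: the pattern and the sample `$html` are illustrative assumptions, and it still does not handle unquoted attribute values or a `>` inside an attribute.

```php
<?php
// Same idea as the question's markup, but with single quotes and an
// extra attribute in front of src.
$html = "<img class='thumb' src='data:rawIMGdata' />"
      . '<img class="x" src="data:imageData" />'
      . '<img src="bla.jpg" />';

// (["\']) captures the opening quote and \1 requires the same quote to close.
// [^>]*? keeps the scan for src= inside a single tag.
$pat = '/<img\b[^>]*?\bsrc=(["\'])(data:.*?)\1/i';

preg_match_all($pat, $html, $m);

// Group 1 is the quote character, so the src values are in group 2.
print_r($m[2]);
```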

XML parsing utilities also usually take care of escaping and "unescaping" attribute values, and they handle duplicate attributes.

For example, use DOMXPath (here's [tutorial]):

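$doc = new DOMDocument;
// load() reads an XML file; for an HTML string use $doc->loadHTML($html) instead
$doc->load('book.xml');
$xpath = new DOMXPath($doc);
$query = '//img';

$entries = $xpath->query($query);

foreach ($entries as $entry) {
    // Skip <img> elements that have no src attribute
    if (!$entry->hasAttribute('src')) {
        continue;
    }

    $src = $entry->getAttribute('src');

    // Keep only values that start with "data:"
    if (strncmp($src, 'data:', 5) !== 0) {
        continue;
    }

    // Everything after the "data:" prefix
    $content = substr($src, 5);

    // Do whatever you need
}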
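Applied to the HTML string from the question rather than an XML file, the same approach might look like the sketch below. It assumes the DOM extension is available; the `$html` value is a slightly tidied copy of the question's markup, and the XPath predicate with `starts-with()` is one possible way to move the prefix check into the query.

```php
<?php
$html = '<html><head></head><body><h1>hello world</h1>'
      . '<img src="data:rawIMGdata" /><img src="sdfsdf.jpg" title="pic1" />'
      . '<div class="myclass"><img src="data:imageData" /></div>'
      . '<img alt="bla" src="bla.jpg" title="bla" /></body></html>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Let XPath do the filtering: only <img> elements whose src starts with "data:".
$dataSrcs = [];
foreach ($xpath->query('//img[starts-with(@src, "data:")]') as $img) {
    $dataSrcs[] = $img->getAttribute('src');
}

print_r($dataSrcs);
```

Because the parser works on attributes rather than raw text, quote style and attribute order no longer matter.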

Try using a 'lazy' expression:

$pat = '/<img(.*?)src="(data:.*?)"/m';

More information: http://www.regular-expressions.info/repeat.html
