PHP Regex to match everything between <body style=…> and </body> tag

I've got a cURL function that grabs everything on a specified page, but I only want the elements between the body tags. I found this nifty regex to match everything between <body> and </body>, which worked. But then I realized that one of the pages I need to use cURL on actually has a body tag with style info within it, so what I actually want to match is everything between <body style=...> and </body>. Does anyone know the regex to match that? Here's all of my code thus far...

<?php
    error_reporting(E_ALL); 
    ini_set("display_errors", "1");

    $pageToLoad = $_POST['load'];

        function get_data($url) {
            $ch = curl_init();
            $timeout = 5;
            curl_setopt($ch, CURLOPT_HEADER, 0);
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt ($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
            $data = curl_exec($ch);
            curl_close($ch);
            return $data;
        }

        $html = get_data($pageToLoad);
        // preg_match() returns 1, 0, or false; the captured text is written to $matches.
        preg_match("~<body[^>]*>(.*?)</body>~si", $html, $matches);
        print_r($matches);
?>

The simplest way is using a regex like this one:

preg_match('|body[^>]*>(.*?)(?=\</body)|si',$html,$match); 

echo $match[1]; 

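You want to use the s and i modifiers so that the dot also matches newlines (letting the match span multiple lines) and the match is case-insensitive. Modifiers are case-sensitive, so write them in lowercase as in the pattern above.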
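To illustrate what the two modifiers change, here is a minimal sketch. The sample HTML string is made up for the demonstration, and the pattern is the one from the question rather than anything new.

```php
<?php
// Hypothetical sample input: upper-case tags and a body that spans several lines.
$html = "<html><BODY style=\"margin:0\">\n<p>Hello</p>\n</BODY></html>";

// No modifiers: "body" is matched case-sensitively and "." stops at newlines.
$plain = preg_match('~<body[^>]*>(.*?)</body>~', $html, $m1);

// "s" lets "." match newlines, "i" makes the match case-insensitive.
$flagged = preg_match('~<body[^>]*>(.*?)</body>~si', $html, $m2);

var_dump($plain);    // int(0)
var_dump($flagged);  // int(1)
echo trim($m2[1]), "\n";
```

With this input the unmodified pattern finds nothing, while the version with both modifiers captures the markup between the body tags.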

It can be a dicey proposition trying to find a pattern in HTML when you're trying to include attributes as part of your search pattern. For example, an attribute value can be single or double quoted, and most parsers will be able to manage even if somebody forgot to quote something or mismatched the quotes. Since you're just looking for a specific attribute name, it's easier, but there are still gotchas, such as when the attribute name you're looking for exists as a value in another attribute.

(Heck, your original simple regex would incorrectly match some improbable strings like <bodycustomelement>...</body>.)

Since a style attribute is almost always followed by an equal sign, I will use that fact to find it. I will also make sure I match a body element, and not some improbable mutant like the example above.

<body\s[^>]*style\s*=[^>]*>(.*?)</body>

This is essentially the same as your original regex, but with \s[^>]*style\s*= in the middle of it.

  1. \s ensures that there is whitespace right after "body", so that the tag can only be a body element.
  2. [^>]* matches any character except >, zero or more times
  3. style matches the string "style"
  4. \s* allows for white space between style and the equal sign
  5. = matches the string "="
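The breakdown above can be tried out with a short sketch. The sample strings and the $samples and $results names are invented for illustration. Only the pattern itself comes from this answer, wrapped in ~ delimiters with the s and i modifiers from the question.

```php
<?php
// The pattern from this answer, with the question's delimiters and modifiers.
$pattern = '~<body\s[^>]*style\s*=[^>]*>(.*?)</body>~si';

// Hypothetical inputs covering the cases discussed above.
$samples = [
    'styled'   => '<body class="a" style = "margin:0"><p>Hi</p></body>',
    'unstyled' => '<body><p>Hi</p></body>',
    'mutant'   => '<bodycustomelement style="x">Hi</body>',
    'gotcha'   => '<body class="style=x">Hi</body>',
];

$results = [];
foreach ($samples as $name => $html) {
    // Store the captured body content, or null when the pattern does not match.
    $results[$name] = preg_match($pattern, $html, $m) ? $m[1] : null;
}

var_dump($results);
```

The "gotcha" entry shows the caveat mentioned earlier: the text style= sitting inside another attribute's value is enough to satisfy the pattern, so that body is captured even though it has no style attribute.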

I'm hard pressed to think of an example that would befuddle this regex that wouldn't also cause problems for a parser. I suppose it would fail if somebody added white space between the < and body in the opening tag, or put a space or any other characters inside the closing body tag. Somebody might also omit the closing body tag altogether.
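As a rough check of those edge cases, the following sketch runs the pattern against a few deliberately malformed strings. The inputs and the $cases and $matched names are made up for this example.

```php
<?php
$pattern = '~<body\s[^>]*style\s*=[^>]*>(.*?)</body>~si';

// Hypothetical malformed inputs, plus one well-formed control.
$cases = [
    'control'        => '<body style="x">text</body>',
    'space_in_open'  => '< body style="x">text</body>',
    'space_in_close' => '<body style="x">text</body >',
    'missing_close'  => '<body style="x">text',
];

$matched = [];
foreach ($cases as $name => $html) {
    // preg_match() returns 1 on a match and 0 when nothing matches.
    $matched[$name] = preg_match($pattern, $html);
}

var_dump($matched);
```

Only the control string matches. Each of the three malformed variants makes the pattern fail, which is the behavior described above.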

You can keep adding to the regex to handle these cases, but for probably any case you'll encounter in the wild, what I've given will work fine.
