Regex matching succession of P tags

This is a fun little one I've been working on. I've found many solutions, but none is quite the right match. The goal is: "Match p tags only if there are 3 or more in a row."

So I feel like this should be right, but it's not.

<p.*>(.*)<\/p>(?=\s?<p){3,}

Basically, in my words, this says:

  • Match a p tag with anything inside the tag
  • Match anything until you see a closing P tag
  • ONLY match the preceding (the 2 lines above) if and only if it is followed by
    • a whitespace char (maybe) and then a <p
    • and that occurs 3 or more times

The issue is that this works well in JavaScript but not in PHP. PHP says:

Compilation failed: nothing to repeat at offset 28

I've tried different arrangements of parentheses to give it something to repeat, but that produces a regex that matches the wrong things.

And yes, this is for web scraping, but no, I'm doing research, not doing evil things.

Any ideas, maybe? Thanks!

A state machine XML parser (a SAX parser) seems most appropriate to me. Here is an example:

class StateHelper {

    function __construct($filename) {
        $this->p_count = 0;
        $this->p_elements = array();
        $this->in_p = FALSE;
        $this->minimum_in_succession = 3; // the goal is 3 or more <p> in a row
        $this->successive_element_data = array();
        $parser = xml_parser_create();
        xml_set_element_handler($parser, array($this, 'start_element'), NULL);
        xml_set_character_data_handler($parser, array($this, 'character_data'));

        $fp = fopen($filename, 'r')
            or die ("Cannot open $filename");

        while ($data = fread($fp, 4096)) {
            xml_parse($parser, $data, feof($fp)) or 
                die(sprintf('XML ERROR: %s at line %d',
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)));
        }
        xml_parser_free($parser);
        $this->start_element(NULL, "end", NULL);
    }

    function start_element($parser, $element_name, $element_attrs) {
        if ($element_name == 'P') {
            $this->p_count += 1;
            $this->in_p = TRUE;
        } else {
            if ($this->p_count >= $this->minimum_in_succession) {
                $this->successive_element_data[] = $this->p_elements;
            }
            $this->p_elements = array();
            $this->p_count = 0;
            $this->in_p = FALSE;
        }
    }

    function character_data($parser, $data) {
        if ($this->in_p && strlen(trim($data))) {
            $this->p_elements[] = $data;
        }
    }
}

$parseState = new StateHelper("example.html");
print_r($parseState->successive_element_data);

example.html

<html>
    <head>
    </head>
    <body>
        <p>Foo1</p>
        <p>Foo2</p>
        <p>Foo3</p>
        <div>
            <p>Bar1</p>
            <p>Bar2</p>
        </div>
        <ul>
            <li>
                <p>Baz1</p>
                <p>Baz2</p>
                <p>Baz3</p>
                <p>Baz4</p>
            </li>
        </ul>
    </body>
</html>

OUTPUT

Array
(
    [0] => Array
        (
            [0] => Foo1
            [1] => Foo2
            [2] => Foo3
        )

    [1] => Array
        (
            [0] => Baz1
            [1] => Baz2
            [2] => Baz3
            [3] => Baz4
        )

)

PHP is likely giving you that error because repeating a zero-width assertion is pointless; neither Perl nor JavaScript warns you about that.

If the assertion matches once, it will match as many times as you like at that same position, because it doesn't actually consume anything.
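To illustrate the point, here is a minimal sketch (not from the original answers) that moves the {3,} quantifier onto a group that actually consumes text, so the engine has something to repeat. The sample string and the variable names are made up for illustration, and a regex like this remains fragile on real-world HTML.

```php
<?php
// Hypothetical sample input: one run of three <p> tags, then a run of two.
$html = "<p>a</p>\n<p>b</p>\n<p>c</p>\n<div>x</div>\n<p>d</p>\n<p>e</p>";

// One <p>...</p> element. The (?!</p>) guard stops the body from running
// past its own closing tag, so a match cannot swallow neighbouring elements.
$p = '<p\b[^>]*>(?:(?!</p>).)*</p>';

// Three or more such elements in a row, separated only by whitespace.
// The quantifier is on a consuming group, not on a lookahead.
preg_match_all('~(?:' . $p . '\s*){3,}~is', $html, $runs);

// Split each run back into the text of its individual paragraphs.
$paragraphs = [];
foreach ($runs[0] as $block) {
    preg_match_all('~<p\b[^>]*>((?:(?!</p>).)*)</p>~is', $block, $m);
    $paragraphs[] = $m[1];
}

print_r($paragraphs); // one group containing a, b, c; the run of two is skipped
```

Matching the whole run first and splitting it afterwards keeps each pattern simple; if only the first paragraph of each run is needed, the same repeated group could instead be placed inside a single lookahead.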

Depending on what you intend to do, you might be able to get away with a regex. But if you actually need to know anything about the structure of your HTML, you'd be best off using an HTML parsing library.

What is it that you need to do?

Why don't you use XPath instead? The expression would then simply be:

//p[name(following-sibling::*[1]) = 'p' and name(following-sibling::*[2]) = 'p']

The query will find every p element anywhere in the document that has two p elements immediately following it as siblings.

Example (demo):

$html = <<< HTML
<div>
    <p>lore</p>
    <p>ipsum</p>
    <p>dolor</p>
    <br/>
    <p>sit</p>
    <p>amet</p> 
</div>
HTML;

We only want to find the first element in this snippet. The code would then be:

$query = "//p[
    name(following-sibling::*[1]) = 'p' and 
    name(following-sibling::*[2]) = 'p'
]";

print_r(xpath_match_all($query, $html));

Output:

Array(
    [0] => Array(
        [0] => <p>lore</p>
    )
    [1] => Array(
        [0] => lore
    )
)

The resulting array contains the outerHTML and innerHTML for that query.

Of course, you don't have to use the xpath_match_all function. It's just a convenience utility. For alternatives, see "How do you parse and process HTML/XML in PHP?"
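Since xpath_match_all is not a built-in function, here is a rough sketch of how the same query might be run with PHP's bundled DOM extension instead. It assumes that extension is enabled; the 'outer' and 'inner' array keys are names chosen for this illustration and are not the utility's actual output format.

```php
<?php
$html = <<<HTML
<div>
    <p>lore</p>
    <p>ipsum</p>
    <p>dolor</p>
    <br/>
    <p>sit</p>
    <p>amet</p>
</div>
HTML;

// Parse the fragment; loadHTML wraps it in <html><body> as needed.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Same expression as above: a p whose next two element siblings are both p.
$query = "//p[
    name(following-sibling::*[1]) = 'p' and
    name(following-sibling::*[2]) = 'p'
]";

$matches = [];
foreach ($xpath->query($query) as $node) {
    $matches[] = [
        'outer' => $doc->saveHTML($node), // serialized element, tags included
        'inner' => $node->textContent,    // text content only
    ];
}

print_r($matches);
```

With this snippet only the first p ("lore") qualifies, because "ipsum" is followed by a p and then a br, and "sit" has only one following sibling.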
