Regex PHP find and match HTML tags with specific data-attributes

I'm trying to parse a CTP file (CakePHP template with HTML and PHP tags in it) and want to match all the HTML tags with specific data-attributes (data-edit="true"). Each tag with data-edit="true" MUST have a data-type="..." and data-name="..." attribute. I would like to capture these attributes in (named) groups, so I can use them in my code. So far I have the following regex:

\<(?<tagname>\w+).*?(?>data\-edit="true").*?\>(?<content>.*?)\<\/(?&tagname)\>

Here are some samples of the tags it should match:

<h4 data-type="text" data-edit="true" data-name="SomeName">Some content, with or without newlines.</h4>

and

<span data-edit="true" data-type="wysiwyg" data-name="Beoordeling">Some text 
with <strong>tags</strong> and newlines in it that 
should not break the parser.</span>

From the above examples I would like the regex to return the values of the data-type and data-name attributes, and of course the content (between the tags) itself.

The data-attributes can occur in any order, and other attributes (such as classes) may be present in the tags. So far I've managed to get the content of only the tags with a data-edit="true" attribute, but when the content contains a newline, the match breaks. I also can't capture the other data-attributes.

Is it even possible to achieve what I want? I know regex isn't the preferred way to parse HTML, but as this is a CTP file with all kinds of other tags in it, I can't use an XML parser.

Edit: sample code: https://regex101.com/r/nF6a96/2

Parsing HTML with regex should generally be avoided, but since this is an attribute lookup within a single tag rather than a nested-tag scenario, a regex is adequate for a quick check here.

You need lookaheads to ensure that the tag contains all three attributes you are looking for. You can use this regex:

<(\w+)(?=.*?data-edit="true")(?=.*?data-type="[^"]*")(?=.*?data-name="[^"]*")[^>]*?>.*?<\/\1>

Explanation:

  • <(\w+) --> matches the start of a tag and captures the tag name in group 1, so it can be matched again in the closing tag
  • (?=.*?data-edit="true") --> lookahead that ensures the data-edit="true" attribute is present
  • (?=.*?data-type="[^"]*") --> lookahead that ensures the data-type attribute is present
  • (?=.*?data-name="[^"]*") --> lookahead that ensures the data-name attribute is present
  • [^>]*?> --> matches the remaining attributes and the > that closes the opening tag
  • .*? --> matches whatever text is between the opening and closing tags (use the s modifier if that text can contain newlines, since . does not match a newline by default)
  • <\/\1> --> matches the closing tag, using the tag name captured in group 1

Demo
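
As a rough sketch of how the lookahead approach could be used from PHP while also capturing the attribute values: the pattern below is an adaptation, not the exact regex above. It confines each lookahead to the opening tag with [^>]*, names the groups, and adds the s modifier so that . also matches newlines. It assumes attribute values are double-quoted and that the opening tag contains no > inside an attribute (which would not hold for attributes containing PHP such as $a->b). The function name findEditableTags is hypothetical.

```php
<?php
// Hypothetical helper: returns one entry per tag that carries
// data-edit="true", data-type="..." and data-name="..." (in any order).
function findEditableTags(string $source): array
{
    $pattern = '~<(?<tagname>\w+)\b'                  // tag name, captured for the closing tag
        . '(?=[^>]*\bdata-edit="true")'               // must contain data-edit="true"
        . '(?=[^>]*\bdata-type="(?<type>[^"]*)")'     // capture the data-type value
        . '(?=[^>]*\bdata-name="(?<name>[^"]*)")'     // capture the data-name value
        . '[^>]*>'                                    // rest of the opening tag
        . '(?<content>.*?)'                           // content, lazily, across newlines (s modifier)
        . '</\k<tagname>>~s';                         // closing tag with the same name

    preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

    $results = [];
    foreach ($matches as $m) {
        $results[] = [
            'type'    => $m['type'],
            'name'    => $m['name'],
            'content' => $m['content'],
        ];
    }
    return $results;
}

// Usage with the two samples from the question.
$sample = <<<'HTML'
<h4 data-type="text" data-edit="true" data-name="SomeName">Some content, with or without newlines.</h4>
<span data-edit="true" data-type="wysiwyg" data-name="Beoordeling">Some text
with <strong>tags</strong> and newlines in it that
should not break the parser.</span>
HTML;

foreach (findEditableTags($sample) as $tag) {
    echo $tag['name'], ' (', $tag['type'], ")\n";
}
```

Note that \k<tagname> is a backreference to the captured tag name, whereas (?&tagname) in the question's pattern re-runs the \w+ subpattern and would therefore accept any closing tag name. The sketch still stops at the first closing tag with the same name, so an element nested inside another element of the same tag name would be cut short.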

XPath is a fantastic and versatile tool. Your logic transfers seamlessly to an XPath query that is easy to construct, read, and maintain.

Furthermore, XPath is superior to regex because it will successfully match qualifying elements no matter the order of the attributes. Regex will struggle to do the same with just one preg_ call.

The following runs a single query that validates the attributes, then loops over the matching nodes to extract and store the values.

Code: ( Demo )

$dom = new DOMDocument;
libxml_use_internal_errors(true);    // collect warnings about malformed HTML instead of printing them
$dom->loadHTML($text, LIBXML_NOENT); // $text holds the template source
//libxml_clear_errors();             // optionally discard the collected warnings
$xpath = new DOMXPath($dom);

$results = [];                       // initialise so var_export() also works when nothing matches
foreach ($xpath->query("//*[@data-edit='true' and @data-type and @data-name]") as $node) {
    $results[] = [
        'type' => $node->getAttribute('data-type'),
        'name' => $node->getAttribute('data-name'),
        'text' => $node->textContent,
    ];
}
var_export($results);

Output:

array (
  0 => 
  array (
    'type' => 'wysiwyg',
    'name' => 'Beoordeling',
    'text' => 'We beoordelen uw aanvraag en                                        berichten u over de acceptatie daarvan.',
  ),
  1 => 
  array (
    'type' => 'text',
    'name' => 'Bellen',
    'text' => 'We bellen u voor een afspraak.',
  ),
  2 => 
  array (
    'type' => 'text',
    'name' => 'Technisch specialist',
    'text' => 'Technisch specialist neemt bij u alles nog even door.',
  ),
)
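
One caveat: textContent returns only the text, so nested markup such as <strong> is dropped. If the inner HTML between the tags is needed, a possible sketch is to serialise each child node with DOMDocument::saveHTML(). The helper name extractEditableNodes and the 'html' key are hypothetical, and the input is assumed to be plain HTML; how libxml treats PHP tags embedded in a CTP file is not addressed here.

```php
<?php
// Hypothetical helper: like the loop above, but keeps nested markup.
function extractEditableNodes(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence malformed-HTML warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the previous setting

    $xpath = new DOMXPath($dom);
    $results = [];
    foreach ($xpath->query("//*[@data-edit='true' and @data-type and @data-name]") as $node) {
        // Serialise every child node to rebuild the element's inner HTML.
        $inner = '';
        foreach ($node->childNodes as $child) {
            $inner .= $dom->saveHTML($child);
        }
        $results[] = [
            'type' => $node->getAttribute('data-type'),
            'name' => $node->getAttribute('data-name'),
            'html' => $inner,
        ];
    }
    return $results;
}

// Usage with the two samples from the question (attributes in different orders).
$sample = <<<'HTML'
<h4 data-type="text" data-edit="true" data-name="SomeName">Some content, with or without newlines.</h4>
<span data-edit="true" data-type="wysiwyg" data-name="Beoordeling">Some text
with <strong>tags</strong> and newlines in it that
should not break the parser.</span>
HTML;

var_export(extractEditableNodes($sample));
```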
