
PHP regex html data attributes with fixed markup

I have the following fixed-pattern markup scenarios:

<div class="myclass" id="id123" data-foo="bar">content</div>
<div class="myclass" id="id123" data-foo="bar" >content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux">content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux" >content</div>

I'm trying to parse out the following values:

id123
bar
qux (if it ever exists)

I was able to figure out how to match the individual scenarios, but I'm having trouble coming up with one final rule that works for all of them.

/<div class="myclass" id="(.*)" data-foo="(.*)"(data-baz="(.*)")?>/

I seem to be missing some basic regex principle. I tried boundaries, end anchors, and whitespace, but had no luck.

  1. I do not endorse using regex to parse html, but you say that you are optimizing for speed and that the markup is predictably structured.
  2. You just need to use lazy quantifiers with those dots and take a little more care with the optional spaces.
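To see why both points matter, here is a small sketch that runs the pattern from the question, unchanged, against two of the sample lines. The variable names ($original, $line, $spaced, $m, $matched) are only illustrative.

```php
<?php
// The pattern from the question, unchanged: greedy dots, no optional spaces.
$original = '/<div class="myclass" id="(.*)" data-foo="(.*)"(data-baz="(.*)")?>/';

// Scenario with data-baz and no space before ">".
$line = '<div class="myclass" id="id123" data-foo="bar" data-baz="qux">content</div>';
preg_match($original, $line, $m);

// The greedy second dot runs past the closing quote of data-foo and
// swallows the whole data-baz attribute, so the optional group never matches.
var_export($m[2]);
echo "\n";

// Scenario with a space before ">": the pattern has nowhere to absorb
// that space, so there is no match at all.
$spaced = '<div class="myclass" id="id123" data-foo="bar" >content</div>';
$matched = preg_match($original, $spaced);
var_export($matched);
echo "\n";
```

The first call captures bar" data-baz="qux in group 2 instead of bar, and the second call reports zero matches. Lazy quantifiers fix the first problem, and optional spaces fix the second.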

Code:

$text = <<<TEXT
<div class="myclass" id="id123" data-foo="bar">content</div>
<div class="myclass" id="id123" data-foo="bar" >content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux">content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux" >content</div>
TEXT;

preg_match_all('~<div class="myclass" id="(.*?)" data-foo="(.*?)" ?(?:data-baz="(.*?)" ?)?>~', $text, $matches);
var_export(array_slice($matches, 1));

Output:

array (
  0 => 
  array (
    0 => 'id123',
    1 => 'id123',
    2 => 'id123',
    3 => 'id123',
  ),
  1 => 
  array (
    0 => 'bar',
    1 => 'bar',
    2 => 'bar',
    3 => 'bar',
  ),
  2 => 
  array (
    0 => '',
    1 => '',
    2 => 'qux',
    3 => 'qux',
  ),
)

You can improve the regex efficiency by not using lazy quantifiers. If you know that the attribute values will not contain double quotes, then you can use a negated character class with a greedy quantifier: [^"]*
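As a rough sketch of that variant, assuming the attribute values never contain a double quote, the pattern could look like the one below. The PREG_SET_ORDER flag and the $rows array with its id/foo/baz keys are my own additions for readability and are not part of the code above.

```php
<?php
$text = <<<TEXT
<div class="myclass" id="id123" data-foo="bar">content</div>
<div class="myclass" id="id123" data-foo="bar" >content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux">content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux" >content</div>
TEXT;

// [^"]* is greedy but cannot cross a double quote, so it stops at the
// closing quote of each attribute value without the backtracking of .*?
$pattern = '~<div class="myclass" id="([^"]*)" data-foo="([^"]*)" ?(?:data-baz="([^"]*)" ?)?>~';

// PREG_SET_ORDER groups the captures per matched div (one row per tag).
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$rows = array();
foreach ($matches as $m) {
    $rows[] = array(
        'id'  => $m[1],
        'foo' => $m[2],
        // data-baz is optional; normalise a missing or empty capture to null.
        'baz' => (isset($m[3]) && $m[3] !== '') ? $m[3] : null,
    );
}

var_export($rows);
```

Each element of $rows then describes one div, with baz set to null for the two lines that have no data-baz attribute.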
