繁体   English   中英

preg_match模式

[英]preg_match pattern

我想创建一个适当的preg_match模式,以提取某些网页<head>的所有<link *rel="stylesheet"* /> 因此,此模式: #<link (.+?)>#is正常工作,直到我意识到它也捕获了<head><link rel="shortcut icon" href="favicon.ico" />为止。 因此,我想更改模式,以确保在链接内某处有单词样式表。 我认为它需要使用一些环顾四周,但我不确定该怎么做。 任何帮助都感激不尽。

在这里,我们再次尝试... 不要使用正则表达式来解析html ,而应使用像PHP DOMDocument这样的html解析器
以下是使用方法的示例:

$html = file_get_contents("https://stackoverflow.com");
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
foreach ($xpath->query("//link[@rel='stylesheet']") as $link)
{
    echo $link->getAttribute("href");
}

PHPFiddle演示

要使用正则表达式执行此操作,最好将其分为两部分进行操作,第一部分是将头部与身体分开,以确保您仅在头部内工作。

然后第二部分将解析头部以查找所需的链接

解析链接

<link\\s*(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\\s>]*)*?rel=['"]?stylesheet)(?:[^>=]|='(?:[^']|\\\\')*'|="(?:[^"]|\\\\")*"|=[^'"][^\\s>]*)*\\s*>

正则表达式可视化

该表达式将执行以下操作:

  • 找到所有的<link标签
  • 确保链接标记具有所需的属性rel='stylesheet
  • 允许属性值具有单引号,双引号或不带引号
  • 避免HTML解析警察大喊大叫的混乱且困难的情况

现场演示

https://regex101.com/r/hC5dD0/1

示范文本

注意最后一行中的困难边缘情况。

<link *rel="stylesheet"* />
<link rel="shortcut icon" href="favicon.ico" />
<link onmouseover=' rel="stylesheet" ' rel="shortcut icon" href="favicon.ico">

比赛样本

<link *rel="stylesheet"* />

说明

NODE                     EXPLANATION
----------------------------------------------------------------------
  <link                    '<link'
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    rel=                     'rel='
----------------------------------------------------------------------
    ['"]?                    any character of: ''', '"' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
    stylesheet               'stylesheet'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^']                     any character except: '''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^"]                     any character except: '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM