
preg_match pattern

I want to create a proper preg_match pattern to extract all <link *rel="stylesheet"* /> tags within the <head> of some webpages. This pattern: #<link (.+?)>#is worked fine until I realized that it also catches the <link rel="shortcut icon" href="favicon.ico" /> that's in the <head>. So I want to alter the pattern so that it makes sure the word stylesheet IS somewhere WITHIN the link. I think it needs some lookaround, but I'm not sure how to do it. Any help will be much appreciated.

Here we go again... Don't use a regex to parse HTML; use an HTML parser such as PHP's DOMDocument.
Here's an example of how to use it:

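$html = file_get_contents("https://stackoverflow.com");
$dom = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them for imperfect real-world HTML
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select every <link> element whose rel attribute is exactly "stylesheet"
foreach ($xpath->query("//link[@rel='stylesheet']") as $link)
{
    echo $link->getAttribute("href"), PHP_EOL;
}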
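
The query above matches only when rel is exactly "stylesheet" and looks at the whole document. As a minimal sketch of a variant, assuming you also want values such as rel="alternate stylesheet" and only links inside <head>, the XPath can test for stylesheet as a whole space-separated token. The function name stylesheetHrefs and the sample markup are hypothetical, chosen for illustration only:

```php
<?php
// Hypothetical helper: collects the href of every stylesheet <link> inside <head>.
function stylesheetHrefs(string $html): array
{
    // Buffer libxml warnings while parsing, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Pad the normalized rel value with spaces so "stylesheet" is matched
    // as a whole token and not as part of a longer word.
    $query = "//head/link[contains(concat(' ', normalize-space(@rel), ' '), ' stylesheet ')]";

    $hrefs = [];
    foreach ($xpath->query($query) as $link) {
        $hrefs[] = $link->getAttribute('href');
    }
    return $hrefs;
}

// Sample markup invented for this sketch.
$html = '<html><head>'
      . '<link rel="stylesheet" href="main.css">'
      . '<link rel="alternate stylesheet" href="alt.css">'
      . '<link rel="shortcut icon" href="favicon.ico">'
      . '</head><body></body></html>';

print_r(stylesheetHrefs($html));
```

With this input the favicon link should be skipped while both stylesheet links are returned. Note that the token comparison is case-sensitive, so rel="Stylesheet" would need extra handling.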


To do this with regular expressions, it is best to work in two steps. The first step separates the head from the body, so that you are only working within the head.

The second step parses the head, looking for the desired links.
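
The first step is not spelled out in this answer, so here is one possible sketch of it. The helper name extractHead and the head-matching pattern are assumptions made for illustration, and the pattern is deliberately simple (it would be confused by a literal </head> inside a comment or script):

```php
<?php
// Hypothetical helper: returns the markup between <head ...> and </head>,
// or null when the document has no head element.
function extractHead(string $html): ?string
{
    // \b keeps <header> from being mistaken for <head>;
    // the s flag lets . span newlines, the i flag ignores case.
    if (preg_match('~<head\b[^>]*>(.*?)</head\s*>~is', $html, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Sample markup invented for this sketch: one stylesheet link in the head,
// one in the body that should be left out.
$html = '<html><head><title>t</title><link rel="stylesheet" href="a.css"></head>'
      . '<body><link rel="stylesheet" href="b.css"></body></html>';

var_dump(extractHead($html));
```

The returned string can then be passed to preg_match_all with the link pattern, so that links in the body are never considered.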

Parsing Links

<link\s*(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?rel=['"]?stylesheet)(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>

Regular expression visualization

This expression will do the following:

  • find all the <link tags
  • ensure the link tag has the desired attribute rel='stylesheet'
  • allow attribute values to have single, double or no quotes
  • avoid messy and difficult edge cases that the HTML Parse Police cry about

Example

Live Demo

https://regex101.com/r/hC5dD0/1

Sample Text

Note the difficult edge case in the last line.

<link *rel="stylesheet"* />
<link rel="shortcut icon" href="favicon.ico" />
<link onmouseover=' rel="stylesheet" ' rel="shortcut icon" href="favicon.ico">

Sample Matches

<link *rel="stylesheet"* />
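
To check this result from PHP, the following sketch runs the expression from "Parsing Links" over the sample text. The pattern is written with single backslashes, as in the Explanation table, and stored in a nowdoc so that its quotes and backslashes need no PHP escaping. The ~ delimiters and the i flag are choices made for this sketch, not part of the original expression:

```php
<?php
// The link pattern, wrapped in ~ delimiters with a case-insensitive flag
// (both are assumptions of this sketch).
$pattern = <<<'REGEX'
~<link\s*(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?rel=['"]?stylesheet)(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>~i
REGEX;

// The three lines from "Sample Text", including the onmouseover edge case.
$sample = <<<'HTML'
<link *rel="stylesheet"* />
<link rel="shortcut icon" href="favicon.ico" />
<link onmouseover=' rel="stylesheet" ' rel="shortcut icon" href="favicon.ico">
HTML;

// $matches[0] holds every full match of the pattern.
preg_match_all($pattern, $sample, $matches);
print_r($matches[0]);
```

Only the first line should be reported. The third line is rejected because the lookahead consumes the whole quoted onmouseover value as a single unit, so the rel="stylesheet" text inside it is never examined as an attribute.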

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <link                    '<link'
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    rel=                     'rel='
----------------------------------------------------------------------
    ['"]?                    any character of: ''', '"' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
    stylesheet               'stylesheet'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^']                     any character except: '''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^"]                     any character except: '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
