I want to create a proper preg_match pattern to extract all <link *rel="stylesheet"* /> tags within the <head> of some webpages. The pattern #<link (.+?)>#is worked fine until I realized it also catches the <link rel="shortcut icon" href="favicon.ico" /> that's in the <head>. So I want to alter the pattern so that it makes sure the word stylesheet appears somewhere WITHIN the link. I think it needs some kind of lookaround, but I'm not sure how to do it. Any help will be much appreciated.
Here we go again... don't use a regex to parse HTML; use an HTML parser like PHP's DOMDocument.
Here's an example of how to use it:
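// Fetch the page to inspect.
$html = file_get_contents("https://stackoverflow.com");
// Collect parser warnings about imperfect real-world HTML instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select every <link> element whose rel attribute is exactly "stylesheet".
foreach ($xpath->query("//link[@rel='stylesheet']") as $link)
{
    echo $link->getAttribute("href"), PHP_EOL;
}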
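A variation on the example above, as a minimal sketch rather than a definitive implementation. It assumes two things that the original does not state: that only links inside the <head> are wanted (as the question asks), and that a multi-token value such as rel="alternate stylesheet" should also count. The sample page and the function name stylesheet_hrefs are hypothetical, and the page is inlined so the sketch runs without network access.

```php
<?php
// Hypothetical sample page, inlined so the sketch runs without network access.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
<title>Demo</title>
<link rel="stylesheet" href="main.css" />
<link rel="alternate stylesheet" href="dark.css" />
<link rel="shortcut icon" href="favicon.ico" />
</head>
<body><p>Hello</p></body>
</html>
HTML;

// Hypothetical helper: returns the href of every stylesheet link in the <head>.
function stylesheet_hrefs(string $html): array
{
    // Silence libxml warnings while parsing, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumption: rel is treated as a space-separated token list, so the
    // query looks for the whole token "stylesheet" (case-sensitive).
    $query = "//head/link[contains(concat(' ', normalize-space(@rel), ' '), ' stylesheet ')]";

    $hrefs = [];
    foreach ($xpath->query($query) as $link) {
        $hrefs[] = $link->getAttribute('href');
    }
    return $hrefs;
}

print_r(stylesheet_hrefs($html));
```

For the sample page this collects main.css and dark.css and skips the favicon. If only an exact rel="stylesheet" should match, the simpler predicate [@rel='stylesheet'] from the example above is enough.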
To do this with regular expressions, it is best to work in two steps. The first step separates the head from the body, so that you are only working within the head.
The second step parses the head, looking for the desired links:
<link\s*(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?rel=['"]?stylesheet)(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>
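This expression will do the following:
- match <link tags
- require that the tag has a rel='stylesheet attribute, whose value may be single-quoted, double-quoted, or unquoted
- skip over quoted attribute values, so that text inside another attribute's value is not mistaken for the rel attribute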
Live Demo
https://regex101.com/r/hC5dD0/1
Sample Text
Note the difficult edge case in the last line.
<link *rel="stylesheet"* />
<link rel="shortcut icon" href="favicon.ico" />
<link onmouseover=' rel="stylesheet" ' rel="shortcut icon" href="favicon.ico">
Sample Matches
<link *rel="stylesheet"* />
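The two-step approach described above can be sketched in PHP roughly as follows. This is an illustration under stated assumptions, not the answer's own code: the pattern used to isolate the <head>, the ~ delimiters, the i flag, the sample page, and the function name find_stylesheet_links are all hypothetical. The link expression is the one from this answer, written with single backslashes as a PCRE pattern.

```php
<?php
// Hypothetical page: the <head> holds one stylesheet link plus the two
// non-matching tags from the sample text; the <body> holds a stray link.
$html = <<<'HTML'
<html><head>
<link rel="stylesheet" href="main.css" />
<link rel="shortcut icon" href="favicon.ico" />
<link onmouseover=' rel="stylesheet" ' rel="shortcut icon" href="favicon.ico">
</head><body>
<link rel="stylesheet" href="in-body.css">
</body></html>
HTML;

// Hypothetical helper: returns the full text of each matching <link> tag.
function find_stylesheet_links(string $html): array
{
    // Step 1 (assumed pattern, not from the answer): isolate the <head> contents.
    if (!preg_match('#<head\b[^>]*>(.*?)</head>#is', $html, $head)) {
        return [];
    }

    // Step 2: the link expression, kept in a nowdoc so that no backslash
    // needs extra escaping. The ~ delimiters and the i flag are assumptions.
    $pattern = <<<'REGEX'
~<link\s*(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?rel=['"]?stylesheet)(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>~i
REGEX;

    preg_match_all($pattern, $head[1], $matches);
    return $matches[0];
}

print_r(find_stylesheet_links($html));
```

Only the main.css tag is returned. The favicon tag fails the lookahead, the onmouseover tag fails because its quoted value is consumed as a unit before the real rel attribute is checked, and the link in the <body> is never seen because step one discards everything outside the <head>.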
NODE                     EXPLANATION
----------------------------------------------------------------------
  <link                    '<link'
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    rel=                     'rel='
----------------------------------------------------------------------
    ['"]?                    any character of: ''', '"' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
    stylesheet               'stylesheet'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^']                     any character except: '''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^"]                     any character except: '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------