I am trying to parse the HTTP Accept header to extract all the details from it. I am making the following assumptions:
Each entry must start with, and contain at least, type/subtype, optionally with a +basetype. For example text/html or application/xhtml+xml.
Entries are separated by a comma. After the initial type/subtype, an entry may contain a variable number of key=value parameter pairs, each introduced by a semicolon (whitespace is allowed around the semicolons but not around the = of a key=value pair). For example application/xhtml+xml; q=0.8; test=hello
I want to get all of this information into an array.
What I currently have is:
preg_match_all('/([^,;\/=\s]+)\/([^,;\/=\s+]+)(\+([^,;\/=\s+]+))?(\s?;\s?([^,;=]+)=([^,;=]+))*/', $header, $result, PREG_SET_ORDER);
which, to my mind, gives an initial capture group with the type, then one with the subtype, then an optional one with the basetype, then an optional repeating one, introduced by ;, which contains the key and the value of a key=value pair.
When used with the header string application/xhtml+xml; q=0.9; level=3 , text/html,application/json;test=hello
this gives me:
Array
(
[0] => Array
(
[0] => application/xhtml+xml; q=0.9; level=3
[1] => application
[2] => xhtml
[3] => +xml
[4] => xml
[5] => ; level=3
[6] => level
[7] => 3
)
[1] => Array
(
[0] => text/html
[1] => text
[2] => html
)
[2] => Array
(
[0] => application/json;test=hello
[1] => application
[2] => json
[3] =>
[4] =>
[5] => ;test=hello
[6] => test
[7] => hello
)
)
which is fine except that only the last key=value pair is given for the first entry (application/xhtml+xml; q=0.9; level=3); the q=0.9 is missing.
Is there any way I can include all the (variable number of) parameters in each match, while still using just the one regular expression, or do I have to use a separate regular expression / function for the key=value pairs?
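For reference, the missing q=0.9 is consistent with how PCRE treats a repeated capturing group: each iteration overwrites the previous one, so only the last key=value pair survives. One possible workaround, sketched below, uses a second pass over the parameters. The function name parseEntry and the returned array keys are hypothetical, and splitting on commas assumes that no parameter value contains a quoted comma.

```php
<?php
// Hypothetical helper: two-pass parsing of a single Accept entry.
function parseEntry(string $entry): ?array
{
    // Pass 1: type, subtype, optional +basetype, and the raw parameter tail.
    $entryPattern = '~^\s*([^,;/=\s]+)/([^,;/=\s+]+)(?:\+([^,;/=\s+]+))?'
        . '((?:\s*;\s*[^,;=\s]+=[^,;=\s]+)*)\s*$~';
    if (!preg_match($entryPattern, $entry, $m)) {
        return null; // not a well-formed entry under the stated assumptions
    }

    // Pass 2: collect every key=value pair from the tail.
    preg_match_all('~;\s*([^,;=\s]+)=([^,;=\s]+)~', $m[4], $pairs, PREG_SET_ORDER);
    $params = [];
    foreach ($pairs as $pair) {
        $params[$pair[1]] = $pair[2];
    }

    return ['type' => $m[1], 'subtype' => $m[2], 'basetype' => $m[3], 'params' => $params];
}

// Usage: split the header on commas, then parse each entry.
$header = 'application/xhtml+xml; q=0.9; level=3 , text/html,application/json;test=hello';
$entries = array_map('parseEntry', explode(',', $header));
```

This gives up the single-expression goal in exchange for keeping every parameter of every entry.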
EDIT:
The kind of array result I would like is this (obviously items 0, 3, 5, 8... etc for each content type are unnecessary, but I don't know if they can be excluded):
Array
(
[0] => Array
(
[0] => application/xhtml+xml; q=0.9; level=3
[1] => application
[2] => xhtml
[3] => +xml
[4] => xml
[5] => ; q=0.9
[6] => q
[7] => 0.9
[8] => ; level=3
[9] => level
[10] => 3
)
[1] => Array
(
[0] => text/html
[1] => text
[2] => html
)
[2] => Array
(
[0] => application/json;test=hello
[1] => application
[2] => json
[3] =>
[4] =>
[5] => ;test=hello
[6] => test
[7] => hello
)
)
This allows me to grab the key and value for each parameter without doing any further regexp or string functions.
EDIT:
I have accepted Ka's answer, which seems to give me all that I need. Using the pattern (?:\G\s?,\s?|^)(\w+)\/(\w+)(?:\+(\w+))?|(?<!^)\G(?:\s?;\s?(\w+)=([\w\.]+))
on the same string (without PREG_SET_ORDER) gives the result:
Array
(
[0] => Array
(
[0] => application/xhtml+xml
[1] => ; q=0.9
[2] => ; level=3
[3] => , text/html
[4] => ,application/json
[5] => ;test=hello
)
[1] => Array
(
[0] => application
[1] =>
[2] =>
[3] => text
[4] => application
[5] =>
)
[2] => Array
(
[0] => xhtml
[1] =>
[2] =>
[3] => html
[4] => json
[5] =>
)
[3] => Array
(
[0] => xml
[1] =>
[2] =>
[3] =>
[4] =>
[5] =>
)
[4] => Array
(
[0] =>
[1] => q
[2] => level
[3] =>
[4] =>
[5] => test
)
[5] => Array
(
[0] =>
[1] => 0.9
[2] => 3
[3] =>
[4] =>
[5] => hello
)
)
from which I can compile an associative array using the array of index 1 to determine the boundaries between the individual content types with their parameters.
Many thanks to Ka for his/her help.
EDIT:
Changed the expression again: it also needs to be able to parse wildcard MIME types such as text/*. So the expression now becomes:
(?:\G\s?,\s?|^)(\w+|\*)\/(\w+|\*)(?:\+(\w+))?|(?<!^)\G(?:\s?;\s?(\w+)=([\w\.]+))
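The compilation step described above (using group 1 to find the boundary between content types) could look roughly like this. It is a minimal sketch using the wildcard-aware pattern; the function name parseAccept and the array keys are hypothetical, and PREG_SET_ORDER is used here so that each match can be inspected on its own.

```php
<?php
// Hypothetical helper: groups the flat matches into one array per content type.
function parseAccept(string $header): array
{
    $pattern = '/(?:\G\s?,\s?|^)(\w+|\*)\/(\w+|\*)(?:\+(\w+))?|(?<!^)\G(?:\s?;\s?(\w+)=([\w\.]+))/';
    preg_match_all($pattern, $header, $matches, PREG_SET_ORDER);

    $entries = [];
    foreach ($matches as $m) {
        if ($m[1] !== '') {
            // A non-empty group 1 marks the start of a new content type.
            $entries[] = [
                'type'     => $m[1],
                'subtype'  => $m[2],
                'basetype' => $m[3] ?? '', // trailing unmatched groups are omitted by PHP
                'params'   => [],
            ];
        } elseif ($entries) {
            // Otherwise the match is a key=value pair for the most recent content type.
            $entries[count($entries) - 1]['params'][$m[4]] = $m[5];
        }
    }
    return $entries;
}

$result = parseAccept('application/xhtml+xml; q=0.9; level=3 , text/html,application/json;test=hello');
```

Because every alternative is anchored with \G, matching stops at the first entry that does not fit the pattern, so a shorter-than-expected result suggests a malformed header.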
I would recommend you use PHP's parse functions instead of trying to write your own.
See this for details: http://php.net/manual/en/ref.http.php
and more particularly for your situation:
A little different from your desired output, but it will safely get all the values without the ones you don't need:
RegEx: (\w+)\/(\w+)(?:\+(\w+))?|(?:\s?;\s?(\w+)=([\w\.]+))
(with the global flag g)
Explained demo: http://regex101.com/r/fM1gJ2
Edit: This is better used on already validated headers, as it is composed with a regex or (|). You can use this regex to validate: \w+\/\w+(\+\w+)?(\s?;\s?\w+=[\w\.]+)*
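As a sketch of how the validation pattern might be applied in PHP: the version below anchors it at both ends and checks one comma-separated entry at a time. The function name isValidEntry is hypothetical, and the anchors are an addition, since without them preg_match would accept any string that merely contains a valid entry.

```php
<?php
// Hypothetical helper: true if a single entry matches the validation pattern in full.
function isValidEntry(string $entry): bool
{
    return preg_match('/^\w+\/\w+(\+\w+)?(\s?;\s?\w+=[\w\.]+)*$/', trim($entry)) === 1;
}

// Usage: validate every entry before running the capturing pattern.
$header = 'application/xhtml+xml; q=0.9; level=3 , text/html,application/json;test=hello';
$allValid = !in_array(false, array_map('isValidEntry', explode(',', $header)), true);
```

Note that this pattern, as given, does not accept wildcard types such as text/*.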
OR
Something along these lines:
RegEx: (\w+)\/(\w+)(?:\+(\w+))?(?:\s?;\s?(\w+)=([\w\.]+))?
with the last part, (?:\s?;\s?(\w+)=([\w\.]+))?, repeated as many times as you think you will need.
Demo: http://regex101.com/r/yI6uS1
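Rather than pasting the optional tail by hand, the repetition could be generated. A minimal sketch, assuming an upper bound on the number of parameters is known in advance; the function name buildAcceptPattern and the $maxParams parameter are hypothetical.

```php
<?php
// Hypothetical helper: builds the pattern with the optional key=value tail repeated $maxParams times.
function buildAcceptPattern(int $maxParams): string
{
    $head = '(\w+)\/(\w+)(?:\+(\w+))?';
    $tail = '(?:\s?;\s?(\w+)=([\w\.]+))?';
    return '/' . $head . str_repeat($tail, $maxParams) . '/';
}

$header = 'application/xhtml+xml; q=0.9; level=3 , text/html,application/json;test=hello';
preg_match_all(buildAcceptPattern(3), $header, $result, PREG_SET_ORDER);
// Each match now holds type, subtype, basetype, then key, value, key, value, ...
```

Any parameters beyond $maxParams are silently left out of the match, which is the main drawback of this variant.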
Validation and capture in one go, using the global flag g
RegEx: (\w+)\/(\w+)(?:\+(\w+))?|(?<!^)\G(?:\s?;\s?(\w+)=([\w\.]+))
Explained demo here: http://regex101.com/r/bR7kU2
Update (content types must always be separated by a comma)
RegEx: (?:\G\s?,\s?|^)(\w+)\/(\w+)(?:\+(\w+))?|(?<!^)\G(?:\s?;\s?(\w+)=([\w\.]+))
Demo: http://regex101.com/r/nG4oV0
And a shorter repeating end pattern for v2, in case you extend the key=value character sets: (?:\s?;\s?((?4))=((?5)))?
explained here. Or even shorter, if you allow some unnecessary data to be saved in the array, with this regex:
(\w+)\/(\w+)(?:\+(\w+))?(\s?;\s?([\w-]+)=([\w!:\$\.-]+))?((?4))?
and repeat ((?4))? as needed; see it here.