How can I correct this regular expression to capture all the repeating parameter groups in PHP?

I am trying to parse the HTTP Accept header to extract all the details from it. I am making the following assumptions:

- Each entry must start with, and contain at least, type/subtype, optionally with a +basetype. For example: text/html or application/xhtml+xml
- Entries are separated by a comma.
- After the initial type/subtype, the entry may contain a variable number of key=value parameter pairs, separated by semicolons (whitespace is allowed around the semicolons but not around the = of a key=value pair). For example: application/xhtml+xml; q=0.8; test=hello

I want to get all of this information into an array.

What I currently have is

preg_match_all('/([^,;\/=\s]+)\/([^,;\/=\s+]+)(\+([^,;\/=\s+]+))?(\s?;\s?([^,;=]+)=([^,;=]+))*/', $header, $result, PREG_SET_ORDER);

which, to my mind, gives an initial capture group with the type, then one with the subtype, then an optional one with the basetype, then an optional repeating group, introduced by ; , which captures the key and the value of each key=value pair.

When used with the header string application/xhtml+xml; q=0.9; level=3 , text/html,application/json;test=hello this gives me:

Array
(
    [0] => Array
        (
            [0] => application/xhtml+xml; q=0.9; level=3 
            [1] => application
            [2] => xhtml
            [3] => +xml
            [4] => xml
            [5] => ; level=3 
            [6] => level
            [7] => 3 
        )

    [1] => Array
        (
            [0] => text/html
            [1] => text
            [2] => html
        )

    [2] => Array
        (
            [0] => application/json;test=hello 
            [1] => application
            [2] => json
            [3] => 
            [4] => 
            [5] => ;test=hello 
            [6] => test
            [7] => hello 
        )

)

which is fine except that only the last key=value is given for the first entry ( application/xhtml+xml; q=0.9; level=3 ), the q=0.9 is missing.
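
The behaviour can be reproduced with a much smaller case. The following is a minimal sketch using a simplified pattern and subject string (both are illustrative assumptions, not the full expression above): a capturing group that is repeated with * is overwritten on every iteration, so only the last key=value pair is left in the result.

```php
<?php
// Simplified, illustrative pattern: type/subtype followed by repeated parameters.
$subject = 'a/b; q=0.9; level=3';
preg_match('/\w+\/\w+(\s?;\s?(\w+)=([\w.]+))*/', $subject, $m);

// The repeated group ran twice, but only its final iteration is kept:
// $m[1] is '; level=3', $m[2] is 'level', $m[3] is '3'.
// The earlier q=0.9 pair was matched and then overwritten.
print_r($m);
```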

Is there any way I can include all the (variable number of) parameters in each match, while still using just the one regular expression, or do I have to use a separate regular expression / function for the key=value pairs?

EDIT:

The kind of array result I would like is this (obviously items 0, 3, 5, 8... etc for each content type are unnecessary, but I don't know if they can be excluded):

Array
(
    [0] => Array
        (
            [0] => application/xhtml+xml; q=0.9; level=3 
            [1] => application
            [2] => xhtml
            [3] => +xml
            [4] => xml
            [5] => ; q=0.9 
            [6] => q
            [7] => 0.9 
            [8] => ; level=3 
            [9] => level
           [10] => 3 
        )

    [1] => Array
        (
            [0] => text/html
            [1] => text
            [2] => html
        )

    [2] => Array
        (
            [0] => application/json;test=hello 
            [1] => application
            [2] => json
            [3] => 
            [4] => 
            [5] => ;test=hello 
            [6] => test
            [7] => hello 
        )

)

This allows me to grab the key and value for each parameter without doing any further regexp or string functions.

EDIT:

I have accepted Ka's answer, which seems to give me all that I need. Using his pattern (?:\G\s?,\s?|^)(\w+)\/(\w+)(?:\+(\w+))?|(?<!^)\G(?:\s?;\s?(\w+)=([\w\.]+)) on the same string (without PREG_SET_ORDER) gives the result:

Array
(
    [0] => Array
        (
            [0] => application/xhtml+xml
            [1] => ; q=0.9
            [2] => ; level=3
            [3] =>  , text/html
            [4] => ,application/json
            [5] => ;test=hello
        )

    [1] => Array
        (
            [0] => application
            [1] => 
            [2] => 
            [3] => text
            [4] => application
            [5] => 
        )

    [2] => Array
        (
            [0] => xhtml
            [1] => 
            [2] => 
            [3] => html
            [4] => json
            [5] => 
        )

    [3] => Array
        (
            [0] => xml
            [1] => 
            [2] => 
            [3] => 
            [4] => 
            [5] => 
        )

    [4] => Array
        (
            [0] => 
            [1] => q
            [2] => level
            [3] => 
            [4] => 
            [5] => test
        )

    [5] => Array
        (
            [0] => 
            [1] => 0.9
            [2] => 3
            [3] => 
            [4] => 
            [5] => hello
        )

)

from which I can compile an associative array using the array of index 1 to determine the boundaries between the individual content types with their parameters.
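
One way that compilation step might look is sketched below. The function name parseAccept and the output keys (type, subtype, basetype, params) are hypothetical, and the sketch uses PREG_SET_ORDER for convenience rather than the default order shown above. A non-empty group 1 is treated as the start of a new content type; every other match is taken to be a parameter of the most recent one.

```php
<?php
// Hypothetical helper (not part of the accepted answer): groups the flat
// list of matches into one entry per content type.
function parseAccept(string $header): array
{
    $pattern = '/(?:\G\s?,\s?|^)(\w+)\/(\w+)(?:\+(\w+))?|(?<!^)\G(?:\s?;\s?(\w+)=([\w\.]+))/';
    preg_match_all($pattern, $header, $matches, PREG_SET_ORDER);

    $entries = [];
    foreach ($matches as $match) {
        if ($match[1] !== '') {
            // A non-empty group 1 marks the start of a new content type.
            $entries[] = [
                'type'     => $match[1],
                'subtype'  => $match[2],
                // Trailing unmatched groups are omitted in set order.
                'basetype' => $match[3] ?? '',
                'params'   => [],
            ];
        } elseif ($entries) {
            // Otherwise the match is a key=value pair for the latest entry.
            $entries[count($entries) - 1]['params'][$match[4]] = $match[5];
        }
    }
    return $entries;
}

print_r(parseAccept('application/xhtml+xml; q=0.9; level=3 , text/html,application/json;test=hello'));
```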

Many thanks to Ka for his/her help.

EDIT:

I changed the expression again, because it also needs to be able to parse wildcard MIME types such as text/* . So the expression now becomes:

(?:\G\s?,\s?|^)(\w+|\*)\/(\w+|\*)(?:\+(\w+))?|(?<!^)\G(?:\s?;\s?(\w+)=([\w\.]+))
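
As a quick check of the wildcard version, here is a small sketch run against a made-up header (the header string is an assumption chosen for illustration). In the default pattern order, empty strings in the type and subtype arrays mark the matches that were parameters.

```php
<?php
$pattern = '/(?:\G\s?,\s?|^)(\w+|\*)\/(\w+|\*)(?:\+(\w+))?|(?<!^)\G(?:\s?;\s?(\w+)=([\w\.]+))/';

// Illustrative header containing two wildcard entries.
preg_match_all($pattern, 'text/*;q=0.5, */*;q=0.1', $m);

// $m[1] holds the types, $m[2] the subtypes, $m[4] and $m[5] the keys and values.
print_r($m);
```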

I would recommend using PHP's HTTP parsing functions instead of trying to write your own.

See this for details: http://php.net/manual/en/ref.http.php

and more particularly for your situation:

http://php.net/manual/en/function.http-parse-headers.php

This is a little different from your desired output, but it will safely get all the values without the ones you don't need:

RegEx: (\w+)\/(\w+)(?:\+(\w+))?|(?:\s?;\s?(\w+)=([\w\.]+)) (with the global flag g)
Explained demo: http://regex101.com/r/fM1gJ2
Edit: This is better used on already validated headers, because it is built as a regex alternation (or) and will match either half on its own. You can use the regex \w+\/\w+(\+\w+)?(\s?;\s?\w+=[\w\.]+)* to validate.
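
A rough sketch of that validate-then-capture idea in PHP follows. The function name captureAccept, the anchoring with ^ and $, and the choice to split on commas before validating are my assumptions about how the two expressions could be combined, not a definitive implementation. In PHP the global flag g corresponds to preg_match_all().

```php
<?php
// Hypothetical helper: validate each comma-separated entry first,
// then capture with the alternation pattern.
function captureAccept(string $header): ?array
{
    // The validation regex, anchored so the whole entry must match.
    $validate = '/^\w+\/\w+(\+\w+)?(\s?;\s?\w+=[\w\.]+)*$/';
    foreach (explode(',', $header) as $entry) {
        if (!preg_match($validate, trim($entry))) {
            return null; // reject the whole header on the first bad entry
        }
    }

    // The capture regex: either a content type or a single key=value pair.
    $capture = '/(\w+)\/(\w+)(?:\+(\w+))?|(?:\s?;\s?(\w+)=([\w\.]+))/';
    preg_match_all($capture, $header, $matches, PREG_SET_ORDER);
    return $matches;
}

print_r(captureAccept('application/xhtml+xml; q=0.9; level=3 , text/html,application/json;test=hello'));
```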

OR

Something along the lines:

RegEx: (\w+)\/(\w+)(?:\+(\w+))?(?:\s?;\s?(\w+)=([\w\.]+))?
with the last part (?:\s?;\s?(\w+)=([\w\.]+))? repeated as many times as you think you'll need
Demo: http://regex101.com/r/yI6uS1
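
In PHP the repetition does not have to be typed out by hand. The sketch below builds the pattern with str_repeat(); the limit of three parameters ($maxParams) and the sample header are assumptions for illustration only.

```php
<?php
// Assumption: no entry carries more than three parameters.
$maxParams = 3;

// The optional key=value group from the pattern above.
$param = '(?:\s?;\s?(\w+)=([\w\.]+))?';
$pattern = '/(\w+)\/(\w+)(?:\+(\w+))?' . str_repeat($param, $maxParams) . '/';

preg_match_all(
    $pattern,
    'application/xhtml+xml; q=0.9; level=3 , text/html',
    $m,
    PREG_SET_ORDER
);

// Each parameter lands in its own pair of groups: 4/5, 6/7, 8/9.
print_r($m);
```

Any parameters beyond $maxParams are silently ignored, which is the main trade-off of this variant.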

Update:

Validation and capture in one go, using the global flag g
RegEx: (\w+)\/(\w+)(?:\+(\w+))?|(?<!^)\G(?:\s?;\s?(\w+)=([\w\.]+))
Explained demo here: http://regex101.com/r/bR7kU2
Update (content types must always be separated by a comma)
RegEx: (?:\G\s?,\s?|^)(\w+)\/(\w+)(?:\+(\w+))?|(?<!^)\G(?:\s?;\s?(\w+)=([\w\.]+))
Demo: http://regex101.com/r/nG4oV0

And a shorter repeating end pattern for v2: (?:\s?;\s?((?4))=((?5)))? in case you widen the key=value character sets, explained here. It can be shorter still, if you allow some unnecessary data to be saved in the array, with this regex:

(\w+)\/(\w+)(?:\+(\w+))?(\s?;\s?([\w-]+)=([\w!:\$\.-]+))?((?4))?

and repeat ((?4))? as needed, see it here.
