Parsing name value pairs from long string using regular expressions

Regular expressions are a sure way to bring me back to earth. I don't think I've ever produced one without help, so here is another cry for help. Here's the example input:

{{Taxobox | name = Impala | status = LC | status_system = IUCN3.1 | status_ref = {{IUCN2008|assessors=IUCN SSC Antelope Specialist Group |year=2008|id=550|title=Aepyceros melampus|downloaded=18 January 2009}} Database entry includes a brief justification of why this species is of least concern | trend = stable | image = Serengeti Impala3.jpg | image_caption= Young male Impala in [[Serengeti]], [[Tanzania]] | image2=Female_impala.jpg | image2_caption= Female Impala in [[Mikumi National Park]], [[Tanzania]] | regnum = [[Animal]]ia | phylum = [[Chordate|Chordata]] | classis = [[Mammal]]ia | ordo = [[Even-toed ungulate|Artiodactyla]] | familia = [[Bovid]]ae | subfamilia = '''Aepycerotinae''' | subfamilia_authority = [[John Edward Gray|Gray]], 1872 | genus = '''''Aepyceros''''' | genus_authority = [[Carl Jakob Sundevall|Sundevall]], 1847 | species = '''''A. melampus''''' | subdivision_ranks = Subspecies | subdivision = * ''[[Aepyceros melampus petersi|A. m. petersi]]'' * ''A. m. melampus'' | range_map=Leefgebied_impala.JPG | range_map_caption=Range map | binomial = ''Aepyceros melampus'' | binomial_authority = ([[Martin Lichtenstein|Lichtenstein]], 1812) | range_map = Impala.png | range_map_caption = Distribution of the Impala
Red =A. m. melampus
Blue = A. m. petersi }}

Sorry I can't get this formatted in a better way. It's a long string with no newlines in it. It is essentially a set of name-value pairs. Each pair has the format:

pipe space attributename space equals space attributevalue space

There's no obvious end character to a pair, other than the pipe of the next pair.

What I'd like to do is turn this into an associative array in PHP. For what it's worth, here's my attempt at finding at least some matches:

$pattern = "/\|([^=|^.]*)=([^\|]*)|/s";
if (preg_match_all($pattern, $pagecontent, $matches)) {
    var_dump($matches);
} else {
    echo "no match!";
}

It's way off, so don't pay too much attention to it. I'm hoping some regex masters can help me out here.

You need to isolate the string contained between the {{ and }} delimiters before you try to extract the pairs. A naive match will fail on your example because of the nested grouping in status_ref = {{...}}. You will need preg_replace_callback and a (?R) pattern for that.
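As a rough sketch of that first step, the following isolates the body of the outermost {{...}} with a recursive (?R) pattern. The helper name extractTemplateBody is made up for illustration, it uses plain preg_match rather than preg_replace_callback, and it assumes the text contains no stray single { or } characters.

```php
<?php
// Hypothetical helper (not from the original answer): returns the text
// between the outermost {{ and }}, or null if no balanced pair is found.
function extractTemplateBody(string $text): ?string
{
    // [^{}]++ consumes ordinary characters; (?R) recurses into the whole
    // pattern so that a nested {{...}} is skipped as one balanced unit.
    $pattern = '/\{\{((?:[^{}]++|(?R))*)\}\}/s';

    if (preg_match($pattern, $text, $m)) {
        return $m[1];
    }
    return null;
}

// Shortened stand-in for the page content in the question.
$pagecontent = '{{Taxobox | status_ref = {{IUCN2008|id=550}} | trend = stable }}';
var_dump(extractTemplateBody($pagecontent));
```

The nested {{IUCN2008|...}} stays inside the returned body, so the pair-matching step still has to cope with the pipes it contains.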

A regex like this might work for the pairs themselves, however:

"/(?<=  ^ | \|)  # start of string, or after any |
  \s*(\w+)       # name
  (?:\s*=\s*(    #  =
  \{\{.*?\}\}    # {{....}}
  | \[\[.*?\]\]  # [[...]]
  | \(.*?\)      # (...)
  | [^|]+) )?    # plain values
 /sx"

It will give you an associative array with:

$array = array_combine($matches[1], $matches[2]);

Lonely name tokens (a name with no = after it) will of course not get an associated value; they end up mapped to an empty string.
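Putting the pattern and array_combine together, a minimal end-to-end sketch could look like the following. The shortened $body sample and the trim() step are assumptions added for illustration; without trimming, plain values keep the whitespace that precedes the next pipe.

```php
<?php
// Pair pattern from the answer, single-quoted so backslashes stay literal.
$pattern = '/(?<=  ^ | \|)  # start of string, or after any |
  \s*(\w+)       # name
  (?:\s*=\s*(    #  =
  \{\{.*?\}\}    # {{....}}
  | \[\[.*?\]\]  # [[...]]
  | \(.*?\)      # (...)
  | [^|]+) )?    # plain values
 /sx';

// Assumed sample: template body with the outer {{ and }} already removed.
$body = 'Taxobox | name = Impala | status = LC | trend = stable ';

$pairs = [];
if (preg_match_all($pattern, $body, $matches)) {
    // $matches[1] holds the names, $matches[2] the raw values
    // (an empty string where a name had no "= value" part).
    $pairs = array_combine($matches[1], array_map('trim', $matches[2]));
}

print_r($pairs);
```

Note that array_combine keeps only the last value for a repeated name, so a key that appears twice in the input (range_map in the question's example) ends up with its final value. A value that starts with [[...]], {{...}} or (...) is also cut off right after the closing bracket, so any text that follows it before the next pipe is dropped.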
