
Repeating capture group with a regex pattern

I'm trying to get a list of products off a website, including the individual product codes. The product codes are 5-digit codes, and the elements range in complexity from

<p>Part Number: 67001</p>

to

<p>Part Number: 50545 &ndash; 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>

Unfortunately, 5-digit patterns appear throughout the web pages, so I can't just use /\d{5}/.

I'm after a regex that extracts only the 5-digit codes inside the Part Number elements and not those in the rest of the web page.

Something like: /\<p\>Part\s*Number\:\s*((\d{5}) repeat this capture group n times)\<\/p\>/

I know I can do it by breaking the page down in stages and applying one regex after another, e.g.:

1st stage: /\<p\>Part\s*Number\:\s*.*?\<\/p\>/
2nd stage: /\d{5}/
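
For reference, the staged approach above might look roughly like this. It is only a sketch: PHP and its preg_match_all() function are an assumption (the question does not depend on a particular language), the variable names are illustrative, and the "Zip Code" paragraph is a made-up decoy standing in for the unrelated 5-digit numbers elsewhere on the page.

```php
<?php
// Illustrative sample; the Part Number paragraphs come from the examples above.
$html = <<<'HTML'
<p>Zip Code: 99501</p>
<p>Part Number: 67001</p>
<p>Part Number: 50545 &ndash; 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
HTML;

$partNumbers = [];

// 1st stage: isolate each paragraph that opens with "Part Number:".
preg_match_all('/<p>Part\s*Number:\s*.*?<\/p>/', $html, $paragraphs);

// 2nd stage: pull every 5-digit run out of each isolated paragraph.
foreach ($paragraphs[0] as $paragraph) {
    preg_match_all('/\d{5}/', $paragraph, $codes);
    $partNumbers = array_merge($partNumbers, $codes[0]);
}

print_r($partNumbers);
```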

But is it possible to do it in one regex pattern, and if so, how?

I am far wiser now than I was a year ago, so I have completely scrubbed my original advice. The best / most reliable approach when trying to parse valid HTML is to use a DOM parser. XPath makes node/element hunting super easy. A regex pattern is still an appropriate tool once you have disqualified <p> tags that do not start with the Part Number keyword.

Code:

$html = <<<HTML
<p>Zip Code: 99501</p>
<p>Part Number: 67001</p>
<p>Part Number: 98765 - 10000kg capacity</p>
<p>Some dummy/interfering text. Part Number: 12345</p>
<p>Zip Codes: 99501, 99524 , 85001 and 72201</p>
<p>Part Number: 50545 &ndash; 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
HTML;

$partnos = [];

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//p[starts-with(., 'Part Number: ')]") as $node) {
    // echo "Qualifying text: {$node->nodeValue}\n";
    if (preg_match_all('~\b\d{5}\b~', $node->nodeValue, $matches)) {
        $partnos = array_merge($partnos, $matches[0]); //or array_push($partnos, ...$matches[0]);
    }
}
var_export($partnos);

Output:

array (
  0 => '67001',
  1 => '98765',
  2 => '50545',
  3 => '50525',
  4 => '50520',
  5 => '50555',
  6 => '50575',
)

The xpath query says:

//p                  #find p tags at any level/position in the dom
[starts-with(.       #with a substring at the start of the node's text
, 'Part Number: ')]  #that literally matches "Part Number: "

The regex pattern uses word boundary metacharacters (\b) to differentiate part numbers from non-part numbers, which is why the 10000 in "10000kg" is not collected. If you need the pattern to be adjusted because of some data that is not represented in your question, let me know and I'll offer further guidance.

Finally, I did flirt with a pure regex solution that incorporated \G to "continue" matching after Part Number: OR after a previous match, but this type of pattern is a little bit harder to conceptualize, and again, a DOM parser is a more stable tool than regex when processing valid HTML.
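
For the curious, a \G-based pattern along those lines could be sketched as follows. This is one possible formulation written for the sample markup above, not a tested drop-in for real pages: it assumes the paragraph opens with exactly <p>Part Number: (no attributes, no leading whitespace) and that </p> reliably closes it.

```php
<?php
// Same sample markup as the DOM example above.
$html = <<<'HTML'
<p>Zip Code: 99501</p>
<p>Part Number: 67001</p>
<p>Part Number: 98765 - 10000kg capacity</p>
<p>Some dummy/interfering text. Part Number: 12345</p>
<p>Zip Codes: 99501, 99524 , 85001 and 72201</p>
<p>Part Number: 50545 &ndash; 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
HTML;

$pattern = '~
    (?:
        <p>Part\sNumber:\s   # enter a qualifying paragraph
      |
        \G(?!\A)             # or resume exactly where the previous match ended
    )
    (?:(?!</p>).)*?          # move forward lazily without crossing the closing tag
    \b(\d{5})\b              # capture one standalone 5-digit code
~xs';

preg_match_all($pattern, $html, $matches);
$partnos = $matches[1];

var_export($partnos);
```

The (?!\A) lookahead stops \G from succeeding at the very start of the string, and the (?!</p>) check keeps a continued match from leaking into the next paragraph, so the zip codes are never reached.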

If I understood your question correctly, you should just be able to do this:

Part\sNumber:\s(\d{5})

Given that your string contains all the Part Number elements, as demonstrated below:

<p>Part Number: 67001</p>

<p>Part Number: 50545 &ndash; 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>

<p>Part Number: 23425 - 55kg Drum 50575 *Indent - 175kg Drum</p>

<p>Part Number: 52232</p>
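
As a quick check of what this pattern returns on that input, here is a small sketch assuming PHP's preg_match_all() (the delimiters and variable names are illustrative). Note that the capture group sits directly after the label, so each paragraph contributes only its first code; later codes in the same paragraph, such as 50525 or 50575, are not captured.

```php
<?php
// The four sample paragraphs listed above.
$html = <<<'HTML'
<p>Part Number: 67001</p>
<p>Part Number: 50545 &ndash; 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
<p>Part Number: 23425 - 55kg Drum 50575 *Indent - 175kg Drum</p>
<p>Part Number: 52232</p>
HTML;

// Group 1 holds the 5-digit code that immediately follows "Part Number: ".
preg_match_all('/Part\sNumber:\s(\d{5})/', $html, $matches);
$firstCodes = $matches[1];

print_r($firstCodes);
```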
