I'm trying to get a list of products off a website, including the individual product codes. The product codes are 5-digit codes, and the elements range in complexity from
<p>Part Number: 67001</p>
<p>Part Number: 50545 – 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
Unfortunately, 5-digit patterns appear throughout the web pages, so I can't just use /\d{5}/
I'm after a regex that extracts only the 5 digits in the Part Number elements and not from the rest of the web page.
Something like: /\<p\>Part\s*Number\:\s*((\d{5}) repeat this capture group n times)\<\/p\>/
I know I can do it by breaking the page down in stages and applying one regex after another, e.g.
1st stage /\<p\>Part\s*Number\:\s*.*?\<\/p\>/
2nd stage /\d{5}/
But is it possible to do it in one regex pattern, and if so, how?
I am far wiser now than I was a year ago, so I have completely scrubbed my original advice. The best / most reliable approach when trying to parse valid HTML is to use a DOM parser. XPath makes node/element hunting super easy. A regex pattern is still an appropriate tool once you have disqualified <p> tags that do not contain the Part Number keyword.
Code:
$html = <<<HTML
<p>Zip Code: 99501</p>
<p>Part Number: 67001</p>
<p>Part Number: 98765 - 10000kg capacity</p>
<p>Some dummy/interfering text. Part Number: 12345</p>
<p>Zip Codes: 99501, 99524 , 85001 and 72201</p>
<p>Part Number: 50545 – 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
HTML;
$partnos = [];
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//p[starts-with(., 'Part Number: ')]") as $node) {
    // echo "Qualifying text: {$node->nodeValue}\n";
    if (preg_match_all('~\b\d{5}\b~', $node->nodeValue, $matches)) {
        $partnos = array_merge($partnos, $matches[0]);  // or array_push($partnos, ...$matches[0]);
    }
}
var_export($partnos);
Output:
array (
  0 => '67001',
  1 => '98765',
  2 => '50545',
  3 => '50525',
  4 => '50520',
  5 => '50555',
  6 => '50575',
)
The xpath query says:
//p #find p tags at any level/position in the dom
[starts-with(. #with a substring at the start of the node's text
, 'Part Number: ')] #that literally matches "Part Number: "
The regex pattern uses word boundary metacharacters ( \b ) to differentiate part numbers from non-part numbers. If you need the pattern to be adjusted because of some data that is not represented in your question, let me know and I'll offer further guidance.
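To illustrate what the word boundaries buy you, here is a small sketch. The sample string is made up for the demonstration (it is not taken from your site), and the variable names are only illustrative.

```php
<?php
// Hypothetical sample text, invented for this comparison.
$text = 'Part Number: 98765 - 10000kg capacity, batch 1234567';

// Without boundaries: any run of five digits qualifies,
// including the start of "10000kg" and of "1234567".
preg_match_all('~\d{5}~', $text, $loose);

// With boundaries: the five digits must not touch another
// word character (letter, digit, or underscore) on either side.
preg_match_all('~\b\d{5}\b~', $text, $strict);

var_export($loose[0]);
echo "\n";
var_export($strict[0]);
```

The unbounded pattern also picks up 10000 and 12345, while the bounded pattern keeps only 98765.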
Finally, I did flirt with a pure regex solution that incorporated \G to "continue" matching after Part Number: or a previous match, but this type of pattern is a little harder to conceptualize and, again, a DOM parser is a more stable tool than regex when processing valid HTML.
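For the curious, a \G-based pattern could look something like the sketch below. Treat it as an illustration rather than a recommendation: it assumes every qualifying paragraph begins with exactly <p>Part Number: and contains no nested tags, and it has only been reasoned through against the sample markup shown here.

```php
<?php
$html = <<<HTML
<p>Zip Code: 99501</p>
<p>Part Number: 67001</p>
<p>Part Number: 98765 - 10000kg capacity</p>
<p>Some dummy/interfering text. Part Number: 12345</p>
<p>Zip Codes: 99501, 99524 , 85001 and 72201</p>
<p>Part Number: 50545 – 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
HTML;

$pattern = '~
    (?:<p>Part\s+Number:   # start at a qualifying paragraph...
      |\G(?!^))            # ...or resume where the previous match ended
    [^<]*?                 # skip ahead lazily, never crossing a tag
    \b\K\d{5}\b            # \K discards the prefix; keep only the code
~x';

preg_match_all($pattern, $html, $matches);
var_export($matches[0]);
```

Because [^<] cannot step over the closing </p>, the chain of \G matches ends with the paragraph, and the next match has to begin at a fresh <p>Part Number: opener. On this sample it returns the same seven codes as the DOM version.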
If I understood your question correctly, you should just be able to do this:
Part\sNumber:\s(\d{5})
This assumes that your string contains all of the Part Number paragraphs, as demonstrated below:
<p>Part Number: 67001</p>
<p>Part Number: 50545 – 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
<p>Part Number: 23425 - 55kg Drum 50575 *Indent - 175kg Drum</p>
<p>Part Number: 52232</p>
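One caveat: as far as I can tell, this pattern captures only the first code that directly follows each Part Number: label, so additional codes in the same paragraph are skipped. A quick sketch on two of the sample lines (the variable names are just for illustration):

```php
<?php
$html = <<<HTML
<p>Part Number: 67001</p>
<p>Part Number: 23425 - 55kg Drum 50575 *Indent - 175kg Drum</p>
HTML;

// Capture group 1 holds the five digits right after each label.
preg_match_all('~Part\sNumber:\s(\d{5})~', $html, $matches);

var_export($matches[1]);
```

This yields 67001 and 23425; the second code in the last paragraph, 50575, is not captured.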