Match tags within tags with preg_match_all in PHP

Question

I'm currently working on a way to parse an HTML document into a database. I'm not allowed to change any formatting in the HTML document. In the following example I need to find which tags have the class "Category", and then grab the data within that tag, in this example "Example Text".

How do I get the pattern to match more than just the tags that are closed immediately, so that tags containing nested tags are matched as well?

$tags = "<p class=Category style='margin-left:0in;text-indent:0in'><a name='_
Toc390163149'></a><a name='_Ref388370252'></a><a
name='_Toc122858606'><span lang=EN-GB>3.<span style='font:7.0pt 'Times New 
Roman''>&nbsp;</span></span><span lang=EN-GB>Example Text</span></a></p>";

preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $tags, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
    // $val[0] is the full match; $val[1]..$val[4] are the capture groups
    echo "matched: " . htmlspecialchars($val[0]) . "<br>";
    echo "part 1: " . htmlspecialchars($val[1]) . "<br>";
    echo "part 2: " . htmlspecialchars($val[2]) . "<br>";
    echo "part 3: " . htmlspecialchars($val[3]) . "<br>";
    echo "part 4: " . htmlspecialchars($val[4]) . "<br><br>";
}

Outputs:

matched: <a name="_Toc390163149"></a>
part 1: <a name="_Toc390163149">
part 2: a
part 3:
part 4: </a>

matched: <a name="_Ref388370252"></a>
part 1: <a name="_Ref388370252">
part 2: a
part 3:
part 4: </a>

matched: <span lang=EN-GB>When not to follow Rules</span>
part 1: <span lang=EN-GB>
part 2: span
part 3: When not to follow Rules
part 4: </span>

Any ideas?

Answer 1

Short answer: you can't reliably parse complicated data formats such as HTML with regex, or at least you shouldn't.

Long answer: PHP provides a number of libraries for parsing HTML that are both far less effort and far less error-prone than a regex solution. The two of interest are SimpleXML (if you're parsing XHTML) and DOMDocument (if you're parsing markup that may or may not be well-formed XML). I'd be inclined to use the latter for HTML.

Once you've loaded the markup into a DOMDocument, you can use an XPath query to locate all the p elements with the class Category (p.Category in CSS terms) and iterate over them to get their child nodes and content.
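
As a rough sketch of that approach: the sample markup below is a simplified, single-line version of the snippet in the question, and the variable names and the exact XPath expression are my own assumptions, so adapt them to the real document.

```php
<?php
// Simplified stand-in for the question's markup, plus one paragraph that should be ignored.
$html = "<p class=Category style='margin-left:0in;text-indent:0in'>"
      . "<a name='_Toc390163149'></a><a name='_Ref388370252'></a>"
      . "<a name='_Toc122858606'><span lang=EN-GB>3.<span>&nbsp;</span></span>"
      . "<span lang=EN-GB>Example Text</span></a></p>"
      . "<p class=Other>Ignored</p>";

// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select <p> elements whose class attribute contains the token "Category".
$query = "//p[contains(concat(' ', normalize-space(@class), ' '), ' Category ')]";

$results = [];
foreach ($xpath->query($query) as $p) {
    // textContent concatenates the text of all descendants, however deeply nested.
    $results[] = $p->textContent;
}

foreach ($results as $text) {
    echo $text, PHP_EOL;
}
```

Because the parser builds a tree, nesting depth no longer matters. textContent returns everything inside the paragraph (here the "3." numbering as well as "Example Text"), so if you only want part of it, run a second, relative XPath query against $p for the specific span you need.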

solution1
0 ACCEPTED 2014-07-22 09:34:37
