
Match tags within tags with preg_match_all in PHP

I'm currently working on a way to parse an HTML document into a database. I'm not allowed to change any formatting in the HTML document. In the following example, I need to find which tags have the class "Category" and then grab the text inside that tag, in this example "Example Text".

How do I get the pattern to match more than just the tags that are closed right after they open?

$tags = "<p class=Category style='margin-left:0in;text-indent:0in'><a name='_
Toc390163149'></a><a name='_Ref388370252'></a><a
name='_Toc122858606'><span lang=EN-GB>3.<span style='font:7.0pt 'Times New 
Roman''>&nbsp;</span></span><span lang=EN-GB>Example Text</span></a></p>";

// PREG_SET_ORDER groups results per match: $val[0] is the full match, $val[1] to $val[4] are the capture groups.
preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $tags, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
    echo "matched: " . htmlspecialchars($val[0]) . "<br>";
    echo "part 1: " . htmlspecialchars($val[1]) . "<br>";
    echo "part 2: " . htmlspecialchars($val[2]) . "<br>";
    echo "part 3: " . htmlspecialchars($val[3]) . "<br>";
    echo "part 4: " . htmlspecialchars($val[4]) . "<br><br>";
}

Outputs:

matched: <a name="_Toc390163149"></a>
part 1: <a name="_Toc390163149">
part 2: a
part 3:
part 4: </a>

matched: <a name="_Ref388370252"></a>
part 1: <a name="_Ref388370252">
part 2: a
part 3:
part 4: </a>

matched: <span lang=EN-GB>When not to follow Rules</span>
part 1: <span lang=EN-GB>
part 2: span
part 3: When not to follow Rules
part 4: </span>

Any ideas?

Short answer: you can't reliably parse a complicated data format such as HTML with regex, or at least you shouldn't.

Long answer: PHP provides a number of libraries for parsing HTML that are both far less effort and far less error-prone than a regex solution. The two of interest are SimpleXML (if you're parsing XHTML, since it needs well-formed XML) and DOMDocument (if you're parsing markup that may or may not be well-formed XML). I'd be inclined to use the latter for HTML.
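To illustrate the difference between the two libraries, here is a minimal sketch. The one-line `$snippet` is a hypothetical, cut-down version of the question's markup (unquoted attributes, as Word tends to produce), not the asker's real document.

```php
<?php
// Hypothetical snippet with unquoted attributes, similar to the question's markup.
$snippet = '<p class=Category><span lang=EN-GB>Example Text</span></p>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// SimpleXML needs well-formed XML; unquoted attribute values are not,
// so simplexml_load_string() reports failure by returning false.
$xml = simplexml_load_string($snippet);
var_dump($xml === false);

// DOMDocument's HTML parser accepts the same input.
$doc = new DOMDocument();
$loaded = $doc->loadHTML($snippet);
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
var_dump($loaded, $text);

libxml_clear_errors();
```

For markup like the question's, this is the main reason to reach for DOMDocument: the HTML parser is lenient about things an XML parser rejects outright.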

Once you've loaded the markup into a DOMDocument, you can use an XPath query to locate all the p elements with the class Category and iterate over them to get their child nodes and content.
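The approach described above could look roughly like this. It is a sketch under assumptions: the `$html` sample is a simplified copy of the question's markup (the inner font name is written with double quotes so the attribute is well-formed, and a second paragraph is added for contrast), and it assumes the heading text sits in the last span of each matching paragraph.

```php
<?php
// Simplified, illustrative copy of the question's markup plus one
// non-matching paragraph. Not the asker's real document.
$html = <<<'HTML'
<p class=Category style='margin-left:0in;text-indent:0in'><a name='_Toc390163149'></a><a
name='_Toc122858606'><span lang=EN-GB>3.<span style='font:7.0pt "Times New Roman"'>&nbsp;</span></span><span lang=EN-GB>Example Text</span></a></p>
<p class=Other><span lang=EN-GB>Ignored</span></p>
HTML;

// Keep parser complaints about sloppy HTML out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select <p> elements whose class attribute contains the token "Category".
// The concat/normalize-space idiom avoids matching e.g. "SubCategory".
$paragraphs = $xpath->query(
    "//p[contains(concat(' ', normalize-space(@class), ' '), ' Category ')]"
);

$texts = [];
foreach ($paragraphs as $p) {
    // Assumption: the wanted text is in the last <span> inside the paragraph.
    $span = $xpath->query('(.//span)[last()]', $p)->item(0);
    // Fall back to the whole paragraph text if there is no <span> at all.
    $texts[] = trim($span !== null ? $span->textContent : $p->textContent);
}

print_r($texts);
```

With the sample above, `$texts` should end up holding the single string "Example Text". If the numbering ("3.") is wanted as well, `$p->textContent` gives the full text of the paragraph, including the non-breaking space between the number and the title.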

