Parsing “flat” HTML structure with PHP DOM

I'm trying to use PHP DOM to parse an HTML file that I want to convert to JSON. Unfortunately, the HTML is fairly flat, and I have no way to change that. By flat I mean the structure looks something like this:

<h2>title</h2>
<span>child node</span>
<span>another child</span>
<h2>title</h2>
<span>child node</span>
<span>another child</span>
<h2>title</h2>
<span>child node</span>
<span>another child</span>

I need to be able to get each <h2> and treat the <span> elements that follow it as its children. I'm not completely set on PHP DOM if there's a better alternative; it's simply what I found in an answer I came across, so please feel free to suggest anything. What I really need is to convert this HTML string into JSON, and PHP DOM looks like my best bet so far.

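$XML = <<<XML
<h2>title</h2>
<span>child node</span>
<span>another child</span>
<h2>title</h2>
<span>child node</span>
<span>another child</span>
<h2>title</h2>
<span>child node</span>
<span>another child</span>
XML;

// loadHTML() wraps the fragment in <html><body>, so the flat
// elements end up as direct children of <body>.
$dom = new DOMDocument;
$dom->loadHTML($XML);
$xp = new DOMXPath($dom);

// Build a second document in which each <h2> and the elements
// that follow it are nested inside their own <div>.
$new = new DOMDocument;
$root = $new->createElement('root');
$new->appendChild($root);
$current = null;

// Walk the element children of <body> in document order.
foreach ($xp->query('//body/*') as $node) {
    if ($node->nodeName === 'h2') {
        // Each <h2> starts a new group; attaching the <div> right away
        // means the last group is not lost when the loop ends.
        $current = $root->appendChild($new->createElement('div'));
    } elseif ($current === null) {
        // Skip any element that appears before the first <h2>.
        continue;
    }
    $current->appendChild($new->importNode($node, true));
}

// Serialize as XML so SimpleXML can parse it, then encode as JSON.
$xml2 = simplexml_load_string($new->saveXML());
echo json_encode($xml2);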
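
If the intermediate XML document is not needed, the same grouping can be sketched by building a plain PHP array and encoding that directly. This is only an illustration, not part of the original answer: the function name groupByHeading() and the output keys "title" and "children" are hypothetical choices, and the sketch assumes every element that is not an <h2> belongs to the nearest preceding <h2>.

```php
<?php
// Hypothetical helper (not from the original post): groups a flat run of
// elements under the nearest preceding <h2>.
function groupByHeading(string $html): array
{
    // Collect parser warnings internally instead of emitting them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    // Wrap the fragment explicitly so a <body> element always exists.
    $dom->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $groups = [];
    $body = $dom->getElementsByTagName('body')->item(0);

    foreach ($body->childNodes as $node) {
        // Ignore whitespace text nodes and comments between elements.
        if ($node->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        $text = trim($node->textContent);
        if ($node->nodeName === 'h2') {
            // An <h2> opens a new group.
            $groups[] = ['title' => $text, 'children' => []];
        } elseif ($groups) {
            // Any other element is attached to the most recent group;
            // elements before the first <h2> are skipped.
            $groups[count($groups) - 1]['children'][] = $text;
        }
    }

    return $groups;
}

$html = '<h2>title</h2><span>child node</span><span>another child</span>'
      . '<h2>title</h2><span>child node</span>';

echo json_encode(groupByHeading($html), JSON_PRETTY_PRINT);
```

The result is a JSON array with one object per <h2>, which avoids the SimpleXML round trip. The trade-off is that only the text of each child is kept; tag names and attributes would need extra fields if they matter.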
