
Parsing HTML tags from inside XML in PHP

I'm trying to create my own RSS feed (for learning purposes) using simplexml_load_string while parsing http://uk.news.yahoo.com/rss in PHP. I'm stuck on reading the HTML tags inside the <description> tag.

My code so far looks like this:

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss = simplexml_load_string($feed);

//for each element in the feed
foreach ($rss->channel->item as $item) {
    echo '<h3>'. $item->title . '</h3>'; 

        foreach($item->description as $desc){

             //how to read the href from the a tag???

             //this does not work at all
             $tags = $item->xpath('//a');
             foreach ($tags as $tag) {
                 echo $tag['href'];
             }
       }
}

Any ideas how to extract each HTML tag?

Thanks

The description content has its special characters encoded, so it is not parsed as nodes within the XML; it is just a string. You can decode the special characters, then load the HTML into DOMDocument and do whatever you want with it. For example:

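foreach ($rss->channel->item as $item) {
    echo '<h3>' . $item->title . '</h3>';

    foreach ($item->description as $desc) {
        // Parse the description's HTML with a separate HTML parser.
        $dom = new DOMDocument();
        $dom->loadHTML(htmlspecialchars_decode((string) $desc));

        // Print the href of the first link, if the description has one.
        $anchors = $dom->getElementsByTagName('a');
        if ($anchors->length > 0) {
            echo $anchors->item(0)->getAttribute('href');
        }
    }
}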

XPath is also available for use with DOMDocument; see DOMXPath.
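As a rough illustration of the DOMXPath route, here is a minimal sketch. The HTML string and the example.com URLs are made up for the example and only stand in for one decoded description; the real feed's markup may differ.

```php
<?php
// Hypothetical sample: the HTML of one <description> element.
$html = '<p><a href="http://example.com/story-1.html"><img src="http://example.com/thumb.jpg" alt=""></a>'
      . 'Summary text. <a href="http://example.com/story-2.html">More</a></p>';

$dom = new DOMDocument();
// Feed HTML is rarely perfectly valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select the href attribute of every <a> element in the fragment.
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

Querying `//a/@href` returns the attribute nodes directly, so there is no need to check whether each anchor actually has an href.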

The <description> element of the RSS feed contains HTML. As outlined in How to parse CDATA HTML-content of XML using SimpleXML?, you need to get the node value of that element (the HTML) and parse it with an additional parser.

The accepted answer to the linked question already shows this in detail. For SimpleXML, it makes little difference here whether the RSS feed uses CDATA or plain entities, as in your case.

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss  = simplexml_load_string($feed);
$dom  = new DOMDocument(); // the HTML parser used for descriptions' HTML

foreach ($rss->channel->item as $item)
{
    echo '<h3>' . $item->title . '</h3>', "\n";

    foreach ($item->description as $desc)
    {
        $dom->loadHTML((string) $desc); // cast the SimpleXMLElement to its HTML string

        $html = simplexml_import_dom($dom)->body;

        echo $html->p->a['href'], "\n";
    }
}

Example output:

...
<h3>Chantal nears hurricane strength in Caribbean</h3>
http://uk.news.yahoo.com/chantal-nears-hurricane-strength-caribbean-220149771.html
<h3>Placido Domingo In Hospital With Blood Clot</h3>
http://uk.news.yahoo.com/placido-domingo-hospital-blood-clot-215427742.html
<h3>Berlusconi's final tax fraud appeal hearing set for July 30</h3>
http://uk.news.yahoo.com/berlusconis-final-tax-fraud-appeal-hearing-set-july-214714122.html
<h3>China: Men Rescued From River Amid Floods</h3>
http://uk.news.yahoo.com/china-men-rescued-river-amid-floods-213005159.html
<h3>Snowden has not yet accepted asylum in Venezuela - WikiLeaks</h3>
http://uk.news.yahoo.com/snowden-not-yet-accepted-asylum-venezuela-wikileaks-190332291.html
<h3>Three US kidnap victims break silence</h3>
http://uk.news.yahoo.com/three-us-kidnap-victims-release-thankyou-video-093832611.html
...

Hope this helps. Contrary to the accepted answer, I see no reason to apply htmlspecialchars_decode. SimpleXML already decodes the XML entities when the node is cast to a string, so I'm pretty sure decoding a second time breaks things. My example also shows how to stay with the SimpleXML way of accessing further children: once the HTML has been parsed, the DOMNode is turned back into a SimpleXMLElement.
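To make the point about decoding concrete, here is a small sketch. The two-item feed is invented for the example: one description escapes its HTML with entities, the other wraps it in CDATA, and the href deliberately contains an HTML-level `&amp;`.

```php
<?php
// Hypothetical feed fragment: one description uses entities, the other CDATA.
// The href contains a literal "&amp;", as valid HTML requires inside attributes.
$xml = <<<XML
<rss><channel>
  <item><description>&lt;p&gt;&lt;a href="http://example.com/?a=1&amp;amp;b=2"&gt;Story&lt;/a&gt;&lt;/p&gt;</description></item>
  <item><description><![CDATA[<p><a href="http://example.com/?a=1&amp;b=2">Story</a></p>]]></description></item>
</channel></rss>
XML;

$rss = simplexml_load_string($xml);

$escaped = (string) $rss->channel->item[0]->description;
$cdata   = (string) $rss->channel->item[1]->description;

// Both forms yield the same HTML string: the XML parser has already
// decoded the entities once.
var_dump($escaped === $cdata);

// A second decode turns the HTML-level "&amp;" into a bare "&",
// so the result no longer matches the HTML carried by the feed.
$decodedTwice = htmlspecialchars_decode($escaped);
var_dump($decodedTwice === $escaped);
```

Whether the extra decode causes visible damage depends on the content. Descriptions without entities of their own pass through unchanged, which is probably why the problem is easy to miss.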

The best thing to do here is to use the var_dump() function on $item.

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss = simplexml_load_string($feed);
foreach ($rss->channel->item as $item) {
    var_dump($item);
    exit;
}

Once you do that you'll see that the value you are after is called "link". Therefore to print out the URL you will use the following code:

echo $item->link;
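For a self-contained picture of that approach, here is a minimal sketch using an invented one-item feed in place of the live Yahoo response. It assumes the item's <link> points at the same URL as the anchor inside the description, which may not hold for every feed.

```php
<?php
// Hypothetical one-item feed standing in for the Yahoo RSS response.
$xml = <<<XML
<rss version="2.0"><channel>
  <item>
    <title>Sample story</title>
    <link>http://example.com/sample-story.html</link>
    <description>&lt;p&gt;&lt;a href="http://example.com/sample-story.html"&gt;Sample&lt;/a&gt;&lt;/p&gt;</description>
  </item>
</channel></rss>
XML;

$rss = simplexml_load_string($xml);

$links = [];
foreach ($rss->channel->item as $item) {
    // <link> is a real XML child of <item>, so no HTML parsing is needed.
    $links[] = (string) $item->link;
}

print_r($links);
```

This only covers the item URL. Anything else inside the description's HTML, such as image sources, still needs an HTML parser.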


 