
Parsing HTML tags from inside XML in PHP

I'm trying to create my own RSS feed (for learning purposes) by parsing http://uk.news.yahoo.com/rss with simplexml_load_string in PHP. I'm stuck on reading the HTML tags inside the <description> tag.

My code so far looks like this:

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss = simplexml_load_string($feed);

//for each element in the feed
foreach ($rss->channel->item as $item) {
    echo '<h3>'. $item->title . '</h3>'; 

        foreach($item->description as $desc){

             //how to read the href from the a tag???

             //this does not work at all
             $tags = $item->xpath('//a');
             foreach ($tags as $tag) {
                 echo $tag['href'];
             }
       }
}

Any ideas on how to extract each HTML tag?

Thanks

The description content has its special characters encoded, so it is not treated as nodes within the XML; it is just a string. You can decode the special characters, then load the HTML into a DOMDocument and do whatever you need with it. For example:

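foreach ($rss->channel->item as $item) {
    echo '<h3>' . $item->title . '</h3>';

    foreach ($item->description as $desc) {
        // Decode the escaped markup, then parse it as HTML
        $dom = new DOMDocument();
        $dom->loadHTML(htmlspecialchars_decode((string)$desc));

        // Print the href of the first <a>, if the description has one
        $anchors = $dom->getElementsByTagName('a');
        if ($anchors->length > 0) {
            echo $anchors->item(0)->getAttribute('href');
        }
    }
}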

XPath is also available for use with DOMDocument; see DOMXPath.
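As a rough illustration of the DOMXPath route, the sketch below collects every href from one description. The `$descriptionHtml` string and its example.com URLs are made up for the example; the real feed's markup may look different.

```php
<?php
// Hypothetical sample of a <description> value; the real feed's markup may differ.
$descriptionHtml = '<p><a href="http://example.com/story-1.html"><img src="http://example.com/1.jpg" alt=""></a>'
    . 'Story text. <a href="http://example.com/story-2.html">More</a></p>';

$dom = new DOMDocument();

// Feed HTML is often not strictly valid; collect parser warnings instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($descriptionHtml);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// Select the href attribute of every <a> element, wherever it is nested.
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

echo implode("\n", $hrefs), "\n";
```

Unlike `$item->xpath('//a')` in the question, the query here runs against the parsed HTML document, not against the RSS XML, which is why it can find the anchors.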

The <description> element of the RSS feed contains HTML. As outlined in "How to parse CDATA HTML-content of XML using SimpleXML?", you need to get the node value of that element (the HTML) and parse it with an additional parser.

The accepted answer to the linked question already shows this in detail. For SimpleXML it makes little difference whether the RSS feed uses CDATA or just entities, as in your case.

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss  = simplexml_load_string($feed);
$dom  = new DOMDocument(); // the HTML parser used for descriptions' HTML

foreach ($rss->channel->item as $item)
{
    echo '<h3>' . $item->title . '</h3>', "\n";

    foreach ($item->description as $desc)
    {
        $dom->loadHTML($desc);

        $html = simplexml_import_dom($dom)->body;

        echo $html->p->a['href'], "\n";
    }
}

Example output:

...
<h3>Chantal nears hurricane strength in Caribbean</h3>
http://uk.news.yahoo.com/chantal-nears-hurricane-strength-caribbean-220149771.html
<h3>Placido Domingo In Hospital With Blood Clot</h3>
http://uk.news.yahoo.com/placido-domingo-hospital-blood-clot-215427742.html
<h3>Berlusconi's final tax fraud appeal hearing set for July 30</h3>
http://uk.news.yahoo.com/berlusconis-final-tax-fraud-appeal-hearing-set-july-214714122.html
<h3>China: Men Rescued From River Amid Floods</h3>
http://uk.news.yahoo.com/china-men-rescued-river-amid-floods-213005159.html
<h3>Snowden has not yet accepted asylum in Venezuela - WikiLeaks</h3>
http://uk.news.yahoo.com/snowden-not-yet-accepted-asylum-venezuela-wikileaks-190332291.html
<h3>Three US kidnap victims break silence</h3>
http://uk.news.yahoo.com/three-us-kidnap-victims-release-thankyou-video-093832611.html
...

Hope this helps. Contrary to the accepted answer, I see no reason to apply htmlspecialchars_decode; in fact I'm pretty sure it breaks things. My example also shows how to stay within the SimpleXML way of accessing further children, by turning the DOMNode back into a SimpleXMLElement once the HTML has been parsed.
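To make the point about CDATA versus entities, and about the extra decoding step, more concrete, here is a small sketch. The two `<item>` snippets are invented for illustration and are not taken from the Yahoo feed.

```php
<?php
// Two hypothetical <description> encodings of the same HTML: entity-escaped and CDATA.
$escaped = simplexml_load_string(
    '<item><description>&lt;p&gt;Use &amp;lt;b&amp;gt; for bold&lt;/p&gt;</description></item>'
);
$cdata = simplexml_load_string(
    '<item><description><![CDATA[<p>Use &lt;b&gt; for bold</p>]]></description></item>'
);

// Casting to string yields the node value with the XML-level escaping already removed.
$fromEscaped = (string) $escaped->description;
$fromCdata   = (string) $cdata->description;

var_dump($fromEscaped === $fromCdata);

// A second decoding pass turns the literal "&lt;b&gt;" text into a real <b> tag.
$decodedTwice = htmlspecialchars_decode($fromEscaped);
echo $fromEscaped, "\n", $decodedTwice, "\n";
```

Both encodings give the same HTML string once cast, so the string can go straight into loadHTML. Running htmlspecialchars_decode on top of that changes the markup whenever the HTML itself contains escaped characters.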

The best thing to do here is to use the var_dump() function on $item.

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss = simplexml_load_string($feed);
foreach ($rss->channel->item as $item) {
    var_dump($item);
    exit;
}

Once you do that, you'll see that the value you are after is called "link". So, to print out the URL, use the following code:

echo $item->link;
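For a version that runs without network access, the sketch below reads `link` from a small inline feed. The feed content and its example.com URLs are hypothetical stand-ins for the live Yahoo feed.

```php
<?php
// Hypothetical, minimal RSS document standing in for the live feed.
$feed = <<<XML
<rss version="2.0">
  <channel>
    <title>Sample</title>
    <item>
      <title>First story</title>
      <link>http://example.com/first.html</link>
      <description>&lt;p&gt;&lt;a href="http://example.com/first.html"&gt;First&lt;/a&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second.html</link>
      <description>&lt;p&gt;Second&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($feed);

// Collect each item's <link> value; the cast turns the SimpleXMLElement into a plain string.
$links = [];
foreach ($rss->channel->item as $item) {
    $links[] = (string) $item->link;
}

echo implode("\n", $links), "\n";
```

This only covers the case where the URL you want is the item's own link; anchors that appear solely inside the description HTML still need an HTML parser.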
