Extracting content from HTML using PHP

Question

Here is my HTML file:

<html>
  <head>    
    <link href='http://wendyandgabe.blogspot.com/favicon.ico' rel='icon' type='image/x-icon'/>
    <link href='http://wendyandgabe.blogspot.com/' rel='canonical'/>
    <link rel="alternate" type="application/atom+xml" title="O&#39; Happy Day! - Atom" href="http://wendyandgabe.blogspot.com/feeds/posts/default" />
    <link rel="alternate" type="application/rss+xml" title="O&#39; Happy Day! - RSS" href="http://wendyandgabe.blogspot.com/feeds/posts/default?alt=rss" />
    <link rel="service.post" type="application/atom+xml" title="O&#39; Happy Day! - Atom" href="http://www.blogger.com/feeds/5390468261501503598/posts/default" />
  </head>
  <body>
  </body>
</html>

I want to extract the URL from the href attribute of the <link> element whose type is "application/rss+xml" in the HTML file above. How can I do this? Can anybody show some example code?

Answer 1

You can use DOMDocument (http://php.net/manual/de/class.domdocument.php) and DOMXPath (http://de3.php.net/manual/de/class.domxpath.php):

$html = <<<EOF
<html>
  <head>    
    <link href='http://wendyandgabe.blogspot.com/favicon.ico' rel='icon' type='image/x-icon'/>
    <link href='http://wendyandgabe.blogspot.com/' rel='canonical'/>
    <link rel="alternate" type="application/atom+xml" title="O&#39; Happy Day! - Atom" href="http://wendyandgabe.blogspot.com/feeds/posts/default" />
    <link rel="alternate" type="application/rss+xml" title="O&#39; Happy Day! - RSS" href="http://wendyandgabe.blogspot.com/feeds/posts/default?alt=rss" />
    <link rel="service.post" type="application/atom+xml" title="O&#39; Happy Day! - Atom" href="http://www.blogger.com/feeds/5390468261501503598/posts/default" />
  </head>
  <body>
  </body>
</html>
EOF;

$xml = new DOMDocument;
$xml->loadHTML($html);

// create an XPath instance
$xpath = new DOMXPath($xml);

// query for <link type="application/rss+xml"> and use the first item found
$link = $xpath->query('//link[@type="application/rss+xml"]')->item(0);

// item(0) returns null when nothing matches, so check before reading the attribute
if ($link !== null) {
    var_dump($link->getAttribute('href'));
}

Demo: http://3v4l.org/PkH8n
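If a page may advertise more than one feed, or the markup is not clean, the same DOMDocument/DOMXPath approach can be wrapped in a small helper. The following is only a sketch: the function name findLinkHrefs and the shortened sample markup are assumptions made for illustration, not part of the answer above. It compares the type attribute in PHP rather than building the XPath string from user input, and it silences libxml warnings, which loadHTML tends to emit for real-world HTML.

```php
<?php
// Hypothetical helper: returns the href of every <link> whose type
// attribute equals $type, or an empty array when nothing matches.
function findLinkHrefs(string $html, string $type): array
{
    // loadHTML rejects an empty string, so bail out early.
    if ($html === '') {
        return [];
    }

    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $hrefs = [];
    // Select only <link> elements that carry both attributes,
    // then compare the type in PHP to avoid XPath quoting issues.
    foreach ($xpath->query('//link[@type][@href]') as $link) {
        if ($link->getAttribute('type') === $type) {
            $hrefs[] = $link->getAttribute('href');
        }
    }
    return $hrefs;
}

// Shortened version of the markup from the question.
$sample = '<html><head>
<link rel="alternate" type="application/atom+xml" href="http://wendyandgabe.blogspot.com/feeds/posts/default" />
<link rel="alternate" type="application/rss+xml" href="http://wendyandgabe.blogspot.com/feeds/posts/default?alt=rss" />
</head><body></body></html>';

$feeds = findLinkHrefs($sample, 'application/rss+xml');
print_r($feeds);
```

Returning an array means the caller decides what to do when there are zero or several matches, instead of assuming exactly one.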

Answer 2

You can try the PHP class DOMDocument:

http://php.net/manual/en/domdocument.loadhtml.php
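Since this answer gives no code, here is a minimal sketch of how DOMDocument alone (without XPath) might be used for this task. The variable names and the shortened sample markup are assumptions for illustration. It loads the HTML, walks the <link> elements, and stops at the first one whose type is application/rss+xml.

```php
<?php
// Shortened version of the markup from the question.
$page = <<<'EOF'
<html>
  <head>
    <link href="http://wendyandgabe.blogspot.com/" rel="canonical"/>
    <link rel="alternate" type="application/atom+xml" href="http://wendyandgabe.blogspot.com/feeds/posts/default" />
    <link rel="alternate" type="application/rss+xml" href="http://wendyandgabe.blogspot.com/feeds/posts/default?alt=rss" />
  </head>
  <body>
  </body>
</html>
EOF;

// Keep libxml quiet about imperfect markup.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page);
libxml_clear_errors();

// Walk all <link> elements and keep the first RSS one.
$rssUrl = null;
foreach ($dom->getElementsByTagName('link') as $node) {
    if ($node->getAttribute('type') === 'application/rss+xml') {
        $rssUrl = $node->getAttribute('href');
        break;
    }
}

echo $rssUrl ?? 'No RSS link found', PHP_EOL;
```

getAttribute returns an empty string for a missing attribute, so <link> elements without a type are simply skipped by the comparison.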

Answer 3

Using PHP Simple HTML DOM Parser, here's how:

// Include the Simple HTML DOM Parser library
include "simple_html_dom.php";

$text = '<html>
  <head>    
    <link href="http://wendyandgabe.blogspot.com/favicon.ico" rel="icon" type="image/x-icon"/>
    <link href="http://wendyandgabe.blogspot.com/" rel="canonical"/>
    <link rel="alternate" type="application/atom+xml" title="O&#39; Happy Day! - Atom" href="http://wendyandgabe.blogspot.com/feeds/posts/default" />
    <link rel="alternate" type="application/rss+xml" title="O&#39; Happy Day! - RSS" href="http://wendyandgabe.blogspot.com/feeds/posts/default?alt=rss" />
    <link rel="service.post" type="application/atom+xml" title="O&#39; Happy Day! - Atom" href="http://www.blogger.com/feeds/5390468261501503598/posts/default" />
  </head>
  <body>
  </body>
</html>';


//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($text);


// Find the link with the appropriate selectors
$link = $html->find('link[type=application/rss+xml]', 0);


// Find succeeded
if ($link) {
    $href = $link->href;
    echo $href;
} else {
    echo "Find function failed!";
}


// Clear the DOM object to free memory (mainly needed when parsing many documents)
$html->clear();
unset($html);

OUTPUT 
======
http://wendyandgabe.blogspot.com/feeds/posts/default?alt=rss


3 answers

Answer 1
2 2013-10-23 06:52:15

Answer 2
0 2013-10-23 06:40:13

Answer 3
0 2013-10-23 10:53:22
