
How do I extract this value using PHP DOM?

I have an HTML file; this is just a part of it:

<div id="result" >
    <div class="res_item" id="1" h="63c2c439b62a096eb3387f88465d36d0">
        <div class="res_main">
            <h2 class="res_main_top">
                <img 
                    src="/ff/gigablast.com.png" 
                    alt="favicon for gigablast.com" 
                    width=16 
                    height=16
                    />&nbsp;
                <a 
                    href="http://www.gigablast.com/" 
                    rel="nofollow"
                    >
                    Gigablast
                </a>
                <div class="res_main">
                    <h2 class="res_main_top">
                        <img 
                            src="/ff/ask.com.png" 
                            alt="favicon for ask.com" 
                            width=16 
                            height=16
                            />&nbsp;
                        <a 
                            href="http://ask.com/" rel="nofollow"
                            >
                            Ask.com - What&#039;s Your Question?
                        </a>....

I want to extract only the URLs (for example, http://www.gigablast.com and http://ask.com/; there are at least 10 URLs in that HTML) from the markup above using PHP's DOMDocument. I have gotten this far but don't know how to proceed:

$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');

$data = $doc->getElementById('result');

What next? The links are nested inside other tags, so I don't think I can use $data->getElementsByTagName() here.

You can call getElementsByTagName on a DOMElement object:

$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');

$result = $doc->getElementById('result');
$anchors = $result->getElementsByTagName('a');

$urls = array();
foreach ($anchors as $a) {
    $urls[] = $a->getAttribute('href');
}

If you want to get image sources as well, that would be easy to add.
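
As a rough sketch of that addition, the same pattern can be repeated for img elements. The inline $html string below is a shortened, hypothetical stand-in for urllist.html, and the libxml_use_internal_errors() call is an assumption on my part, added only to keep parser warnings quiet on loosely written markup:

```php
<?php
// Hypothetical sample markup modeled on the question's snippet
// (the real input would be urllist.html).
$html = <<<'HTML'
<div id="result">
  <div class="res_main">
    <h2 class="res_main_top">
      <img src="/ff/gigablast.com.png" alt="favicon for gigablast.com" width="16" height="16">
      <a href="http://www.gigablast.com/" rel="nofollow">Gigablast</a>
    </h2>
  </div>
  <div class="res_main">
    <h2 class="res_main_top">
      <img src="/ff/ask.com.png" alt="favicon for ask.com" width="16" height="16">
      <a href="http://ask.com/" rel="nofollow">Ask.com</a>
    </h2>
  </div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$result = $doc->getElementById('result');

// Link targets, as in the answer above.
$urls = array();
foreach ($result->getElementsByTagName('a') as $a) {
    $urls[] = $a->getAttribute('href');
}

// Image sources, collected the same way from the img elements.
$images = array();
foreach ($result->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
}

print_r($urls);
print_r($images);
```

With the real file, replacing loadHTML($html) with loadHTMLFile('urllist.html') should be the only change needed.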

If you are just trying to extract the href attribute of all a tags in the document (and the <div id="result"> doesn't matter), you could use this:

$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$anchors = $doc->getElementsByTagName('a');
$urls = array();
foreach($anchors as $anchor) {
    $urls[] = $anchor->getAttribute('href'); // attributes is a DOMNamedNodeMap, so ->attributes->href does not work
}

// $urls is your collection of urls in the original document.

Using XPath to narrow the selection to <a> elements inside the <div class="res_main"> elements:

$doc = new DomDocument();
$doc->loadHTMLFile('urllist.html');
$xpath = new DomXpath($doc);

$query = '//div[@class="res_main"]//a';
$nodes = $xpath->query($query);

$urls = array();

foreach ($nodes as $node) {
    $href = $node->getAttribute('href');
    if (!empty($href)) {
        $urls[] = $href;
    }
}

This avoids picking up every <a> element in the document: the query keeps only the links you want and skips navigation links and the like.
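
One possible refinement, sketched here under assumptions rather than taken from the answers: @class="res_main" only matches when the class attribute is exactly that string, so a div with class="res_main highlighted" would be missed. The query below matches res_main as one class token among several, scopes the search to the div with id="result", and selects the href attribute nodes directly. The sample markup, the "highlighted" class, and the navigation link are all hypothetical.

```php
<?php
// Hypothetical sample: one navigation link outside #result, two results inside it.
$html = <<<'HTML'
<div id="nav"><a href="/about">About</a></div>
<div id="result">
  <div class="res_main">
    <h2 class="res_main_top"><a href="http://www.gigablast.com/" rel="nofollow">Gigablast</a></h2>
  </div>
  <div class="res_main highlighted">
    <h2 class="res_main_top"><a href="http://ask.com/" rel="nofollow">Ask.com</a></h2>
  </div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep warnings about imperfect markup out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match "res_main" as a whole class token, only below the div with id="result",
// and select the href attribute nodes themselves.
$query = '//div[@id="result"]'
       . '//div[contains(concat(" ", normalize-space(@class), " "), " res_main ")]'
       . '//a/@href';

$urls = array();
foreach ($xpath->query($query) as $attr) {
    $urls[] = $attr->nodeValue; // the value of the href attribute
}

print_r($urls);
```

Selecting //a/@href also skips anchors that have no href attribute at all, so the empty() check from the loop above is only needed if empty href values should be dropped too.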
