PHP Simple HTML DOM parser

I am working on a simple web crawler. Below is the simple HTML I am using to learn.

input.php

<ul id="nav">
    <li>
        <a href="www.google.com">Google</a>
        <ul>
            <li>
                <a href="mail.gmail.com">Gmail</a>
            </li>
        </ul>
    </li>
    <li>
        <a href="www.yahoo.com">Yahoo</a>
        <ul>
            <li>
                <a href="mail.yahoo.com">Yahoo Mail</a>
            </li>
        </ul>
    </li>
</ul>

I need to crawl the first anchor tag in each ul[id=nav]->li. The code I used to crawl input.php is:

<?php
    include 'simple_html_dom.php';
    $html = file_get_html('input.php');

    foreach ($html->find('ul[id=nav]') as $navUL){
        foreach ($navUL->find('li') as $navUL_LI){
            echo $navUL_LI->find('a',0)->outertext."<br>";              
        }
    }
?>

It displays all the anchor tags in my input.php. I need to display only Google and Yahoo. How can I achieve this?

<?php
    include 'simple_html_dom.php';
    $html = file_get_html('input.php');

    foreach ($html->find('ul[id=nav]') as $navUL){
        foreach ($navUL->find('li') as $navUL_LI){
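            // Check for both site names; !== false is needed because strpos() can return 0.
            // Note: the nested "Yahoo Mail" item also contains "yahoo", so it still matches.
            if(strpos($navUL_LI,'google') !== false || strpos($navUL_LI,'yahoo') !== false){
                echo $navUL_LI->find('a',0)->outertext."<br>";
            }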

        }
    }
?>

In this case you can target them directly with the children() method, which returns only direct children. Example:

foreach($html->find('ul#nav') as $ul) {
    foreach($ul->children() as $li) {
        echo $li->children(0)->outertext . '<br/>';
    }
}

Alternatively, you can use DOMDocument + DOMXPath for this too:

// $str is assumed to hold the HTML markup as a string
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
// directly target those links
$links = $xpath->query('//ul[@id="nav"]/li/a');

foreach($links as $a) {
    echo $a->nodeValue . '<br/>';
}
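
For reference, here is a self-contained sketch of the same XPath approach run against the markup from the question. The helper name topLevelLinks() and the inline $str heredoc are assumptions made for illustration; only the DOMDocument and DOMXPath calls come from PHP's DOM extension.

```php
<?php
// Assumption: the markup from input.php is available as a string.
$str = <<<HTML
<ul id="nav">
    <li>
        <a href="www.google.com">Google</a>
        <ul>
            <li><a href="mail.gmail.com">Gmail</a></li>
        </ul>
    </li>
    <li>
        <a href="www.yahoo.com">Yahoo</a>
        <ul>
            <li><a href="mail.yahoo.com">Yahoo Mail</a></li>
        </ul>
    </li>
</ul>
HTML;

// topLevelLinks() is a hypothetical helper name, not part of any library.
function topLevelLinks(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $links = [];
    // "/" selects direct children only, so the nested mail links are skipped.
    foreach ($xpath->query('//ul[@id="nav"]/li/a') as $a) {
        $links[trim($a->nodeValue)] = $a->getAttribute('href');
    }
    return $links;
}

foreach (topLevelLinks($str) as $text => $href) {
    echo $text . ' => ' . $href . "\n";
}
```

Because the path uses /li/a rather than //a, only anchors that are direct children of the top-level list items are returned.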

I have done the same work in Objective-C.

You can use the XML or HTML APIs to serialize your HTML object.

If you want to do this by hand, find the opening tag and the closing tag.

After this, get the first child, then the second, and so on.
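
The answer above refers to Objective-C, but the "first child, then the second, and so on" walk can be sketched in PHP as well. This is only an illustration, assuming PHP's built-in DOM extension does the tag matching; elementChildren() and $markup are hypothetical names introduced here.

```php
<?php
// Hypothetical helper: returns the element children of $node in document order.
function elementChildren(DOMNode $node): array
{
    $children = [];
    // Start at the first child, then step to the next sibling, and so on.
    for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
        if ($child->nodeType === XML_ELEMENT_NODE) { // skip text nodes
            $children[] = $child;
        }
    }
    return $children;
}

// Assumption: the same markup as input.php, written on two lines.
$markup = '<ul id="nav"><li><a href="www.google.com">Google</a><ul><li><a href="mail.gmail.com">Gmail</a></li></ul></li>'
        . '<li><a href="www.yahoo.com">Yahoo</a><ul><li><a href="mail.yahoo.com">Yahoo Mail</a></li></ul></li></ul>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($markup);
libxml_clear_errors();

// The first <ul> in document order is the #nav list.
$nav = $dom->getElementsByTagName('ul')->item(0);

$names = [];
foreach (elementChildren($nav) as $li) {
    // Only look at the first element child of each top-level <li>.
    $first = elementChildren($li)[0] ?? null;
    if ($first !== null && $first->nodeName === 'a') {
        $names[] = $first->nodeValue;
        echo $first->nodeValue . "\n";
    }
}
```

Walking only direct children is what keeps the nested Gmail and Yahoo Mail links out of the result.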

You can achieve that simply with:

<?php
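    foreach ($html->find('ul[id=nav]') as $navUL){
        foreach ($navUL->find('li') as $navUL_LI){
            // Index -2 is the second-to-last anchor. Nested items hold only one
            // anchor, so find() returns null for them and they are skipped.
            $a = $navUL_LI->find('a',-2);
            if ($a) {
                echo $a->outertext."<br>";
            }
        }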
    }
?>

Try this:

// get the children of the element #nav, i.e. the top level lis
// getElementById() takes the bare id, without the leading "#"
$lis = $html->getElementById("nav")->childNodes();
// for each child, find the first 'a' element
foreach ($lis as $li) {
    $a = $li->find('a',0);
    // retrieve the link text itself.
    echo "link text: " . $a->innertext() . "\n";
}

See the simple-html-dom manual for details of all these methods.

<?php
$in = '<style>      .catalog-product-view .product.attribute.overview ul {         margin-top: 10px;     } </style><img src="/media/wysiwyg/img/misc/made-in-the-usa-doh-blue4.png"><ul><li>Ships as (12) 40 fl oz bottles</li></ul>';

function parseTags($input, $callback) {
    $len = strlen($input);
    $stack = [];

    $tag = "";
    $data = "";
    $isTag = false;
    $isString = false;
    for ($i=0; $i<$len; $i++) {
       $char = $input[$i];
       if ($char == '<') {
           $isTag = true;
           $tag .= $char;
       } else if ($char == '>') {
           $tag .= $char;
           if (substr($tag, 0, 2) == '</') {
               // Limit 2 splits off the tag name; a limit of 1 would return the whole tag, attributes included.
               $close = str_replace('>', '', str_replace('</', '', explode(' ', $tag, 2)[0]));
               $open = str_replace('>', '', str_replace('<', '', explode(' ', end($stack), 2)[0]));
               if ($open == $close) {
                   $callback($tag, $data, $stack, $i, false);
                   array_pop($stack);
               }
           } else if (substr($tag, -2) == '/>') {
               $callback($tag, $data, $stack, $i, false);
           } else {
               $callback($tag, $data, $stack, $i, true);
               $stack[] = $tag;
           }
           $tag = "";
           $data = "";
           $isTag = false;
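       } else if ($char == '"' || $char == "'") {
           if ($isString == false) {
               $isString = $char;
           } else if ($isString == $char && $input[$i-1] != '\\') {
               $isString = false;
           }
           // Keep the quote character itself so attribute values and text stay intact.
           if ($isTag) { $tag .= $char; } else { $data .= $char; }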
       } else if ($isTag) { 
           $tag .= $char; 
       } else { 
           $data .= $char; 
       }
    }
}

parseTags($in, function($tag, $data, $stack, $position, $isOpen) {
    print_r(func_get_args());
});
