How to scrape HTML tags from a div using PHP Simple HTML DOM or cURL

Here is an example of what I want to do:

<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>

From the above example I would like to scrape both the tags and their text into arrays. The result should be one array of tag names, arr = [h1, p, h2], and another of their contents, arr2 = [This is a h1, This is a Paragraph, This is h2].

Assuming the element names are known, you could use DOMDocument's getElementsByTagName() like this:

$html = "<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>";
$doc = new DOMDocument();
$doc->loadHTML($html);
$elements = array();
$content = array();
// Collects the text of every element matching the given tag names.
// Results are grouped by tag name, in the order the names are passed in.
function iterate_elements($array, $doc){
     global $elements, $content;
     foreach($array as $element){
          $the_element = $doc->getElementsByTagName($element);
          foreach($the_element as $target){
               $content[] = $target->textContent;
          }
          // Record the tag name only if at least one such element exists.
          if($the_element->length > 0) {
               $elements[] = $element;
          }
     }
}
iterate_elements(array('h1','p', 'h2'), $doc);
print_r($elements);
print_r($content);

Demo: https://eval.in/825860
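
If the tag names are not known in advance, one variation is to walk the div's element children directly, which also keeps the results in document order. The following is only a sketch under a few assumptions: the target div's class attribute is exactly `room`, and the helper name `scrape_children` is made up for illustration.

```php
<?php
// Hypothetical helper: returns [tags, texts] for the element children of
// every div whose class attribute is exactly $class.
function scrape_children(string $html, string $class): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $tags = [];
    $texts = [];
    // "/*" selects element children only, so the whitespace text nodes
    // between the tags are skipped. $class is assumed to be a trusted value.
    foreach ($xpath->query("//div[@class='$class']/*") as $node) {
        $tags[] = $node->nodeName;
        $texts[] = trim($node->textContent);
    }
    return [$tags, $texts];
}

$html = "<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>";

[$arr, $arr2] = scrape_children($html, 'room');
print_r($arr);  // h1, p, h2
print_r($arr2); // This is a h1, This is a Paragraph, This is h2
```

Unlike the getElementsByTagName() version, this does not need a list of tag names, but it only looks at direct children of the div.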

Try this:

$str = "<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>";

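// Split on any line ending so the code does not depend on the platform's PHP_EOL.
$arr = preg_split('/\R/', $str);

$res = array();
foreach($arr as $row){
    // Skip the opening and closing div lines.
    if(strpos($row, "div") === false){
        // Key: the tag name between "<" and the first ">"; value: the text.
        $res[substr($row, 1, strpos($row, ">")-1)] = strip_tags($row);
    }
}

var_dump($res);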

https://3v4l.org/8TkIT

It loops through the markup one line at a time and builds an array keyed by tag name.
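
Since the question asks for two separate arrays, the keyed result can be split with array_keys() and array_values(). A small sketch, with $res written out by hand as the loop above would produce it for the sample input:

```php
<?php
// $res as built by the line-by-line loop for the sample input.
$res = [
    'h1' => 'This is a h1',
    'p'  => 'This is a Paragraph',
    'h2' => 'This is h2',
];

$arr  = array_keys($res);   // tag names
$arr2 = array_values($res); // text content

print_r($arr);  // h1, p, h2
print_r($arr2); // This is a h1, This is a Paragraph, This is h2
```

Keep in mind that keying by tag name means a repeated tag (for example two p lines in the same div) would overwrite the earlier entry.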

Edit: if there is more than one room, you can make the result multidimensional like this:
https://3v4l.org/DdXVd

$str = "<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>
<div class='room2'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>";

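// Split on any line ending so the code does not depend on the platform's PHP_EOL.
$arr = preg_split('/\R/', $str);

$res = array();
foreach($arr as $row){
    if(strpos($row, "<div") !== false){
        // Opening div: the room name is the class value between the single quotes.
        $pos1 = strpos($row, "'")+1;
        $room = substr($row, $pos1, strpos($row, "'", $pos1)-$pos1);
    }elseif(strpos($row, "</div>") === false){
        // Any other tag: key is the tag name, value is the trimmed text.
        $pos1 = strpos($row, "<")+1;
        $res[$room][substr($row, $pos1, strpos($row, ">")-$pos1)] = trim(strip_tags($row));
    }
}

var_dump($res);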
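
The line-by-line approach relies on each tag sitting on its own line. As a hedged alternative, the same multidimensional result could be built with DOMDocument, which does not care how the markup is laid out. This sketch assumes each room is a top-level div and uses its class attribute as the key; the markup below is the sample from above with several tags deliberately placed on one line.

```php
<?php
$html = "<div class='room'><h1>This is a h1</h1><p>This is a Paragraph</p>
<h2>This is h2</h2></div>
<div class='room2'><h1>This is a h1</h1><p>This is a Paragraph</p><h2>This is h2</h2></div>";

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$res = [];
foreach ($doc->getElementsByTagName('div') as $div) {
    $room = $div->getAttribute('class');
    foreach ($div->childNodes as $child) {
        // Skip whitespace text nodes between the tags.
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        // As in the string version, a repeated tag overwrites the earlier one.
        $res[$room][$child->nodeName] = trim($child->textContent);
    }
}

var_dump($res);
```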

Try the code below.

$html = "<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>";

$dom = new SimpleXMLElement( $html );

$values = array_filter( array_values( (array) $dom ), function ( $i ) { return ! is_array( $i ); } );
$keys = array_filter( array_keys( (array) $dom ), function ( $i ) { return $i != '@attributes'; } );

print_r( $values ); // This is a h1, This is a Paragraph, This is h2
print_r( $keys ); // h1, p, h2

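I used array_filter to remove the '@attributes' entry, which holds the div's own attributes, from the result. Note that SimpleXMLElement only accepts well-formed XML, so this will fail on typical real-world HTML.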
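
One limitation of the array cast is that repeated tags are grouped under a single key, so a div with two p children would not come out as separate entries. A possible alternative, sketched here for the same sample input, is to iterate children() and read each name with getName(), which keeps every child in document order:

```php
<?php
$html = "<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>";

$dom = new SimpleXMLElement($html);

$keys = [];
$values = [];
// children() yields only child elements, so attributes need no filtering.
foreach ($dom->children() as $child) {
    $keys[] = $child->getName();
    $values[] = trim((string) $child);
}

print_r($keys);   // h1, p, h2
print_r($values); // This is a h1, This is a Paragraph, This is h2
```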

$str = <<<EOF
<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>
EOF;

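// Requires the PHP Simple HTML DOM Parser library (simple_html_dom.php) to be included.
$html = str_get_html($str);

$arr = $arr2 = array();
foreach($html->find('.room *') as $el){
  $arr[] = $el->tag;      // tag name, e.g. h1
  $arr2[] = $el->text();  // text content
}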
