Simple HTML DOM Parser scrape divs

I am trying to scrape some data, with Simple HTML DOM Parser, from a page that has the following structure:

    <div class='image'>
        <img class='a' src='1.jpg'>
    </div>
    <div class='data'>
        lorem ipsum 1
    </div>
    <div class='data'>
        lorem ipsum 2
    </div>
    <div class='data'>
        lorem ipsum 3
    </div>

    <div class='image'>
        <img class='a' src='2.jpg'>
    </div>
    <div class='data'>
        lorem ipsum 4
    </div>

    <div class='image'>
        <img class='a' src='3.jpg'>
    </div>
    <div class='data'>
        lorem ipsum 5
    </div>
    <div class='data'>
        lorem ipsum 6
    </div>

I can easily get all the data. My problem is that I cannot associate each image with the data divs that follow it, because the divs are siblings rather than nested.

I need to associate image 1.jpg with data 1, 2 and 3; image 2.jpg with data 4; and image 3.jpg with data 5 and 6.

The number of data divs between two image divs varies.

Is there a way to count the divs between two divs with class image, even though they are not nested?

I apologize if the question seems complicated, but I assure you the question is very simple if you look at it carefully.

You could walk the divs in order with a foreach loop. If a div has the class image, increment the grouping key and store the image source under it; if it has the class data, push its contents onto the group for the current key.

Rough example:

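    // Assumes simple_html_dom.php is on the include path and that
    // $html_markup already holds the page's HTML as a string.
    include 'simple_html_dom.php';

    $data = array();
    $html = str_get_html($html_markup);
    $current_key = 0;

    // Walk the divs in the order they appear in the page.
    foreach ($html->find('div') as $div) {
        if ($div->class == 'image') {
            // An image div starts a new group.
            $current_key++;
            $data[$current_key]['image'] = $div->find('img', 0)->src;
        } elseif ($div->class == 'data') {
            // A data div belongs to the most recent image div.
            $data[$current_key]['data'][] = $div->innertext;
        }
    }

    echo '<pre>';
    print_r($data);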

The data should be grouped something like this:

    Array
    (
        [1] => Array
        (
            [image] => 1.jpg
            [data] => Array
            (
                [0] =>      lorem ipsum 1 
                [1] =>      lorem ipsum 2 
                [2] =>      lorem ipsum 3 
            )
        )

        [2] => Array
        (
            [image] => 2.jpg
            [data] => Array
            (
                [0] =>      lorem ipsum 4 
            )
        )

        [3] => Array
        (
            [image] => 3.jpg
            [data] => Array
            (
                [0] =>      lorem ipsum 5 
                [1] =>      lorem ipsum 6 
            )
        )
    )
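
The question also asks how to count the divs between two image divs. The following is a minimal sketch of the same grouping idea that reports those counts. It is not the code from the answer: it swaps Simple HTML DOM for PHP's bundled DOMDocument and DOMXPath classes so that it runs without extra files, the function name groupByImage() is made up for illustration, and the sample markup is a condensed copy of the structure in the question.

```php
<?php
// Sketch only: uses PHP's bundled DOM extension, not Simple HTML DOM.
// groupByImage() is a hypothetical helper name.
function groupByImage(string $markup): array
{
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($markup);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $groups = [];
    $current = -1;

    // Visit the image and data divs in document order.
    foreach ($xpath->query("//div[@class='image' or @class='data']") as $div) {
        if ($div->getAttribute('class') === 'image') {
            // An image div starts a new group.
            $img = $xpath->query('.//img', $div)->item(0);
            $groups[] = [
                'image' => $img ? $img->getAttribute('src') : null,
                'data'  => [],
            ];
            $current = count($groups) - 1;
        } elseif ($current >= 0) {
            // A data div joins the most recent image's group.
            // Data divs before the first image div are skipped.
            $groups[$current]['data'][] = trim($div->textContent);
        }
    }

    return $groups;
}

$markup = <<<HTML
<div class='image'><img class='a' src='1.jpg'></div>
<div class='data'>lorem ipsum 1</div>
<div class='data'>lorem ipsum 2</div>
<div class='data'>lorem ipsum 3</div>
<div class='image'><img class='a' src='2.jpg'></div>
<div class='data'>lorem ipsum 4</div>
<div class='image'><img class='a' src='3.jpg'></div>
<div class='data'>lorem ipsum 5</div>
<div class='data'>lorem ipsum 6</div>
HTML;

// Print how many data divs follow each image div.
foreach (groupByImage($markup) as $group) {
    echo $group['image'], ': ', count($group['data']), " data divs\n";
}
```

Because each group keeps its data in an array, count() on that array gives the number of data divs between one image div and the next. Unlike the loop above, this sketch trims the surrounding whitespace from each text value and ignores data divs that appear before any image div; both are design choices you may want to change.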


 