
Scraping text generated by script with PHP Simple HTML DOM Parser

I am trying to get the following text "Huggies Pure Baby Wipes 4 x 64 per pack" shown in the code below.

<div class="offerList-item-description-title">
    <div id="result-title-5" class="offerList-item-description-title">
        <script type="text/javascript">
            document.write(getContents('wF8UD9Jj8:6D !FC6 q23J (:A6D c I ec A6C A24\<'));
        </script>Hug­gies Pure Baby Wipes 4 x 64 per pack
    </div>
</div>

I have tried using code such as:

foreach ($element->find('.offerList-item-description-title') as $title)
{
    foreach ($title->find('text') as $text) {
        echo $text;
    }
}

But I just get an empty string back. Any suggestions?

Thanks.

The HTML returned to your scraper does not contain JavaScript-rendered content. In your case the text is generated by JavaScript, which is why you are getting an empty response. What you need is a headless browser such as PhantomJS; you can use the PHP wrapper for PhantomJS: http://jonnnnyw.github.io/php-phantomjs/ .

This should solve your problem. It has the following features:

  • Load webpages through the PhantomJS headless browser
  • View detailed response data including page content, headers, status code etc.
  • Handle redirects
  • View JavaScript console errors

Hope this helps.

I'm not sure what code you're using in your example (and I suspect the result of the getContents function gets in the way of your method for retrieving the text), but if you wrap the text you're after in a <span> like so:

<div class="offerList-item-description">
    <div id="result-title-5" class="offerList-item-description-title">
        <script type="text/javascript">
            document.write(getContents('wF8UD9Jj8:6D !FC6 q23J (:A6D c I ec A6C A24\<'));
        </script><span>Hug­gies Pure Baby Wipes 4 x 64 per pack</span>
    </div>
</div>

you can retrieve it using JavaScript:

<script>
    var $title = document.getElementsByClassName("offerList-item-description-title");
    for (var i = 0; i < $title.length; i++) {
        var span = $title[i].getElementsByTagName("span");
        var $text = span[0].innerText || span[0].textContent;
        //echo $text;
        console.log("==> " + $text);
    }
</script>
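
As a side note on the getContents call in the question: its argument looks like the title encoded with ROT47, because rotating each printable ASCII character by 47 positions turns it into the text that appears on the page. The sketch below assumes that getContents is simply a ROT47 decoder. That is inferred from this one sample string, not confirmed from the site's own script, and rot47 is a hypothetical helper name. If the assumption holds, the same rotation could be applied to the script argument on the scraper side without running a browser.

```javascript
// Assumption: the page's getContents() is a ROT47 decoder. This is inferred
// from the sample argument in the question, not from the site's actual code.
function rot47(input) {
    let out = "";
    for (const ch of input) {
        const code = ch.charCodeAt(0);
        // ROT47 rotates the 94 printable ASCII characters '!' (33) to '~' (126) by 47.
        if (code >= 33 && code <= 126) {
            out += String.fromCharCode(33 + ((code - 33 + 47) % 94));
        } else {
            out += ch; // spaces and non-ASCII characters pass through unchanged
        }
    }
    return out;
}

// The argument from the question. In a JavaScript string literal, '\<' is just '<'.
const encoded = 'wF8UD9Jj8:6D !FC6 q23J (:A6D c I ec A6C A24\<';

// The decoded value is an HTML fragment containing a soft-hyphen entity;
// strip the entity to get the plain title.
const decodedHtml = rot47(encoded);
const title = decodedHtml.replace(/&shy;/g, "");

console.log(decodedHtml); // Hug&shy;gies Pure Baby Wipes 4 x 64 per pack
console.log(title);       // Huggies Pure Baby Wipes 4 x 64 per pack
```

ROT47 is its own inverse, so the same function both encodes and decodes. The &shy; entity in the decoded fragment would also explain the invisible soft hyphen inside "Huggies" in the rendered text.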
