PHP cURL and DOM: how to extract content together with its styles

The title may be unclear. What I want to achieve is to copy all the content of a specific div from an existing webpage (not owned by me). The code below already extracts the content successfully.

Extractor code:

    // Get Data
    $curl_handle=curl_init();
    curl_setopt($curl_handle, CURLOPT_URL,'http://au.creative.com/p/speakers/creative-t4-wireless');
    curl_setopt($curl_handle, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4 );
    curl_setopt($curl_handle, CURLOPT_POST, false);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_handle, CURLOPT_HEADER, 0);
    curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101');
    //$html = curl_exec($curl_handle);
    $html = file_get_html('http://au.creative.com/p/speakers/creative-t4-wireless');
    curl_close($curl_handle);


    //Display required part
    $xml = new DomDocument;
    @$xml->loadHTML($html);
    $xpath = new DomXpath($xml);
    $info = $xpath->query('//div[@class="wrapper features-contents"]')->item(0);
    echo utf8_decode($xml->saveXML($info));
    echo '<textarea rows="500" cols="100">' . $xml->saveXML($info) .'</textarea>';
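A note on the snippet above: it configures a cURL handle but leaves `curl_exec` commented out, so the page is actually fetched by `file_get_html` (which comes from the Simple HTML DOM library, not from PHP itself). The DOM step can be tried on its own without any network access. The following is a minimal sketch, assuming an inline sample string in place of the downloaded page; `extractDivByClass` is a hypothetical helper name, and the class name is assumed to contain no quote characters. It matches a single class token instead of the exact attribute string and serialises with `saveHTML` rather than `saveXML`.

```php
<?php
// Hypothetical helper (not in the original post): return the outer HTML of
// the first <div> whose class list contains $className, or null if none.
function extractDivByClass(string $html, string $className): ?string
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match one class token rather than the exact attribute string.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $className
    );
    $node = $xpath->query($query)->item(0);

    // saveHTML() keeps HTML serialisation (no self-closing XML syntax).
    return $node === null ? null : $doc->saveHTML($node);
}

// Sample markup standing in for the downloaded page (an assumption).
$sample = '<html><body><div class="wrapper features-contents">'
        . '<h3 class="feature-header">Pair and connect</h3></div></body></html>';

echo extractDivByClass($sample, 'features-contents'), "\n";
```

With the real page, `$sample` would be replaced by the string returned from `curl_exec($curl_handle)`, provided `CURLOPT_RETURNTRANSFER` is set as in the original code.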

Extracted code:

    <h3 class="feature-header">Pair and connect in so many ways</h3>
    <div class="row product-info-row">
    <div class="span12">
    <div id="slides-modes-21677" style="position:relative;">
    <a id="arrow-left-21677" class="slidesjs-previous slidesjs-navigation" href="#">
    <img src="//d287ku8w5owj51.cloudfront.net/inline/products/21430/arrow_left.jpg" border="0" alt="<" width="42" height="54"/></a> <div id="slide1">
    <img style="margin:0 20px 0 20px;" src="//d287ku8w5owj51.cloudfront.net/inline/products/21677/bluetooth.jpg.ashx?width=520&height=383" alt="Freedom without compromise" width="520" height="383" align="right"/>

Clearly, only the class names were extracted, not the styles they refer to. I remember that when you copy webpage content from Chrome and paste it into Firefox, the CSS is converted into inline styles. Is it possible to do the same in PHP?

Part of the webpage content I got in Firefox:

    <h3 class="feature-header" style="font-size: 2.2857em; margin: 20px 0px 30px; font-family: proxima-nova, Helvetica, Arial, sans-serif; line-height: 1.4; text-transform: uppercase; color: #666666; font-style: normal; font-variant: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">PAIR AND CONNECT IN SO MANY WAYS</h3>
    <div class="row product-info-row" style="margin-bottom: 60px; margin-left: -20px; color: #666666; font-family: proxima-nova, Helvetica, Arial, sans-serif; font-size: 14px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 21px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">
    <div class="span12" style="float: left; min-height: 1px; margin-left: 20px; width: 940px;">
    <div id="slides-modes-21677" style="position: relative; overflow: hidden;">
    <div class="slidesjs-container" style="overflow: hidden; position: relative; width: 940px; height: 383px;">
    <div class="slidesjs-control" style="position: relative; left: 0px; width: 940px; height: 383px;">
    <div id="slide1" class="slidesjs-slide" style="position: absolute; top: 0px; left: 0px; width: 940px; z-index: 10; -webkit-backface-visibility: hidden;"><img style="border: 0px; vertical-align: middle; margin: 0px 20px;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21677/bluetooth.jpg.ashx?width=520&amp;height=383" alt="Freedom without compromise" width="520" height="383" align="right" />
    <h3 class="feature-subheader" style="font-size: 1.7142em; margin: 30px 0px 0.8em; font-family: proxima-nova, Helvetica, Arial, sans-serif; line-height: 1.25em; color: #252525; font-weight: normal;">Freedom without compromise</h3>
    <p style="margin: 0px 0px 1em;"><em>Bluetooth</em><span class="Apple-converted-space">&nbsp;</span>wireless connectivity gives you the freedom and convenience to move around your room with your smart device as you're not tied down by any wires.<sup style="line-height: 0; position: relative; vertical-align: baseline; top: -0.5em;">1</sup><span class="Apple-converted-space">&nbsp;</span>And with aptX, you're assured of uncompromised audio quality.</p>
    </div>
    <div id="slide2" class="slidesjs-slide" style="position: absolute; top: 0px; left: 940px; width: 940px; z-index: 0; display: block; -webkit-backface-visibility: hidden;">
    <div style="margin: 0px 20px; vertical-align: middle; float: left;"><img id="fea_nfc_2" style="border: 0px; vertical-align: middle;" src="http://img.creative.com/inline/products/21677/fea_nfc_2.jpg" alt="" /></div>
    <h3 class="feature-subheader" style="font-size: 1.7142em; margin: 30px 0px 0.8em; font-family: proxima-nova, Helvetica, Arial, sans-serif; line-height: 1.25em; color: #252525; font-weight: normal;">Just tap and pair</h3>
    <p style="margin: 0px 0px 1em;">With the NFC (Near Field Communication) receptor on the Audio Control Pod, you can simply tap your NFC-enabled device on it to pair and then you're all set to stream and enjoy your music.</p>
    </div>
    <div id="slide3" class="slidesjs-slide" style="position: absolute; top: 0px; left: -940px; width: 940px; z-index: 0; display: block; -webkit-backface-visibility: hidden;"><img style="border: 0px; vertical-align: middle; margin: 0px 20px;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21677/multipoint.png.ashx?width=520&amp;height=383" alt="Stay connected" width="520" height="383" align="right" />
    <h3 class="feature-subheader" style="font-size: 1.7142em; margin: 30px 0px 0.8em; font-family: proxima-nova, Helvetica, Arial, sans-serif; line-height: 1.25em; color: #252525; font-weight: normal;">Stay connected</h3>
    <p style="margin: 0px 0px 1em;">Connect with multiple<span class="Apple-converted-space">&nbsp;</span><em>Bluetooth</em><span class="Apple-converted-space">&nbsp;</span>devices! With Creative Multipoint, you can have two<span class="Apple-converted-space">&nbsp;</span><em>Bluetooth</em><span class="Apple-converted-space">&nbsp;</span>stereo devices paired to the speakers at any one time and easily toggle between them.<sup style="line-height: 0; position: relative; vertical-align: baseline; top: -0.5em;">2</sup></p>
    </div>
    </div>
    </div>
    <a id="arrow-right-21677" class="slidesjs-next slidesjs-navigation" style="color: #0cbdef; text-decoration: none; cursor: pointer; display: block; overflow: hidden; position: absolute; top: 164.5px; z-index: 30; right: 0px;" href="http://au.creative.com/p/speakers/creative-t4-wireless#"><img style="border: 0px; vertical-align: middle;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21430/arrow_right.jpg" alt="&lt;" width="42" height="54" border="0" /></a></div>
    </div>
    </div>
    <div class="row" style="margin-left: -20px; color: #666666; font-family: proxima-nova, Helvetica, Arial, sans-serif; font-size: 14px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 21px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">
    <div class="span6" style="float: left; min-height: 1px; margin-left: 20px; width: 460px;">
    <div class="slides-perfect-audio" style="width: 460px; display: block; overflow: hidden;">
    <div class="slidesjs-container" style="overflow: hidden; position: relative; width: 460px; height: 327.8723404255319px;">
    <div class="slidesjs-control" style="position: relative; left: 0px; width: 460px; height: 327.8723404255319px;">
    <div class="slidesjs-slide" style="position: absolute; top: 0px; left: 0px; width: 460px; z-index: 10; -webkit-backface-visibility: hidden;"><img style="border: 0px; vertical-align: middle;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21677/optical.png" alt="Optical input" /></div>
    <div class="slidesjs-slide" style="position: absolute; top: 0px; left: 460px; width: 460px; z-index: 0; display: block; -webkit-backface-visibility: hidden;"><img style="border: 0px; vertical-align: middle;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21677/RCA.png" alt="RCA input" /></div>
    <div class="slidesjs-slide" style="position: absolute; top: 0px; left: -460px; width: 460px; z-index: 0; display: block; -webkit-backface-visibility: hidden;"><img style="border: 0px; vertical-align: middle;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21677/aux_in.png" alt="Aux in" /></div>
    </div>
    </div>
    <ul class="slidesjs-pagination" style="margin: 10px auto; padding: 0px; display: block; width: 60px; list-style: none;">
    <li class="slidesjs-pagination-item" style="display: inline; list-style: none; margin: 0px; padding: 0px;"><a class="active" style="color: #cccccc !important; text-decoration: none; cursor: pointer; padding: 0px; background-color: #999999; font-size: 1px; width: 8px; height: 8px; border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; border: 1px solid #999999; margin-right: 5px; display: inline-block; background-position: 100% 0%;" href="http://au.creative.com/p/speakers/creative-t4-wireless#" data-slidesjs-item="0">1</a></li>
    <li class="slidesjs-pagination-item" style="display: inline; list-style: none; margin: 0px; padding: 0px;"><a style="color: #ffffff; text-decoration: none; cursor: pointer; font-size: 1px; width: 8px; height: 8px; border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; border: 1px solid #999999; background-color: #ffffff; margin-right: 5px; display: inline-block;" href="http://au.creative.com/p/speakers/creative-t4-wireless#" data-slidesjs-item="1">2</a></li>
    <li class="slidesjs-pagination-item" style="display: inline; list-style: none; margin: 0px; padding: 0px;"><a style="color: #ffffff; text-decoration: none; cursor: pointer; font-size: 1px; width: 8px; height: 8px; border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; border: 1px solid #999999; background-color: #ffffff; margin-right: 5px; display: inline-block;" href="http://au.creative.com/p/speakers/creative-t4-wireless#" data-slidesjs-item="2">3</a></li>
    </ul>
    </div>
    </div>
    <div class="span6" style="float: left; min-height: 1px; margin-left: 20px; width: 460px;"><img style="border: 0px; vertical-align: middle;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21677/playing_games.jpg" alt="Switch to private listening" /></div>
    </div>
    <div class="row product-info-row" style="margin-bottom: 60px; margin-left: -20px; color: #666666; font-family: proxima-nova, Helvetica, Arial, sans-serif; font-size: 14px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 21px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">
    <div class="span6" style="float: left; min-height: 1px; margin-left: 20px; width: 460px;">
    <h4 class="feature-subheader" style="font-size: 1.7142em; margin: 30px 0px 0.8em; font-family: proxima-nova, Helvetica, Arial, sans-serif; line-height: 1.25em; color: #252525; font-weight: normal;">Even more connectivity options</h4>
    <p style="margin: 0px 0px 1em;">The Creative T4 Wireless comes with an optical input for digital signals, so you can directly send audio from sources such as your HD TV or sound cards without loss of resolution. It also has RCA analog inputs for connection to your video console or DVD player, as well as a 3.5mm input for connection to smart devices and portable media players.</p>
    </div>
    <div class="span6" style="float: left; min-height: 1px; margin-left: 20px; width: 460px;">
    <h4 class="feature-subheader" style="font-size: 1.7142em; margin: 30px 0px 0.8em; font-family: proxima-nova, Helvetica, Arial, sans-serif; line-height: 1.25em; color: #252525; font-weight: normal;">Switch to private listening</h4>
    <p style="margin: 0px 0px 1em;">For late-night gaming or movie-watching, there's no need to worry about waking up the household. The Creative T4 Wireless' Audio Control Pod is integrated with a dedicated headphone jack so that you can conveniently plug in your headphones when the need arises.</p>
    </div>
    </div>

Why don't you use wget for that?

    wget \
         --recursive \
         --no-clobber \
         --page-requisites \
         --html-extension \
         --convert-links \
         --restrict-file-names=windows \
         --domains website.org \
         --no-parent \
             www.website.org/tutorials/html/

http://www.linuxjournal.com/content/downloading-entire-web-site-wget
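If the goal is to stay in PHP, note that `DOMDocument` only parses markup; it does not compute styles the way a browser does when it builds a copy-and-paste fragment. The Firefox output in the question also contains elements such as `slidesjs-container` that are missing from the extracted HTML, which suggests they are generated by JavaScript and would not appear in a plain fetch. Libraries such as Emogrifier or CssToInlineStyles exist for inlining a stylesheet into markup. The following is only a rough sketch of the idea, assuming a hand-written map of class names to declarations; `inlineClassStyles` and the rule values are hypothetical, only bare class selectors are handled, and no cascade, specificity, or inheritance is computed.

```php
<?php
// Hypothetical helper: copy declarations from a hand-written rule map into
// the style attribute of every element carrying the given class.
function inlineClassStyles(string $html, array $rules): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($rules as $class => $declarations) {
        $query = sprintf(
            '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
            $class
        );
        foreach ($xpath->query($query) as $element) {
            // Rule declarations go first so that an existing inline style,
            // which comes later in the attribute, still takes precedence.
            $existing = trim($element->getAttribute('style'));
            $style = rtrim($declarations, '; ') . ';';
            if ($existing !== '') {
                $style .= ' ' . $existing;
            }
            $element->setAttribute('style', $style);
        }
    }

    // Serialise only the children of <body>, dropping the html/body
    // wrapper that loadHTML() adds around a fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Sample fragment and rules (assumptions, loosely based on the question).
$fragment = '<h3 class="feature-header">Pair and connect</h3>'
          . '<div id="slides" class="row" style="position:relative;">x</div>';
$rules = [
    'feature-header' => 'text-transform: uppercase; color: #666666',
    'row'            => 'margin-left: -20px',
];
echo inlineClassStyles($fragment, $rules), "\n";
```

A real stylesheet would first have to be downloaded and parsed into such a map, which is the part the libraries mentioned above take care of.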
