Extracting data from HTML using preg_match_all

I have a series of HTML pages from which I want to extract certain product information. The HTML is built up like this:

<h1 style="margin-top: 20px;">Productinformatie</h1>


<div class="group">
<div class="columns2">
            <table width="100%" cellpadding="4" cellspacing="0" border="0" class="product_info_table stripe">
    <tr style="background-color: #3c75a6; color: #fff; font-weight: bold;">
        <td colspan="2" style="background-color: #3c75a6; border-bottom: 2px solid #9dbeda;">Design</td>
    </tr>
                    <tr class="normal">
            <td width="250" valign="top"><b>Kleur van het product</b></td>
            <td><div style="max-height: 40px; overflow: hidden;">Zwart, Zilver</div></td>
        </tr>
.............
                    <tr class="normal">
            <td width="250" valign="top"><b>Hoogte (achterzijde)</b></td>
            <td><div style="max-height: 40px; overflow: hidden;">3 cm</div></td>
        </tr>
                </table>

</div>  
</div>

<div class="group" style="overflow-x: auto; overflow-y: hidden; height: 140px; white-space: nowrap;" id="image_scroll">

I use the line below, but it does not return any results. I need to find out how line breaks (BR) can be matched in preg_match_all:

        //Omschrijving  <h1 style="margin-top: 20px;">Productinformatie</h1>    <div class="group"> <div class="columns2">  </table>    </div>      </div>
//  preg_match_all('/\<h1 style\=\"margin-top\: 20px\;\"\>Productinformatie\<\/h1\>(.*?)\<ul style\=\"list\-style\-type\: none\;\"\>/s', $html, $matchomschrijving);  
    preg_match_all('/\<h1 style\=\"margin-top\: 20px\;\"\>Productinformatie\<\/h1\>(.*)?\<\/table\>.*?\<\/div\>?\<\/div\>/s', $html, $matchomschrijving);  
//  $tempomschrijvinghtml = str_replace('"',"'",$matchomschrijving[1][0]); 
    $tempomschrijvinghtml = MinifyHTML($matchomschrijving[1][0]);
//  $tempomschrijving = '<table>';
    $tempomschrijving .= $tempomschrijvinghtml;
    $tempomschrijving .= '</table></div></div>';
    echo 'Omschrijving: ' . $tempomschrijving . '<br>'; 

Thanks.
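On the line-break question itself: with the /s modifier the dot already matches newlines, so the line breaks inside the captured part are probably not the problem. One plausible reason for the empty result is the tail of the pattern, where `\<\/div\>?\<\/div\>` only matches two closing div tags with nothing between them, while the page has whitespace and a newline there. Below is a minimal sketch that allows optional whitespace with `\s*`. The sample string and the variable names `$pattern` and `$fragment` are made up for illustration, and the pattern has only been reasoned through against the snippet shown in the question, not against the real pages.

```php
<?php
// Hypothetical, shortened sample that mirrors the structure shown in the question.
$html = <<<'HTML'
<h1 style="margin-top: 20px;">Productinformatie</h1>
<div class="group">
<div class="columns2">
<table><tr><td>3 cm</td></tr></table>

</div>
</div>
HTML;

// "~" is used as the delimiter so "/" needs no escaping.
// (.*?</table>) lazily captures everything up to and including the first </table>;
// \s* lets the closing </div> tags be separated by spaces or line breaks;
// the "s" modifier lets "." match newline characters.
$pattern = '~<h1 style="margin-top: 20px;">Productinformatie</h1>(.*?</table>)\s*</div>\s*</div>~s';

// Check the return value before reading the capture group.
$fragment = null;
if (preg_match($pattern, $html, $m)) {
    $fragment = $m[1];
    echo $fragment;
}
```

Checking the return value first also avoids reading `$matchomschrijving[1][0]` when nothing matched.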

To search, extract and edit HTML, take advantage of the built-in DOMxxx classes and of the HTML structure itself. With the XPath language you can efficiently target the part of the DOM tree you want. Example:

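$dom = new DOMDocument;
// Collect libxml parse warnings internally instead of printing them (real-world HTML is rarely valid).
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($dom);

// First table inside div.columns2 of a div.group that follows the "Productinformatie" heading.
$nodeList = $xp->query('//h1[.="Productinformatie"]/following-sibling::div[@class="group"]/div[@class="columns2"]/table[1]');

// Output the table markup only if the query found a node.
if ($nodeList->length > 0) {
    echo $dom->saveHTML($nodeList->item(0));
}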
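If the goal is the product data rather than the table markup, the same XPath approach can go one step further and read the rows into an array. The following is only a sketch under some assumptions: the sample HTML is a shortened, made-up copy of the structure in the question, each data row is assumed to be a `tr` with class "normal" holding exactly two cells, and the `$specs` name is invented for this example.

```php
<?php
// Hypothetical sample mirroring the question's markup; replace with the real page source.
$html = <<<'HTML'
<h1 style="margin-top: 20px;">Productinformatie</h1>
<div class="group">
<div class="columns2">
<table class="product_info_table stripe">
<tr><td colspan="2">Design</td></tr>
<tr class="normal">
<td><b>Kleur van het product</b></td>
<td><div>Zwart, Zilver</div></td>
</tr>
<tr class="normal">
<td><b>Hoogte (achterzijde)</b></td>
<td><div>3 cm</div></td>
</tr>
</table>
</div>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($dom);

// Select the data rows of the first table after the "Productinformatie" heading.
// "//tr" is used so the query works whether or not a tbody element is present.
$rows = $xp->query(
    '//h1[.="Productinformatie"]/following-sibling::div[@class="group"]'
    . '/div[@class="columns2"]/table[1]//tr[@class="normal"]'
);

// Map each label (first cell) to its value (second cell).
$specs = [];
foreach ($rows as $row) {
    $cells = $xp->query('./td', $row);
    if ($cells->length === 2) {
        $label = trim($cells->item(0)->textContent);
        $specs[$label] = trim($cells->item(1)->textContent);
    }
}

print_r($specs);
```

Reading `textContent` drops the inner `b` and `div` tags, so no further clean-up of the cell markup should be needed.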


 