
Extracting HTML tables with a regular expression using the PHP function preg_match_all

I want to extract the tables from an HTML page that contains nested <table> tags, and then extract the <tr> and <td> elements of those tables.

I am using the following, which works fine for <b> and </b>:

$file = file_get_contents($url);
preg_match_all ("/<b>(.*)<\/b>/U", $file, $pat_array);
print $pat_array[0][0]." <br> ".$pat_array[0][1]."\n";

Can anybody tell me the regular expression for a nested <table (some table properties)> ... </table> that holds its data in <tr> and <td> elements? Please keep the href if one is present in the <tr> or <td> fields, and keep in mind which tables are needed.

Example:

$file = "<html> <head> <title> asdf </title> </head> <body bgcolor = red >  <table border = 1> <table bgcolor = white> (some tr and td data) </table> </table> </body> </html>";

preg_match_all ("regular expression for table tag", $file, $pat_array);
print $pat_array[0][0]." <br> ".$pat_array[0][1]."\n";

Update 1:

When I tried the code below, it showed this error:

Notice: Undefined offset: 0 in C:\\xampp\\htdocs\\testphp\\tabledata.php on line 27

Code:

$file = file_get_contents($url);
$pat_array = Array();
preg_match_all ("/<tr>(.*)<\/tr>/U", $file, $pat_array);
print $pat_array[1][0];

Can anybody help me with this error as well?

Don't try to parse HTML with regex; use DOMDocument and DOMXPath instead.

$dom = new DOMDocument();
$dom->loadHTML($file);

$xpath = new DOMXPath($dom);
$tableNodes = $xpath->query('//table'); // select all table nodes

// do something, e.g. print node content
foreach ($tableNodes as $tableNode) {
    print $tableNode->nodeValue;
}
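
Since the question asks for the <tr> and <td> contents and any href values, here is a minimal sketch of how that could look with the same approach. The sample markup, the $rows structure and the "innermost tables only" XPath filter are assumptions made for illustration, not something taken from the question; adjust the query to whichever tables you actually need. Checking that something was found before indexing into the result also avoids notices like the "Undefined offset" one from the update.

```php
<?php
// Hypothetical sample input; in the question this would come from file_get_contents($url).
$file = '<html><body>'
      . '<table border="1"><tr><td>'
      . '<table bgcolor="white">'
      . '<tr><td><a href="/first">First</a></td><td>plain</td></tr>'
      . '<tr><td>second row</td></tr>'
      . '</table>'
      . '</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings for sloppy HTML instead of printing them
$dom->loadHTML($file);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Assumption: only the innermost tables (those containing no further table) are wanted.
$innerTables = $xpath->query('//table[not(.//table)]');

$rows = [];
foreach ($innerTables as $table) {
    // The second argument makes the query relative to the current table.
    foreach ($xpath->query('.//tr', $table) as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            // Collect every href found inside this cell.
            $links = [];
            foreach ($xpath->query('.//a[@href]', $td) as $a) {
                $links[] = $a->getAttribute('href');
            }
            $cells[] = ['text' => trim($td->textContent), 'links' => $links];
        }
        $rows[] = $cells;
    }
}

// Guard against an empty result before indexing into it.
if (count($rows) === 0) {
    print "No rows found\n";
} else {
    print_r($rows);
}
```

Each entry of $rows is one table row, and each cell keeps its text together with the links it contained, so the href values are not lost.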

There are many more queries you can perform with XPath; have a look here. You probably also want to do something with the selected nodes other than just printing their content. If you are looking for the sub-DOM of each table, try this:

foreach ($tableNodes as $tableNode) {
    $newDom = new DOMDocument();
    $clone = $tableNode->cloneNode(true);
    $clone = $newDom->importNode($clone, true);
    $newDom->appendChild($clone);

    $html = $newDom->saveHTML();
}
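
If all you need is the markup of each table as a string, a shorter variant may be enough: since PHP 5.3.6, saveHTML() accepts a node and serializes only that subtree, so no second document is required. This is a sketch with a made-up sample string; the $tablesHtml array is a name introduced here for illustration.

```php
<?php
// Hypothetical sample input standing in for the fetched page.
$file = '<html><body><table border="1"><tr><td>'
      . '<table bgcolor="white"><tr><td>cell</td></tr></table>'
      . '</td></tr></table></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($file);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$tablesHtml = [];
foreach ($xpath->query('//table') as $tableNode) {
    // Passing a node to saveHTML() dumps just that node and its children.
    $tablesHtml[] = $dom->saveHTML($tableNode);
}

print count($tablesHtml) . " tables found\n";
print $tablesHtml[1] . "\n"; // markup of the nested (inner) table
```

Note that the outer table's string still includes the nested table, because the nested table is part of its subtree.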


 