Scrape HTML page with multiple <table> tags and extract text from specific <a> tag descendants
I have this HTML source code in a database field. I would like to parse it, in particular the fields of some of the tables, and print them on the screen.
This is the code for the tables:
<table cellspacing="1" cellpadding="1" class="troop_details inReturn"
>
<thead>
<tr>
<td class="role">
<a href="/karte.php?d=91628">01] #WorkInProgress</a>
</td>
<td colspan="11" class="troopHeadline">
<a href="/karte.php?d=91611">Return from 01-soldier</a>
</td>
</tr>
</thead>
<tbody class="units">
<tr>
<th class="coords">
‭<span class="coordinates coordinatesWrapper coordinatesAligned coordinatesltr"><span class="coordinateX">(‭−‭1‬‬</span><span class="coordinatePipe">|</span><span class="coordinateY">‭−‭28‬‬)</span></span>‬ </th>
<td class="uniticon">
<img class="unit u21" title="Phalanx: 1:12:51" alt="Phalanx" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u22" title="Swordsman: 1:25:00" alt="Swordsman" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u23" title="Pathfinder: 0:30:00" alt="Pathfinder" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u24" title="Theutates Thunder: 0:26:51" alt="Theutates Thunder" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u25" title="Druidrider: 0:31:53" alt="Druidrider" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u26" title="Haeduan: 0:39:14" alt="Haeduan" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u27" title="Ram: 2:07:30" alt="Ram" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u28" title="Trebuchet: 2:50:00" alt="Trebuchet" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u29" title="Chieftain: 1:42:00" alt="Chieftain" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u30" title="Settler: 1:42:00" alt="Settler" src="/img/x.gif" /> </td>
<td class="uniticon last">
<img class="unit uhero" title="Hero" alt="Hero" src="/img/x.gif" /> </td>
</tr>
</tbody>
<tbody class="units last">
<tr>
<th>Troops</th>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit">
500 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none last">
0 </td>
</tr>
</tbody>
<tbody class="infos">
<tr>
<th>Bounty</th>
<td colspan="11">
<div class="res">
<div class="inlineIconList resourceWrapper"><div class="inlineIcon resources" title="Lumber"><i class="r1"></i><span class="value ">6758</span></div><div class="inlineIcon resources" title="Clay"><i class="r2"></i><span class="value ">8093</span></div><div class="inlineIcon resources" title="Iron"><i class="r3"></i><span class="value ">6908</span></div><div class="inlineIcon resources" title="Crop"><i class="r4"></i><span class="value ">15741</span></div></div> </div>
<div class="carry">
<img class="carry full" title="carry"
alt="carry"
src="/img/x.gif"/> ‭‭37500‬ / ‭37500‬‬ </div>
</td>
</tr>
</tbody>
<tbody class="infos">
<tr>
<th>Arrival</th>
<td colspan="11">
<div class="in">in <span class="timer" counting="down" value="85">0:01:25</span> hrs.</div>
<div class="at"><span>at 00:43:10</span><span> </span></div>
</td>
</tr>
</tbody>
</table>
<a name="at"></a>
<table cellspacing="1" cellpadding="1" class="troop_details inReturn"
>
<thead>
<tr>
<td class="role">
<a href="/karte.php?d=91628">01] #WorkInProgress</a>
</td>
<td colspan="11" class="troopHeadline">
<a href="/karte.php?d=94829">Return from 0-New Hulk</a>
</td>
</tr>
</thead>
<tbody class="units">
<tr>
<th class="coords">
‭<span class="coordinates coordinatesWrapper coordinatesAligned coordinatesltr"><span class="coordinateX">(‭−‭1‬‬</span><span class="coordinatePipe">|</span><span class="coordinateY">‭−‭28‬‬)</span></span>‬ </th>
<td class="uniticon">
<img class="unit u21" title="Phalanx: 0:45:33" alt="Phalanx" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u22" title="Swordsman: 0:53:09" alt="Swordsman" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u23" title="Pathfinder: 0:18:46" alt="Pathfinder" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u24" title="Theutates Thunder: 0:16:47" alt="Theutates Thunder" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u25" title="Druidrider: 0:19:56" alt="Druidrider" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u26" title="Haeduan: 0:24:32" alt="Haeduan" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u27" title="Ram: 1:19:44" alt="Ram" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u28" title="Trebuchet: 1:46:18" alt="Trebuchet" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u29" title="Chieftain: 1:03:47" alt="Chieftain" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u30" title="Settler: 1:03:47" alt="Settler" src="/img/x.gif" /> </td>
<td class="uniticon last">
<img class="unit uhero" title="Hero" alt="Hero" src="/img/x.gif" /> </td>
</tr>
</tbody>
<tbody class="units last">
<tr>
<th>Troops</th>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit">
400 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none last">
0 </td>
</tr>
</tbody>
<tbody class="infos">
<tr>
<th>Bounty</th>
<td colspan="11">
<div class="res">
<div class="inlineIconList resourceWrapper"><div class="inlineIcon resources" title="Lumber"><i class="r1"></i><span class="value ">6130</span></div><div class="inlineIcon resources" title="Clay"><i class="r2"></i><span class="value ">5835</span></div><div class="inlineIcon resources" title="Iron"><i class="r3"></i><span class="value ">5638</span></div><div class="inlineIcon resources" title="Crop"><i class="r4"></i><span class="value ">12397</span></div></div> </div>
<div class="carry">
<img class="carry full" title="carry"
alt="carry"
src="/img/x.gif"/> ‭‭30000‬ / ‭30000‬‬ </div>
</td>
</tr>
</tbody>
<tbody class="infos">
<tr>
<th>Arrival</th>
<td colspan="11">
<div class="in">in <span class="timer" counting="down" value="920">0:15:20</span> hrs.</div>
<div class="at"><span>at 00:57:05</span><span> </span></div>
</td>
</tr>
</tbody>
</table>
The data that interest me are the following:
Thanks to your advice, this is my code at the moment:
<?php include 'database.php' ?>
<?php session_start(); ?>
<?php
include_once('simple_html_dom.php');
$caserma = $_SESSION["caserma"];
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($_SESSION["caserma"], LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$texts = [];
foreach ($xpath->query("//table[contains(@class, 'troop_details') and contains(@class, 'inReturn')]//td[@class='troopHeadline']//a[@href]/text()") as $textNode) {
$texts[] = $textNode->nodeValue;
}
var_export($texts);
?>
But the output it gives me is an empty array: array ( )
Code, assuming $_SESSION["caserma"] contains your full HTML document:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($_SESSION["caserma"]);
$xpath = new DOMXPath($dom);
$texts = [];
foreach ($xpath->query("//table[contains(@class, 'troop_details') and contains(@class, 'inReturn')]//td[@class='troopHeadline']//a[@href]/text()") as $textNode) {
$texts[] = $textNode->nodeValue;
}
var_export($texts);
Output from your sample input:
array (
0 => 'Return from 01-soldier',
1 => 'Return from 0-New Hulk',
)
XPath Breakdown:
// # search to any depth in the document
table[contains(@class, 'troop_details') and contains(@class, 'inReturn')] # find all table tags with both `troop_details` and `inReturn` classes
// # continue searching any descendants of any matches
td[@class='troopHeadline'] # match all td tags with `troopHeadline` as its class
// # continue searching any descendants of any matches
a[@href] # match all a tags with an href attribute
/ # search the immediate descendant (any first generation child)
text() # match the text of the parent a tag
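If more than one field is needed per table, one possible approach is to select each table first and then run relative expressions (starting with ".") against it, passing the table as the context node to DOMXPath::query() or DOMXPath::evaluate(). The following is only a sketch: the $html sample is a trimmed-down, hypothetical copy of the markup in the question, and the choice of fields (headline, href, arrival) is an assumption about what might be wanted.

```php
<?php
// Hypothetical sample: a trimmed-down version of the two tables from the question.
$html = <<<'HTML'
<table class="troop_details inReturn">
  <thead><tr>
    <td class="role"><a href="/karte.php?d=91628">01] #WorkInProgress</a></td>
    <td colspan="11" class="troopHeadline"><a href="/karte.php?d=91611">Return from 01-soldier</a></td>
  </tr></thead>
  <tbody class="infos"><tr><th>Arrival</th>
    <td colspan="11"><div class="at"><span>at 00:43:10</span><span> </span></div></td>
  </tr></tbody>
</table>
<table class="troop_details inReturn">
  <thead><tr>
    <td class="role"><a href="/karte.php?d=91628">01] #WorkInProgress</a></td>
    <td colspan="11" class="troopHeadline"><a href="/karte.php?d=94829">Return from 0-New Hulk</a></td>
  </tr></thead>
  <tbody class="infos"><tr><th>Arrival</th>
    <td colspan="11"><div class="at"><span>at 00:57:05</span><span> </span></div></td>
  </tr></tbody>
</table>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$rows = [];
$tables = $xpath->query("//table[contains(@class, 'troop_details') and contains(@class, 'inReturn')]");
foreach ($tables as $table) {
    // The leading "." makes the expression relative to the current table.
    $link = $xpath->query(".//td[@class='troopHeadline']//a[@href]", $table)->item(0);
    if ($link === null) {
        continue; // no headline link in this table
    }
    $rows[] = [
        'headline' => trim($link->textContent),
        'href'     => $link->getAttribute('href'),
        // string(...) returns the text of the first matching node, or '' if none matches.
        'arrival'  => trim($xpath->evaluate("string(.//div[@class='at']/span[1])", $table)),
    ];
}
var_export($rows);
```

Keeping the per-table values together in one array entry avoids having to line up several flat result lists by index afterwards.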
libxml_use_internal_errors(true) is used to silence any potential errors from an "invalid" document; the parser collects them internally instead of emitting warnings.
It is important to use contains(...) and contains(...) in the XPath so that even if the class attributes change their order or new classes are added to the element, the XPath will still match correctly.
The foreach() loop will iterate all qualifying text nodes.
Read each text node's nodeValue and push it into the result array.
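One caveat with contains(@class, '...') is that it is a plain substring test, so it would also match a class such as inReturnLater. If that matters for your data, a common XPath 1.0 idiom is to pad the normalized class attribute with spaces and test for the whole token. The sketch below compares both forms; the markup, the inReturnLater class, and the $hasClass helper are all made up for illustration.

```php
<?php
// Hypothetical markup: the second table's class only contains "inReturn" as a substring.
$html = <<<'HTML'
<table class="inReturn troop_details extra"><tr><td class="troopHeadline"><a href="/a">first</a></td></tr></table>
<table class="troop_details inReturnLater"><tr><td class="troopHeadline"><a href="/b">second</a></td></tr></table>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Hypothetical helper: builds an XPath 1.0 predicate that matches one whole class token.
$hasClass = function (string $name): string {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
};

$tail   = "//td[@class='troopHeadline']//a[@href]";
$loose  = "//table[contains(@class, 'troop_details') and contains(@class, 'inReturn')]" . $tail;
$strict = "//table[" . $hasClass('troop_details') . " and " . $hasClass('inReturn') . "]" . $tail;

// Collect the link text of every node matched by an expression.
$collect = function (string $expr) use ($xpath): array {
    $out = [];
    foreach ($xpath->query($expr) as $a) {
        $out[] = $a->textContent;
    }
    return $out;
};

$looseResult  = $collect($loose);   // substring test: matches both tables
$strictResult = $collect($strict);  // whole-token test: matches only the first table
var_export($looseResult);
var_export($strictResult);
```

Both forms are independent of class order; the stricter one simply trades a longer expression for not matching look-alike class names.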