Get the innerHTML of an element, but not the element itself
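I am extracting data from a two-column table. The first column is the variable name and the second column is that variable's data.
I almost have it working, but some of the data can contain HTML, which is usually wrapped in a DIV. I want to get the HTML inside the DIV, but not the DIV itself. I know a regex might be a solution, but I would like to understand DOMDocument better.
Here is my code so far: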
private function readHtml()
{
    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $htmlData = curl_exec($curl);
    curl_close($curl);
    $dom = new \DOMDocument();
    $html = $dom->loadHTML($htmlData);
    $dom->preserveWhiteSpace = false;
    $tables = $dom->getElementsByTagName('table');
    $rows = $tables->item(0)->getElementsByTagName('tr');
    $cols = $rows->item(1)->getElementsByTagName('td');
    $table = [];
    $key = null;
    $value = null;
    foreach ($rows as $i => $row){
        //skip the heading columns
        if($i <= 1 ) continue;
        $cols = $row->getElementsByTagName('td');
        foreach ($cols as $count => $node) {
            if($count == 0) {
                $key = strtolower(str_replace(' ', '_',$node->textContent));
            } else {
                $htmlNode = $node->getElementsByTagName('div');
                if($htmlNode->length >=1) {
                    $innerHTML= '';
                    foreach ($htmlNode as $innerNode) {
                        $innerHTML .= $innerNode->ownerDocument->saveHTML( $innerNode );
                    }
                    $value = $innerHTML;
                } else {
                    $value = $node->textContent;
                }
            }
        }
        $table[$key] = $value;
    }
    return $table;
}
My output is correct, except that I don't want the wrapper DIV included for the values that contain HTML:
Array
(
[type] => raw
[direction] => north
[intro] => Welcome to the test.
[html_body] => <div class="softmerge-inner" style="width: 5653px; left: -1px;">Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut <span style="font-weight:bold;">aliquip</span> ex ea commodo consequat. Duis aute irure dolor in <span style="text-decoration:underline;">reprehenderit</span> in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, <span style="font-style:italic;">sunt in</span> culpa qui officia deserunt mollit anim id est laborum.</div>
[count] => 1003
)
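Update
Based on some of the feedback and ideas in the answers, here is the current iteration of the function, which is leaner and returns the desired output. I'm not thrilled about the double regex, but it works.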
private function readHtml()
{
    # the url given in your example
    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";
    $dom = new \DOMDocument();
    $dom->loadHTMLFile($url);
    $dom->preserveWhiteSpace = false;
    $tables = $dom->getElementsByTagName('table');
    $rows = $tables->item(0)->getElementsByTagName('tr');
    $cols = $rows->item(1)->getElementsByTagName('td');
    $table = [];
    $key = null;
    $value = null;
    foreach ($rows as $i => $row){
        //skip the heading columns
        if($i <= 1 ) continue;
        $cols = $row->getElementsByTagName('td');
        foreach ($cols as $count => $node) {
            if($count == 0) {
                $key = strtolower(str_replace(' ', '_',$node->textContent));
            } else {
                $value = $node->ownerDocument->saveHTML( $node );
                $value = preg_replace('/(<div.*?>|<\/div>)/','',$value);
                $value = preg_replace('/(<td.*?>|<\/td>)/','',$value);
            }
        }
        $table[$key] = $value;
    }
    return $table;
}
Use preg_replace! Like this: $table['html_body']=preg_replace('/(<div.*?>|<\/div>)/','',$table['html_body']);
See the PHP manual for preg_replace, and a regular-expression reference for the pattern syntax.
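To make the regex approach easier to try in isolation, here is a minimal sketch that runs it against a hard-coded string instead of the live sheet. The sample cell is an assumption modeled on the html_body value in the question, and the pattern is a slightly tightened variant of the one above rather than the answerer's exact expression.

```php
<?php
// Hypothetical sample cell, modeled on the html_body value in the question.
$cell = '<div class="softmerge-inner" style="width: 5653px; left: -1px;">'
      . 'Lorem <span style="font-weight:bold;">ipsum</span> dolor</div>';

// Remove every opening <div ...> tag and every closing </div> tag.
// [^>]* stops at the end of the tag; the "i" flag also matches <DIV>.
$inner = preg_replace('/<div\b[^>]*>|<\/div>/i', '', $cell);

echo $inner, "\n";
```

Note that this strips all DIV tags in the string, including nested ones, and it would misfire on an attribute value that contains a ">" character, which is one reason a DOM-based approach is usually more robust.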
<?php
include 'simple_html_dom.php';//<--- Must download to current directory
$url = 'https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml';
$html = file_get_html( $url );
foreach ( $html->find( "div[class=softmerge-inner]" ) as $element ) {
    echo $element->innertext;
    //See http://simplehtmldom.sourceforge.net/manual.htm for usage
}
?>
You're on the right track! The next level is to learn the very powerful xpath expressions that a parser like DOMDocument provides. Consider this code example:
<?php
# the url given in your example
$url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";
$doc = new \DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new \DOMXpath($doc);
# here comes the magic
$html_body = $xpath->query("//td[text()='html_body']")->item(0);
$div_text = $html_body->nextSibling->textContent;
echo $div_text;
?>
The trick is to query the DOM for the column whose text node equals html_body, which is done with //td[text()='html_body'] (in general: //td[here comes the expression to filter on all columns in the dom]). After that, you simply take the next sibling. With this in mind, you could even rewrite the whole function as a foreach over all rows in the waffle table:
foreach($xpath->query("//table[@class='waffle']//tr") as $row) {
// do sth. useful here
}
For your specific example, this could be (it's a bit shorter, isn't it?):
<?php
$url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";
$doc = new \DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new \DOMXpath($doc);
foreach ($xpath->query("//table[@class='waffle']//tr") as $row) {
    # select the <td> cells that are direct children of this row
    $columns = $xpath->query("./td", $row);
    # skip rows without a key and a value cell (e.g. header rows)
    if ($columns->length < 2) continue;
    $key_td = $columns->item(0);
    $value_td = $columns->item(1);
    echo "[" . $key_td->nodeValue . "]: " . $value_td->nodeValue . "\n";
}
?>
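The snippets above read nodeValue or textContent, which returns plain text and therefore drops inline markup such as the SPAN tags. If the goal is the HTML inside the DIV without the DIV itself, one option is to serialize each child node of the wrapper with saveHTML. The following is a minimal sketch under stated assumptions: innerHtml() is a hypothetical helper (DOMDocument has no built-in innerHTML), and the inline $sample table is a stand-in for the published sheet so the code runs offline.

```php
<?php
// Hypothetical helper (not part of DOMDocument): concatenates the serialized
// children of a node, which yields its inner HTML without the node itself.
function innerHtml(\DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Inline stand-in for the published sheet (assumed structure).
$sample = <<<HTML
<table class="waffle">
<tr><th>1</th><td>intro</td><td>Welcome to the test.</td></tr>
<tr><th>2</th><td>html_body</td><td><div class="softmerge-inner">Lorem <span style="font-weight:bold;">ipsum</span></div></td></tr>
</table>
HTML;

$doc = new \DOMDocument();
$doc->loadHTML($sample);
$xpath = new \DOMXPath($doc);

$table = [];
foreach ($xpath->query("//table[@class='waffle']//tr") as $row) {
    $columns = $xpath->query("./td", $row);
    if ($columns->length < 2) {
        continue; // rows without a key cell and a value cell are skipped
    }
    $key = strtolower(str_replace(' ', '_', trim($columns->item(0)->textContent)));
    $valueTd = $columns->item(1);
    // Use the wrapper <div> when the cell has one, otherwise the cell itself.
    $wrapper = $xpath->query("./div", $valueTd)->item(0);
    $table[$key] = innerHtml($wrapper ?? $valueTd);
}

print_r($table);
```

Serializing the children one by one keeps the SPAN formatting while leaving out the wrapper, so no regex pass over the markup is needed.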