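Get an element's innerHTML, not the element itself

Question

I am extracting data from a two-column table. The first column is the variable name, and the second column is the data for that variable.

I almost have it working, but some of the data may contain HTML, and that HTML is usually wrapped in a DIV. I want to get the HTML inside the DIV, not the DIV itself. I know a regular expression might be a solution, but I would like to understand DOMDocument better.

Here is my code so far: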

private function readHtml()
{

    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $htmlData = curl_exec($curl);
    curl_close($curl);

    $dom        = new \DOMDocument();
    $html       = $dom->loadHTML($htmlData);
    $dom->preserveWhiteSpace = false;

    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');
    $cols       = $rows->item(1)->getElementsByTagName('td');

    $table = [];
    $key = null;
    $value = null;

    foreach ($rows as $i => $row){

        //skip the heading columns
        if($i <= 1 ) continue;

        $cols = $row->getElementsByTagName('td');

        foreach ($cols as $count => $node) {

            if($count == 0) {

                $key = strtolower(str_replace(' ', '_',$node->textContent));

            } else {

               $htmlNode = $node->getElementsByTagName('div');

                if($htmlNode->length >=1) {

                    $innerHTML= '';

                    foreach ($htmlNode as $innerNode) {

                        $innerHTML .= $innerNode->ownerDocument->saveHTML( $innerNode );
                    }

                    $value = $innerHTML;

                } else {

                    $value = $node->textContent;
                }
            }
        }

        $table[$key] = $value;
    }

    return $table;
}

My output is correct, except that I don't want the wrapping DIV included for data that contains HTML:

    Array
    (
        [type] => raw
        [direction] => north
        [intro] => Welcome to the test. 
        [html_body] => <div class="softmerge-inner" style="width: 5653px; left: -1px;">Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut <span style="font-weight:bold;">aliquip</span> ex ea commodo consequat. Duis aute irure dolor in <span style="text-decoration:underline;">reprehenderit</span> in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, <span style="font-style:italic;">sunt in</span> culpa qui officia deserunt mollit anim id est laborum.</div>
        [count] => 1003
    )

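Update

Based on some feedback and ideas from the answers, here is the current iteration of the function. It is leaner and returns the desired output. I don't feel great about the double regex, but it works.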

private function readHtml()
{

    # the url given in your example
    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

    $dom = new \DOMDocument();
    $dom->loadHTMLFile($url);
    $dom->preserveWhiteSpace = false;

    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');
    $cols       = $rows->item(1)->getElementsByTagName('td');

    $table = [];
    $key = null;
    $value = null;

    foreach ($rows as $i => $row){

        //skip the heading columns
        if($i <= 1 ) continue;

        $cols = $row->getElementsByTagName('td');

        foreach ($cols as $count => $node) {

            if($count == 0) {

                $key = strtolower(str_replace(' ', '_',$node->textContent));

            } else {

                $value = $node->ownerDocument->saveHTML( $node );

                $value = preg_replace('/(<div.*?>|<\/div>)/','',$value);
                $value = preg_replace('/(<td.*?>|<\/td>)/','',$value);
            }
        }

        $table[$key] = $value;
    }

    return $table;
}

Answer 1

Use `preg_replace`! Like this:

$table['html_body']=preg_replace('/(<div.*?>|<\/div>)/','',$table['html_body']);

See the PHP documentation for preg_replace and for regular expression usage.
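To make the effect of that pattern concrete, here is a minimal sketch on a hard-coded string. The sample markup and the variable names are assumptions for illustration and are not read from the spreadsheet. Note that the pattern removes every div tag it finds, nested ones included, and leaves other inline tags such as span in place.

```php
<?php
// Hypothetical sample value, shaped like the html_body entry in the question.
$value = '<div class="softmerge-inner" style="width: 5653px; left: -1px;">'
       . 'Lorem <span style="font-weight:bold;">ipsum</span> dolor</div>';

// Strip opening and closing div tags; everything between them is kept.
$inner = preg_replace('/(<div.*?>|<\/div>)/', '', $value);

echo $inner, "\n";
// prints: Lorem <span style="font-weight:bold;">ipsum</span> dolor
```

A regex like this is fine for a known, simple wrapper, but it is not a general HTML parser, so it may misbehave on unusual markup.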

Alternatively, you can use simple_html_dom.php like this:

<?php
include 'simple_html_dom.php';//<--- Must download to current directory
$url = 'https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml';
$html = file_get_html( $url );
foreach ( $html->find( "div[class=softmerge-inner]" ) as $element ) {
    echo $element->innertext;
    //See http://simplehtmldom.sourceforge.net/manual.htm for usage
}
?>

Answer 2

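You're on the right track! The next level is to learn XPath expressions, which are very powerful and which a parser like DOMDocument supports. Consider this code example: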

<?php
# the url given in your example    
$url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

$doc = new \DOMDocument();
$doc->loadHTMLFile($url);

$xpath = new \DOMXpath($doc);

# here comes the magic
$html_body = $xpath->query("//td[text()='html_body']")->item(0);
$div_text = $html_body->nextSibling->textContent;
echo $div_text;
?>

The trick is to query the DOM for the column whose text node equals html_body. This is done with //td[text()='html_body'], where the expression in square brackets filters across all td columns in the DOM. After that, just take the next sibling. With this in mind, you could even rewrite the whole function with a foreach over all rows of the waffle table:

foreach($xpath->query("//table[@class='waffle']//tr") as $row) {
    // do sth. useful here
}

For your specific example, this could be (it's a bit shorter, isn't it?):

<?php
$url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";
$doc = new \DOMDocument();
$doc->loadHTMLFile($url);

$xpath = new \DOMXpath($doc);

foreach ($xpath->query("//table[@class='waffle']//tr") as $row) {
    $columns = $xpath->query("./td", $row);

    $key_td = $columns->item(0);
    $value_td = $columns->item(1);
    echo "[" . $key_td->nodeValue . "]: " . $value_td->nodeValue . "\n";
}

?>
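Note that nodeValue and textContent return text only, so inline tags such as span are dropped from the value. If the markup inside the DIV is needed, one option is to serialize the child nodes of the DIV with saveHTML(). The sketch below rests on assumptions: the inline HTML string stands in for the published sheet, and innerHtml() is a hypothetical helper, not a DOMDocument method.

```php
<?php
// Hypothetical helper: concatenate the serialized children of a node,
// which yields the markup inside the element without the element itself.
function innerHtml(\DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Inline stand-in for the published sheet (assumed structure).
$source = '<table class="waffle">'
        . '<tr><td>type</td><td>raw</td></tr>'
        . '<tr><td>html_body</td><td><div class="softmerge-inner">'
        . 'Lorem <span style="font-weight:bold;">ipsum</span></div></td></tr>'
        . '</table>';

$doc = new \DOMDocument();
$doc->loadHTML($source);
$xpath = new \DOMXPath($doc);

$table = [];
foreach ($xpath->query("//table[@class='waffle']//tr") as $row) {
    $columns = $xpath->query("./td", $row);
    if ($columns->length < 2) {
        continue; // skip header or malformed rows
    }
    $key = strtolower(str_replace(' ', '_', trim($columns->item(0)->textContent)));

    // Prefer the wrapper DIV when present, otherwise use the cell itself.
    $div = $xpath->query("./div", $columns->item(1))->item(0);
    $table[$key] = innerHtml($div ?? $columns->item(1));
}

print_r($table);
```

Serializing child nodes keeps the work inside the DOM API, so no regex pass over the output is needed to drop the wrapper.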

Answer 1: 1, accepted, 2016-04-27 19:16:57
Answer 2: 1, 2016-04-27 19:50:57
