PHP: extract all text from an HTML page

Question

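I have been scratching my head over this for the past hour. Is there a reliable way to extract only the text from an HTML page, with nothing else (code, images, links, styles, scripts)? I am trying to extract all of the text in the body of an HTML document.

This includes paragraphs, plain text, and table data.

So far I have tried the simplehtmldom parser and file_get_contents, but neither of them works. Here is the code: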

<?php

require_once "simple_html_dom.php";

function getplaintextintrofromhtml($html) {

    // Remove the HTML tags
    $html = strip_tags($html);

    // Convert HTML entities to single characters
    $html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    return $html;

}

$html = file_get_contents('http://www.thefreedictionary.com/contempt');

echo getplaintextintrofromhtml($html);
?>

Here is a screenshot of the output:

https://docs.google.com/file/d/0B-b63LoI1gSfaGhpR0NvdUtlbW8/edit?usp=drivesdk

As you can see, the output looks strange, and it does not even show the text of the whole page.
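
A likely cause of the stray output, offered as a hedged note: strip_tags() removes the tags themselves but keeps the text between them, so the CSS inside <style> and the JavaScript inside <script> survive as "text". The sketch below shows the effect on a made-up sample string and one possible workaround. The sample markup and the helper name strip_tags_without_code are assumptions for illustration, and a regular expression is only a rough filter for real-world HTML.

```php
<?php
// Hypothetical sample input; not taken from the page in the question.
$html = '<html><head><style>p { color: red; }</style>'
      . '<script>var x = 1;</script></head>'
      . '<body><p>Hello &amp; welcome</p></body></html>';

// strip_tags() drops the tags but keeps the CSS and JavaScript source text.
$naive = strip_tags($html);

// Hypothetical helper: remove <script> and <style> blocks first, then strip tags.
function strip_tags_without_code(string $html): string
{
    // Delete each script/style element together with its contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    // Strip the remaining tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace left behind by the removed markup.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$clean = strip_tags_without_code($html);

echo $naive, PHP_EOL;
echo $clean, PHP_EOL;
```

Here $naive still contains the stylesheet and script source, while $clean holds only the paragraph text.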

Answer 1

I think PHP Simple HTML DOM Parser is the quickest and easiest approach. Try it at http://simplehtmldom.sourceforge.net/

Features:
An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way.
Requires PHP 5+.
Supports invalid HTML.
Find tags on an HTML page with selectors, just like jQuery.
Extract contents from HTML in a single line.

Answer 2

I don't know why you think SimpleHTMLDOM doesn't work. You just need to use it correctly: target only the body, then use the ->innertext property:

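function getplaintextintrofromhtml($url) {
    // require_once avoids redeclaring the library if this function is called twice
    require_once 'simple_html_dom.php';

    $html = file_get_html($url);
    // point to the body, then get the innertext
    // (innertext still contains the inner HTML markup; the ->plaintext
    // property returns the text with the tags stripped)
    $data = $html->find('body', 0)->innertext;
    return $data;
}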

echo getplaintextintrofromhtml('http://www.thefreedictionary.com/contempt');
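
For comparison, here is a minimal sketch that uses PHP's built-in DOM extension instead of SimpleHTMLDOM. This is a swapped-in technique, not the method from this answer. It collects the text nodes under <body> and skips anything inside <script>, <style>, or <noscript>. The helper name body_text_from_html and the sample markup are assumptions for illustration, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: returns the readable text of <body> using the DOM extension.
function body_text_from_html(string $html): string
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumes UTF-8 input; the prefix hints that encoding to the HTML parser.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select text nodes under <body>, skipping those inside script/style/noscript.
    $nodes = $xpath->query(
        '//body//text()[not(ancestor::script or ancestor::style or ancestor::noscript)]'
    );

    $parts = [];
    foreach ($nodes as $node) {
        // Normalize whitespace inside each text node and drop empty ones.
        $text = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    // Join the pieces with single spaces so adjacent table cells do not run together.
    return implode(' ', $parts);
}

// Hypothetical sample input.
$sample = '<html><head><title>t</title><style>p{color:red}</style></head>'
        . '<body><script>var x = 1;</script><p>Hello &amp; welcome</p>'
        . '<table><tr><td>cell A</td><td>cell B</td></tr></table></body></html>';

echo body_text_from_html($sample), PHP_EOL;
```

Joining text nodes with a space keeps table cells apart, at the cost of also inserting a space where inline tags split a word. For a remote page, the string returned by file_get_contents($url) could be passed in place of $sample.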

Answer 3

Html2Text is a good library for this.

https://github.com/mtibben/html2text

Install it with Composer:

composer require html2text/html2text

Basic usage:

$html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');

echo $html->getText();  // Hello, "WORLD"

3 answers

Answer 1
1 2014-11-25 10:52:29

Answer 2
1 Accepted 2014-11-25 10:55:49

Answer 3
0 2017-03-27 10:18:52
