
How can I retrieve clean text inside HTML tags using PHP?

I have a form that accepts HTML data, but we only need the text it contains, nothing else. Is there a particular way to extract the text from the HTML in PHP?

Use strip_tags().
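
A minimal sketch of that suggestion, assuming the form input arrives as a UTF-8 string. The helper name `plainText` is hypothetical; `strip_tags()`, `html_entity_decode()` and `preg_replace()` are PHP built-ins. Keep in mind that `strip_tags()` removes only the tags themselves, so the text inside `<script>` or `<style>` elements survives, and entities such as `&amp;` stay encoded unless they are decoded afterwards.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to plain text.
function plainText(string $html): string
{
    // Remove all HTML and PHP tags; the text between them is kept.
    $text = strip_tags($html);

    // Decode entities such as &amp; and &quot; (assumes UTF-8 input).
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace into single spaces and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo plainText('<p>Hello <b>world</b> &amp; friends</p>'), PHP_EOL;
// prints "Hello world & friends"
```

If the input may contain scripts or style sheets, strip those elements first or use a DOM parser.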

It can certainly be done. Take a look at this function and adapt it as you like:

function html2txt($document)
{
    // Pattern => replacement pairs, applied in order. Keeping each pair on one
    // line stops the patterns and replacements from drifting out of step.
    // The replacement characters assume this file and the output are UTF-8.
    $rules = array(
        '@<script[^>]*?>.*?</script>@si' => '',   // Strip out JavaScript code
        '@<!--[\s\S]*?--[ \t\n\r]*>@'    => '',   // Strip out HTML comments, including multi-line ones
        '@<[\/\!]*?[^<>]*?>@si'          => '',   // Strip out HTML tags
        '@([\r\n])[\s]+@'                => ' ',  // Collapse a line break plus following white space
        '@&(quot|#34|#034|#x22);@i'      => '"',  // Replace HTML entities
        '@&(lt|#60|#060|#x3c);@i'        => '<',  // Added hexadecimal values
        '@&(gt|#62|#062|#x3e);@i'        => '>',
        '@&(nbsp|#160|#xa0);@i'          => ' ',
        '@&(iexcl|#161);@i'              => '¡',
        '@&(cent|#162);@i'               => '¢',
        '@&(pound|#163);@i'              => '£',
        '@&(copy|#169);@i'               => '©',
        '@&(reg|#174);@i'                => '®',
        '@&(deg|#176);@i'                => '°',
        '@&(#39|#039|#x27);@'            => "'",
        '@&(euro|#8364);@i'              => '€',  // Europe
        '@&a(uml|UML);@'                 => 'ä',  // German
        '@&o(uml|UML);@'                 => 'ö',
        '@&u(uml|UML);@'                 => 'ü',
        '@&A(uml|UML);@'                 => 'Ä',
        '@&O(uml|UML);@'                 => 'Ö',
        '@&U(uml|UML);@'                 => 'Ü',
        '@&szlig;@i'                     => 'ß',
        // Decode &amp; last so that "&amp;lt;" becomes "&lt;" and not "<".
        '@&(amp|#38|#038|#x26);@i'       => '&',
    );

    $text = preg_replace(array_keys($rules), array_values($rules), $document);

    return trim($text);
}

You can parse the HTML using DOMDocument::loadHTMLFile and extract what you need.

$doc = new DOMDocument();
$doc->loadHTMLFile("data.html");
$metaTags = $doc->getElementsByTagName('meta');
// Process $metaTags
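
For form input, the HTML is usually a string, not a file. The sketch below applies the same DOM approach with `DOMDocument::loadHTML`, under a few assumptions: the helper name `domText` is hypothetical, the input is UTF-8, and only `<script>` and `<style>` elements need to be dropped before reading the text. It is one possible way to do this, not the answer's own code.

```php
<?php
// Hypothetical helper: extract readable text from an HTML string via DOM parsing.
function domText(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally; real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration is a common hint so loadHTML() reads the input as UTF-8.
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop elements whose content is not meant to be read as text.
    foreach (array('script', 'style') as $tag) {
        // Copy the live node list first; removing nodes while iterating it skips items.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // loadHTML() wraps fragments in <html><body>; read the text from <body>.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // textContent joins all remaining text nodes, with entities already decoded.
    return trim(preg_replace('/\s+/', ' ', $body->textContent));
}

echo domText('<p>Hello <b>world</b> &amp; friends</p><script>alert(1)</script>'), PHP_EOL;
// prints "Hello world & friends"
```

Unlike a regex-based approach, the parser handles nested or malformed tags and decodes every entity it knows, at the cost of building a full document tree.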
