
How to extract all text from an HTML file using PHP?

How can I extract all the text from an HTML file?

I want to extract all text: the values of alt attributes, the contents of <p> tags, and so on.

However, I don't want to extract the text between style and script tags.

Thanks

Right now I have the following code:

    <?PHP
    $string =  trim(clean(strtolower(strip_tags($html_content))));
    $arr = explode(" ", $string);
    $count = array_count_values($arr);
    foreach($count as $value => $freq) {
          echo trim ($value)."---".$freq."<br>";
    }

    function clean($in){
           return preg_replace("/[^a-z]+/i", " ", $in);
    }

    ?>

This works great, but it also retrieves the contents of script and style tags, which I don't want. The other problem is that I'm not sure whether it retrieves attributes like alt, since the strip_tags function might remove all HTML tags together with their attributes.
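
To make the two problems concrete, here is a small check of what strip_tags() does with a made-up snippet (the markup and the file name cat.png are only illustrative). The script body should survive while the alt text is lost along with its tag:

```php
<?php
// Illustrative input: one paragraph, one image with alt text, one script block.
$html = '<p>Hello</p><img src="cat.png" alt="A cat"><script>var x = 1;</script>';

// strip_tags() removes the tags themselves (attributes included),
// but keeps whatever text sits between an opening and a closing tag.
$stripped = strip_tags($html);

echo $stripped, "\n";
```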

Thanks

I personally think you should switch to an XML reader of some sort (SimpleXML, Document Object Model or XMLReader) to parse the HTML document. I'd go for a mix of DOM, SimpleXML and XPath to extract what you need; everything else will fail miserably when parsing arbitrary documents:

    $dom = new DOMDocument();
    $dom->loadHTML($html_content); // use DOMDocument because it can load HTML
    $xml = simplexml_import_dom($dom); // switch to SimpleXML because it's easier to use
    $pTags = $xml->xpath('/html/body//p');
    $tagsWithAltAttribute = $xml->xpath('/html/body//*[@alt]');
    // ...
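
As a rough sketch of where this could go, the snippet below uses DOMXPath directly to collect every text node that is not inside script or style, plus every alt attribute value. The function name extract_text() and the sample markup are my own assumptions, not part of the answer above, and the libxml error suppression is only there because real-world HTML tends to trigger parser warnings.

```php
<?php
// Hypothetical helper: visible text plus alt values, skipping script/style.
function extract_text(string $html): string
{
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Text nodes with no script/style ancestor, plus every alt attribute.
    $nodes = $xpath->query(
        '//text()[not(ancestor::script) and not(ancestor::style)] | //@alt'
    );

    $parts = [];
    foreach ($nodes as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode(' ', $parts);
}

// Illustrative input only.
$html = '<html><head><style>p { color: red; }</style></head><body>'
      . '<p>Hello world</p><img src="a.png" alt="A cat">'
      . '<script>var x = 1;</script></body></html>';

echo extract_text($html), "\n";
```

The result can then be fed into the word-counting code from the question in place of the strip_tags() output.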

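First remove the script and style tags together with all of their content, then apply your current tag-cleaning approach, and you will get the text.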

First, you can search for the <script> and <style> blocks and remove them from the HTML.

I have this function that I use a lot:

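    // Return every block in $string that starts with $start and ends with $end.
    // With $borders = true the delimiters are included in each match.
    function search($start, $end, $string, $borders = true) {
        // "!" is the regex delimiter, so pass it to preg_quote() as well.
        $reg = "!" . preg_quote($start, "!") . "(.*?)" . preg_quote($end, "!") . "!is";
        preg_match_all($reg, $string, $matches);

        if ($borders) return $matches[0];
        else return $matches[1];
    }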

The function returns the matching blocks as an array.

    $array = search("<script>", "</script>", $html);

Once the script and style blocks are gone, use strip_tags to get the text.
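
Here is a minimal sketch of the removal step, since search() only finds the blocks and does not delete them. The function is repeated so the snippet runs on its own, and the sample markup is made up. Note that the literal "<script>" delimiter will not match opening tags that carry attributes, such as a script tag with a type attribute.

```php
<?php
// Same idea as the search() function above, repeated so this runs standalone.
function search($start, $end, $string, $borders = true) {
    $reg = "!" . preg_quote($start, "!") . "(.*?)" . preg_quote($end, "!") . "!is";
    preg_match_all($reg, $string, $matches);
    return $borders ? $matches[0] : $matches[1];
}

// Illustrative input only.
$html = '<p>Hello</p><script>var x = 1;</script><style>p{color:red}</style><p>World</p>';

// Collect the script and style blocks, delimiters included.
$blocks = array_merge(
    search("<script>", "</script>", $html),
    search("<style>", "</style>", $html)
);

// Cut each block out of the markup, then strip the remaining tags.
$clean = str_replace($blocks, " ", $html);
$text  = trim(strip_tags($clean));

echo $text, "\n";
```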

Any kind of parsing is not an option as long as you can't be sure the source is 100% well-formed XML (which HTML4, by definition, is not).

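A simple preg_replace should suffice. Something like

    $html = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1\s*>/is', '', $html);

should be enough to replace all the script and style elements and their contents with an empty string (i.e. strip them). The non-greedy `.*?` together with the `s` modifier lets the pattern match each element separately, even when it spans several lines.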
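
Combined with the word count from the question, this could look roughly like the sketch below. The function name word_frequencies() and the sample input are assumptions of mine. The pattern here is non-greedy and uses the s modifier so that blocks spanning several lines are matched one at a time.

```php
<?php
// Hypothetical helper: drop script/style elements, then count words.
function word_frequencies(string $html): array
{
    // Remove each script/style element together with its contents.
    $html = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1\s*>/is', ' ', $html);
    $text = strtolower(strip_tags($html));
    // Split on anything that is not a letter and discard empty pieces.
    $words = preg_split('/[^a-z]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

// Illustrative input only: the script block spans several lines.
$html = "<p>Red fish, blue fish</p>\n<script>\nvar fish = 1;\n</script>";

foreach (word_frequencies($html) as $word => $freq) {
    echo $word, "---", $freq, "\n";
}
```

Splitting with PREG_SPLIT_NO_EMPTY avoids counting the empty strings that explode(" ", ...) produces when several spaces follow each other.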

If you want to avoid XSS attacks, however, you're probably better off using an HTML sanitiser to normalise the HTML and then strip all the bad code.

I posted this as an answer to another post, but here it is again:

We've just launched a new natural language processing API over at repustate.com. It is a REST API (so plain curl will do), and you can use it to clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it to your heart's content. Check it out and compare the results to readability.js; I think you'll find they're almost 100% the same.
