
Trying to remove HTML tags (+ content) from String

OK, so basically I'm about to bang my head against the wall with this one.

Here's the code :

<?php

$s = "385,178<ref name=\"land area\">Data is accessible by following \"Create tables and diagrams\" link on the following site, and then using table 09280 \"Area of land and fresh water (km²) (M)\" for \"The whole country\" in year 2013 and summing up entries \"Land area\" and \"Freshwater\": {{cite web |url=http://www.ssb.no/en/natur-og-miljo/statistikker/arealdekke |title=Area of land and fresh water, 1 January 2013 |publisher=[[Statistics Norway]] |date=28 May 2013 |accessdate=23 November 2013}}</ref>";

function removeHTMLTags($str) { 
    $r = '/(\\<br\\>)|(\\<br\/\\>)|(\\<(.+?)(\\s*[^\\<]+)?\\>(.+)?\\<\\\\\/\\1\\>)|(\\<ref\\sname=([^\\<]+?)\/\\>)/';

    echo "Preg_matching : $str\n\n";
    echo "Regex : $r\n\n";

    return preg_replace($r,'',$str); 
}

echo removeHTMLTags($s);

?>

What I'm trying to do is get rid of the <ref name="... </ref> part, content included (and any other tags as well).

However, this is what I'm getting (i.e. exactly the same string, with nothing replaced whatsoever):

Preg_matching : 385,178<ref name="land area">Data is accessible by following "Create tables and diagrams" link on the following site, and then using table 09280 "Area of land and fresh water (km²) (M)" for "The whole country" in year 2013 and summing up entries "Land area" and "Freshwater": {{cite web |url=http://www.ssb.no/en/natur-og-miljo/statistikker/arealdekke |title=Area of land and fresh water, 1 January 2013 |publisher=[[Statistics Norway]] |date=28 May 2013 |accessdate=23 November 2013}}</ref>

Regex : /(\<br\>)|(\<br\/\>)|(\<(.+?)(\s*[^\<]+)?\>(.+)?\<\\\/\1\>)|(\<ref\sname=([^\<]+?)\/\>)/

385,178<ref name="land area">Data is accessible by following "Create tables and diagrams" link on the following site, and then using table 09280 "Area of land and fresh water (km²) (M)" for "The whole country" in year 2013 and summing up entries "Land area" and "Freshwater": {{cite web |url=http://www.ssb.no/en/natur-og-miljo/statistikker/arealdekke |title=Area of land and fresh water, 1 January 2013 |publisher=[[Statistics Norway]] |date=28 May 2013 |accessdate=23 November 2013}}</ref>

So, the question is: what am I doing wrong? (I've tested the regex with RegExr multiple times, and it does seem to work there. Am I messing it up with the... escapes?)


PS: For those of you who know what I'm talking about: yep, that's a portion of a Wikipedia Infobox.
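As for why nothing is replaced, two things in the pattern look like the cause. Inside a single-quoted PHP string, `\\\\\/` collapses to `\\\/`, which asks for a literal backslash followed by a slash (the echoed regex shows exactly that), so `</ref>` can never match. On top of that, `\1` refers to the first group, `(\<br\>)`, not to the tag name, which is captured by the fourth group. Below is a minimal sketch of a simplified pattern. The function name `stripTagsWithContent` and the pattern itself are illustrative assumptions rather than a drop-in fix, and nested tags of the same name are not handled.

```php
<?php
// Hypothetical helper: removes tags together with their content.
function stripTagsWithContent(string $str): string
{
    // '~' as delimiter means '/' needs no escaping; single quotes keep '\w' and '\1' intact.
    $pattern = '~<\w+\b[^>]*/>'          // self-closing tags, e.g. <ref name="x"/> or <br/>
             . '|<br\s*>'                // bare <br>
             . '|<(\w+)\b[^>]*>.*?</\1>' // paired tags plus content; \1 is the tag name
             . '~is';                    // i: case-insensitive, s: '.' also matches newlines

    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '', $str) ?? $str;
}

$sample = '385,178<ref name="land area">Data is accessible {{cite web |title=Area}}</ref>';
echo stripTagsWithContent($sample), "\n"; // prints: 385,178
```

With only one capturing group in the pattern, `\1` unambiguously means the tag name, which sidesteps the group-numbering problem above.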

You really should use the DOM for this kind of stuff, because other solutions tend to break easily:

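$dom = new DOMDocument();
// Collect libxml warnings (e.g. about the unknown <ref> tag) instead of emitting them
$errorState = libxml_use_internal_errors(true);
$dom->loadHTML($s);

$xpath = new DOMXPath($dom);
// First text node under body/p, i.e. the part that precedes <ref>
$node = $xpath->query('//body/p/text()')->item(0);
echo $node->textContent;

// Restore the previous libxml error-handling setting
libxml_use_internal_errors($errorState);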
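If the text around the tags is not always the first text node, a variation on the same idea is to delete every `<ref>` element and read whatever text remains. The following is a sketch under assumptions: the helper name `stripRefsWithDom` is made up for illustration, the input is assumed to be a non-empty fragment, and non-ASCII characters in the remaining text may need an explicit encoding declaration before `loadHTML()`, which is left out here.

```php
<?php
// Hypothetical helper: drops all <ref> elements and returns the remaining text.
function stripRefsWithDom(string $html): string
{
    // Collect libxml warnings about the unknown <ref> tag instead of emitting them
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Copy the result into an array before removing nodes from the tree
    foreach (iterator_to_array($xpath->query('//ref')) as $ref) {
        $ref->parentNode->removeChild($ref);
    }

    // Text of the whole parsed document, now without any <ref> content
    return $dom->documentElement->textContent;
}

echo stripRefsWithDom('385,178<ref name="land area">Data {{cite web |title=Area}}</ref>'), "\n";
```

Reading `textContent` from the document element avoids depending on exactly how the parser wraps the fragment.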
