
PHP regular expression to match a string only if it is NOT inside an HTML tag

I'm trying to solve this bug in Drupal's Hashtags module: http://drupal.org/node/1718154

I've got this function that matches every word in my text that is prefixed by "#", like #tag:

function hashtags_get_tags($text) {
    $tags_list = array();
    $pattern = "/#[0-9A-Za-z_]+/";
    preg_match_all($pattern, $text, $tags_list);
    $result = implode(',', $tags_list[0]);
    return $result;
}

I need to ignore internal links in pages, such as <a href="#reference">link</a>, or, more generally, any word prefixed by # that appears inside an HTML tag (so preceded by < and followed by >).

Any idea how I can achieve this?

Can you strip the tags first, before matching (using the strip_tags function)?

function hashtags_get_tags($text) {

    $text = strip_tags($text);

    $tags_list = array();
    $pattern = "/#[0-9A-Za-z_]+/";
    preg_match_all($pattern, $text, $tags_list);
    $result = implode(',', $tags_list[0]);
    return $result;
}
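
As a quick check of the strip_tags approach, the sketch below runs the same pattern over a sample string. The function name hashtags_from_html and the sample text are assumptions made for illustration. The html_entity_decode step is an optional addition that is not part of the answer above; it is included because a numeric entity such as &#039; would otherwise be picked up as the tag #039.

```php
<?php
// Hypothetical helper (not part of the Hashtags module):
// strip tags, decode entities, then collect hashtags.
function hashtags_from_html($text) {
    // Remove HTML tags so that values like href="#reference" disappear.
    $text = strip_tags($text);
    // Decode entities so that "&#039;" is not read as the tag "#039".
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $matches = array();
    preg_match_all('/#[0-9A-Za-z_]+/', $text, $matches);
    return implode(',', $matches[0]);
}

$sample = 'It&#039;s #drupal, see <a href="#reference">link</a> and #php_5';
echo hashtags_from_html($sample), "\n";
// prints: #drupal,#php_5
```

Note that decoding entities also turns escaped markup such as &lt;a href="#x"&gt; back into literal text, so whether to keep that step depends on how the input was produced.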

A regular expression is going to be tricky if you want to only match hashtags that are not inside an HTML tag.

You could throw out the tags beforehand using preg_replace:

function hashtags_get_tags($text) {
    $tags_list = array();
    $pattern = "/#[0-9A-Za-z_]+/";
    $text = preg_replace("/<[^>]*>/", "", $text);
    preg_match_all($pattern, $text, $tags_list);
    $result = implode(',', $tags_list[0]);
    return $result;
}
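
If removing the tags first is not desirable, a negative lookahead can approximate "not inside a tag" in a single pattern. This is only a sketch: the function name hashtags_outside_tags is hypothetical, and the pattern assumes reasonably well-formed markup. In particular, a literal > in plain text after a hashtag (for example "#a > b") would cause that hashtag to be skipped.

```php
<?php
// Hypothetical variant: skip hashtags that sit inside an HTML tag.
function hashtags_outside_tags($text) {
    // (?![^<>]*>) rejects a match when a ">" follows before any "<",
    // which is the case for text located between "<" and ">".
    $pattern = '/#[0-9A-Za-z_]+(?![^<>]*>)/';
    $matches = array();
    preg_match_all($pattern, $text, $matches);
    return implode(',', $matches[0]);
}

echo hashtags_outside_tags('Hello #world <a href="#reference">link</a> #php_5'), "\n";
// prints: #world,#php_5
```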

I made this function using PHP DOM.

It returns all links that do not have # anywhere in the href.

If you want it to drop only internal links (an href that starts with #), replace this line:

if(strpos($link->getAttribute('href'), '#') === false) {

with this:

if(strpos($link->getAttribute('href'), '#') !== 0) {

This is the function:

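function no_hashtags($text) {
    // Parse the markup so links can be inspected as DOM nodes.
    $doc = new DOMDocument();
    $doc->loadHTML($text);
    $links = $doc->getElementsByTagName('a');
    $nohashes = array();
    foreach($links as $link) {
        // Keep only links whose href contains no "#".
        if(strpos($link->getAttribute('href'), '#') === false) {
            // Copy the link into a scratch document to serialize it on its own.
            $temp = new DOMDocument();
            $elem = $temp->importNode($link->cloneNode(true), true);
            $temp->appendChild($elem);
            $nohashes[] = $temp->saveHTML();
        }
    }
    // Return the kept links as one HTML string
    // (use implode(',', $nohashes) or return $nohashes for other formats).
    return implode('', $nohashes);
}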
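
The function above returns links rather than hashtags. To get the hashtags themselves with the same DOM approach, one option is to parse the markup and run the original pattern over the text content only, since attribute values such as href are not part of an element's text. This is a sketch under assumptions: hashtags_from_dom is a hypothetical name, the dom extension is assumed to be available, and non-ASCII input may need an explicit encoding hint because loadHTML does not assume UTF-8.

```php
<?php
// Hypothetical helper: parse the markup and match hashtags in text only.
function hashtags_from_dom($html) {
    // loadHTML rejects an empty string, so return early.
    if (trim($html) === '') {
        return '';
    }
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // textContent concatenates text nodes; attribute values are excluded.
    $root = $doc->documentElement;
    $text = ($root === null) ? '' : $root->textContent;

    $matches = array();
    preg_match_all('/#[0-9A-Za-z_]+/', $text, $matches);
    return implode(',', $matches[0]);
}

echo hashtags_from_dom('Hello #world <a href="#reference">link</a> #php_5'), "\n";
// prints: #world,#php_5
```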
