
PHP regular expression help

I am using preg_replace to strip out <p> tags and <li> tags and replace them with carriage returns. I also have some <a> tags in my string, and I want to strip those out too, but keep the value of the href attribute. For instance, if I have: <a href = "http://www.example.com">Click Here</a>, what I want is: http://www.example.com Click Here

Here is what I have so far:

$text .= preg_replace(array("/<p[^>]*>/iU","/<\/p[^>]*>/iU","/<ul[^>]*>/iU","/<\/ul[^>]*>/iU","/<li[^>]*>/iU","/<\/li[^>]*>/iU"), array("","\r\n\r\n","","\r\n\r\n","","\r\n"), $content);

Thanks

If I were you, I would use SimpleHTMLDom. Here's a usage example from the docs:

// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; 
// Output: <div id="hello">foo</div><div id="world" class="bar">World</div>
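The docs example above does not touch anchors, so here is a rough sketch of how a parser-based approach could handle the case in the question. It deliberately swaps SimpleHTMLDom for PHP's bundled DOM extension (DOMDocument) so that it runs without extra files. The function name anchors_to_text is made up for illustration, and the sketch assumes ASCII input, since loadHTML without an encoding hint may mangle other characters.

```php
<?php
// Hypothetical helper (not from the thread): swap each <a> for "href text".
function anchors_to_text(string $content): string
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup quietly
    $doc = new DOMDocument();
    // Wrap the fragment so its nodes can be found again after parsing.
    $doc->loadHTML('<div>' . $content . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $doc->getElementsByTagName('div')->item(0);

    // Copy the live node list first, because replacing nodes mutates it.
    foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
        $text = $a->getAttribute('href') . ' ' . $a->textContent;
        $a->parentNode->replaceChild($doc->createTextNode($text), $a);
    }

    // Serialize only the wrapper's children, not the <html><body> scaffold.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo anchors_to_text('<p>See <a href = "http://www.example.com">Click Here</a> now.</p>');
```

The <p> and <li> handling from the question could then be applied to the returned string, or done on the DOM in the same loop style.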

If a regex solution is desired, here is a tested function which handles the anchor tags as you requested (with the caveats noted below). The regex is presented in verbose mode with comments:

function process_markup($content) {
    return preg_replace(
        array( // Regex patterns
            '%<(?:p|ul|li)\b[^>]*>%i',        // Open tags (\b keeps <pre>, <link> etc. from matching).
            '%<\/(?:p|ul|li)\b[^>]*>\s*%i',   // Close tags.
            '% # Match A element (with no "<>" in attributes!)
            <a\b         # Start tag name.
            [^>]+?       # anything up to HREF attribute.
            href\s*=\s*  # HREF attribute name and "="
            (["\']?)     # $1: Optional quote delimiter
            ([^>\s]+)    # $2: HREF attribute value.
            (?(1)\1)     # If open quote, match close quote.
            [^>]*>       # Remainder of start tag
            (.*?)        # $3: A element contents.
            </a\s*>      # A element end tag.
            %ix'
        ),
        array( // Replacement strings
            "",          # Simply strip P, UL, and LI open tags.
            "\r\n",      # Replace close tags with line endings.
            "$2 $3"      # Keep A element HREF value and contents.
        ), $content);
}

I took the liberty of modifying the other regexes as well. Adjust as necessary.
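To see how the quote handling in the anchor regex behaves, here is a small sketch that collapses the verbose pattern into one line and runs it over three href styles. The sample strings and variable names are my own, chosen for illustration.

```php
<?php
// The anchor pattern from process_markup(), written without verbose mode.
$pattern = '%<a\b[^>]+?href\s*=\s*(["\']?)([^>\s]+)(?(1)\1)[^>]*>(.*?)</a\s*>%i';

$samples = array(
    '<a href = "http://www.example.com">Click Here</a>',  // double-quoted value
    "<a href='http://www.example.com'>Click Here</a>",    // single-quoted value
    '<a href=http://www.example.com>Click Here</a>',      // unquoted value
);

$results = array();
foreach ($samples as $sample) {
    // $2 is the href value, $3 is the element contents.
    $results[] = preg_replace($pattern, '$2 $3', $sample);
}

print_r($results);
```

All three samples should come out as "http://www.example.com Click Here". As far as I can tell, group 1 always participates in the match (possibly capturing an empty string), so the conditional (?(1)\1) behaves the same as a plain \1 here. Note also that a quoted href value containing spaces would not match, because $2 excludes whitespace.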

CAVEATS: This regex solution assumes:

- All A, P, UL and LI elements have no angle brackets <> in their attributes.
- There are no A, P, UL or LI element start or end tags within any CDATA sections such as SCRIPT or STYLE elements, or HTML comments, or inside other start tag attributes.

Otherwise, this should work pretty well for a lot of HTML markup.
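The first caveat is easy to reproduce. The sketch below uses an open-tag pattern of the same shape as the one in the function and feeds it an attribute value containing ">". The inputs are invented for illustration.

```php
<?php
// Open-tag pattern of the same shape as the one used in process_markup().
$open_tag = '%<(?:p|ul|li)\b[^>]*>%i';

// Well-behaved attribute: the whole start tag is removed.
$ok = preg_replace($open_tag, '', '<p class="intro">Hello</p>');

// Attribute value containing ">": [^>]* stops at the first ">",
// so only part of the start tag is removed and debris is left behind.
$broken = preg_replace($open_tag, '', '<p title="a > b">Hello</p>');

var_dump($ok);
var_dump($broken);
```

In the second case the tail of the attribute ( b">) survives in the output, which is why markup like this needs a real parser rather than a regex.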

I realize that many wince when they hear the words HTML and REGEX spoken in the same breath, but in this particular case I think a regex solution will work quite well (within the above limitations). The A element is one of those that cannot be nested, so a regex can easily match the start tag, contents and end tag all in one whack. The same holds for the individual start and end tags of the other elements (which can be nested) when each tag is considered independently.
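A short sketch of the nesting point, with made-up input strings: a lazy match pairs each A start tag with its own end tag, while the same trick applied to a nestable element such as UL stops at the first (inner) end tag.

```php
<?php
// Anchors do not nest, so a lazy match pairs each start tag with its own end tag.
$links = '<a href="http://a.example">one</a> and <a href="http://b.example">two</a>';
preg_match_all('%<a\b[^>]*>(.*?)</a\s*>%i', $links, $m);
$anchor_contents = $m[1];

// UL elements can nest, so the same approach closes on the inner </ul>
// and captures only part of the outer list.
$nested = '<ul><li>a<ul><li>b</li></ul></li></ul>';
preg_match('%<ul\b[^>]*>(.*?)</ul\s*>%is', $nested, $n);
$ul_contents = $n[1];

print_r($anchor_contents);
var_dump($ul_contents);
```

This is why the function above only strips the P, UL and LI tags one at a time instead of trying to match those elements as whole units.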
