Skipping HTML tags in a PHP regular expression

Question

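I'm a stickler for correct-ish English (yes, I know "stickler" and "correct-ish" are contradictory). I've built a CMS for my company's website, but one thing is getting on my nerves: creating "smart" quotes in the published content.

I have a regex that does this, but I run into problems when I hit HTML tags in the copy. For example, a published story in my CMS might contain a bunch of plain text plus some HTML tags, such as link tags, which contain quotes that I don't want changed to "smart" quotes, for obvious reasons.

Fifteen years ago I was a Perl regex ace, but I'm drawing a complete blank on this one. What I want to do is process a string, ignoring all text inside HTML tags, replace all the quotes in the string with "smart" quotes, and then return the string with its HTML tags intact.

I have a function I cobbled together to handle the most common cases I've run into in the CMS, but I hate that it's ugly and inelegant, and that my solution breaks completely if an unforeseen tag comes along.

Here's the code (please don't laugh, it was bashed out over half a bottle of Scotch):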

function educate_quotes($string) {
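        $pattern = array('/\b"/',//right double: quote directly after a word character
                        '/"\b/',//left double: quote directly before a word character
                        '/"/',//any remaining double quote, treated as right double
                        "/(\w+)'(\w+)/",//apostrophe
                        "/\b'/",//right single: quote directly after a word character
                        "/'\b/",//left single: quote directly before a word character
                        "/'$/",//right single at end of string
                        "/--/"//em dash
                        );

        $replace = array("&#8221;",//right double quote
                        "&#8220;",//left double quote
                        "&#8221;",//right double quote (remaining quotes)
                        "$1"."&#8217;"."$2",//apostrophe
                        "&#8217;",//right single quote
                        "&#8216;",//left single quote
                        "&#8217;",//right single quote at end of string
                        "&#8212;"//em dash (U+2014); &#151; is not a valid character reference
                        );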

        $string =  preg_replace($pattern,$replace,$string);
        //remove smart quotes around urls
        $string = preg_replace("/href=&#8220;(.+)&#8221;/","href=\"$1\"",$string);
        //remove smart quotes around images
        $string = preg_replace("/src=&#8220;(.+?)&#8221;/","src=\"$1\" ",$string);
        //remove smart quotes around alt tags
        $string = str_replace('alt=&#8221;"','',$string);
        $pat = "/alt=&#8220;(.+?)&#8221;/is";
        $rep = "alt=\"$1\" ";
        $string = preg_replace($pat,$rep,$string);
        //i'm too lazy to figure out why this artifact keeps appearing
        $string = str_replace("alt=&#8220;",'alt="',$string);
        //same thing here
        $string = preg_replace("/&#8221; target/","\" target",$string);
        return $string;
    }

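As I said, I know the code is ugly, and I'm open to more elegant solutions. It works, but it will break in the future if an unforeseen tag comes along. For the record, I want to reiterate that I'm not trying to get a regular expression to PARSE HTML tags; I'm trying to get it to IGNORE them while it processes all the rest of the text in the string.

Is there a solution? I've done a lot of searching online and can't seem to find one, and, regrettably, I'm no longer familiar with PHP's implementation of regular expressions.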

Answer 1

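OK. After Slacks suggested DOM parsing, I've answered my own question, but now I have the problem that the regex doesn't work on the string that gets created. Here's my code: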

function educate_quotes($string) {  
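        $pattern = array(
            '/"(\w+)"/',//double quotes around a single word
            "/(\w+)'(\w+)/",//apostrophe
            "/'(\w+)'/",//single quotes around a single word
            "/'\b/",//left single: quote directly before a word character
            "/--/"//em dash
        );

        $replace = array(
            "&#8220;"."$1"."&#8221;",//left double, word, right double
            "$1"."&#8217;"."$2",//apostrophe
            "&#8216;"."$1"."&#8217;",//left single, word, right single
            "&#8216;",//left single
            "&#8212;"//em dash (U+2014); &#151; is not a valid character reference
        );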

        $xml = new DOMDocument();
        $xml->loadHTML($string);
        $text = (string)$xml->textContent;
        $smart = preg_replace($pattern,$replace,$text);
        $xml->textContent = $smart; 
        $html = $xml->saveHTML();
        return $html;
    }

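The DOM parsing works fine; the problem now is that my regex (I've changed it from the one above, but only after the one above had already failed to handle the newly created string) doesn't actually replace any of the quotes in the string.

Also, when the string contains imperfect HTML, I get the following annoying warning: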

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Unexpected end tag : p in Entity, line: 2 in /home/leifw/now/cms_functions.php on line 418

Since I can't count on reporters always writing perfect HTML, this is a problem as well.
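A possible way around both problems, offered as a sketch rather than a tested fix for this CMS: `textContent` is the concatenated text of the whole document with the markup stripped, so writing one string back cannot restore the tags. Editing each text node in place keeps the tags and attributes untouched, and `libxml_use_internal_errors(true)` collects the parse warnings instead of printing them. The function name `smarten_text_nodes` and the simplified two-step quote rule are assumptions made for illustration. The sketch inserts real Unicode quote characters rather than entity strings, because `saveHTML()` escapes `&` in text nodes.

```php
<?php
// Sketch only: smarten_text_nodes() is a hypothetical name, not from the post.
function smarten_text_nodes(string $html): string
{
    // Collect libxml parse errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Visit text nodes only; tag names and attribute values are never touched.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $node) {
        $text = $node->nodeValue;
        // Assumed, simplified rule: a straight quote directly before a word
        // character opens; every other straight quote closes.
        $text = preg_replace('/"(?=\w)/u', "\u{201C}", $text);
        $text = str_replace('"', "\u{201D}", $text);
        $node->nodeValue = $text;
    }

    // saveHTML() returns a full document, including the doctype and the
    // html/body wrapper that libxml adds around a fragment.
    return $doc->saveHTML();
}
```

Two caveats: the `//text()` query also visits text inside `script` and `style` elements, and the returned string is a whole document rather than the original fragment, so a real implementation would need to filter the former and unwrap the latter.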

Answer 2

Would it be possible to split the string on the HTML < > tags and then put it back together?

$text = "<div sdfas=\"sdfsd\" >ksdfsdf\"dfsd\" dfs </div> <span sdf='dsfs'> dfsd 'dsf ds' </span> ";
$new_text = preg_split("/(<.*?>)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE);
echo htmlspecialchars(print_r($new_text, 1));

So what you get is:

Array
(
    [0] => 
    [1] => <div sdfas="sdfsd" >
    [2] => ksdfsdf"dfsd" dfs 
    [3] => </div>
    [4] =>  
    [5] => <span sdf='dsfs'>
    [6] =>  dfsd 'dsf ds' 
    [7] => </span>
    [8] =>  
)

Then what you could do is put the whole thing back together, running preg_replace on each piece only if it doesn't contain < >.
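The reassembly step could be sketched as follows. With `PREG_SPLIT_DELIM_CAPTURE` and a single capture group, the result alternates between text and tag, so the text pieces sit at the even indices, as the array above shows. The function name `convert_outside_tags` and the callback parameter are assumptions made for illustration; any quote-converting function can be passed in.

```php
<?php
// Sketch only: convert_outside_tags() is a hypothetical helper name.
function convert_outside_tags(string $html, callable $convert): string
{
    // Tags are kept in the result because the pattern captures them.
    $parts = preg_split('/(<.*?>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        // Even indices hold the text between tags; odd indices hold the tags.
        if ($i % 2 === 0) {
            $parts[$i] = $convert($part);
        }
    }

    // Glue the untouched tags and the converted text back together.
    return implode('', $parts);
}

// Usage: strtoupper stands in for a real quote converter.
echo convert_outside_tags('<a href="x">hi "there"</a>', 'strtoupper');
```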

Answer 3

Using A. Lau's suggestion, I think I have a solution, and it turned out to actually be regex after all, not an XML parser.

Here's my code:

$string = '<p>"This" <b>is</b> a "string" with <a href="http://somewhere.com">quotes</a> in it. <img src="blah.jpg" alt="This is an alt tag"></p><p>Whatever, you know?</p>';

// Split on tags; PREG_SPLIT_DELIM_CAPTURE keeps the tags themselves in the result.
$new_string = preg_split("/(<.*?>)/",$string, -1, PREG_SPLIT_DELIM_CAPTURE);

echo "<pre>";
print_r($new_string);
echo "</pre>";

// Convert only the pieces that are not tags.
for($i=0;$i<count($new_string);$i++) {
    $str = $new_string[$i];
    if ($str) {
        if (strpos($str,"<") === false) {
            $new_string[$i] = convert_quotes($str);
        }
    }
}

$str = join('',$new_string);
echo $str;

function convert_quotes($string) {
    $pattern = array('/\b"/',//right double: quote directly after a word character
                '/"\b/',//left double: quote directly before a word character
                '/"/',//any remaining double quote, treated as right double
                "/(\w+)'(\w+)/",//apostrophe
                "/\b'/",//right single: quote directly after a word character
                "/'\b/",//left single: quote directly before a word character
                "/'$/",//right single at end of string
                "/--/"//em dash
                );

    $replace = array("&#8221;",//right double quote
                "&#8220;",//left double quote
                "&#8221;",//right double quote (remaining quotes)
                "$1"."&#8217;"."$2",//apostrophe
                "&#8217;",//right single quote
                "&#8216;",//left single quote
                "&#8217;",//right single quote at end of string
                "&#8212;"//em dash (U+2014); &#151; is not a valid character reference
                );
    return preg_replace($pattern,$replace,$string);
}

The code outputs the following:

Array
(
    [0] => 
    [1] => <p>
    [2] => "This" 
    [3] => <b>
    [4] => is
    [5] => </b>
    [6] =>  a "string" with 
    [7] => <a href="http://somewhere.com">
    [8] => quotes
    [9] => </a>
    [10] =>  in it. 
    [11] => <img src="blah.jpg" alt="This is an alt tag">
    [12] => 
    [13] => </p>
    [14] => 
    [15] => <p>
    [16] => Whatever, you know?
    [17] => </p>
    [18] => 
)

“This” is a “string” with quotes in it. This is an alt tag

Whatever, you know?
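As an alternative to splitting and rejoining, the same "ignore the tags" behaviour can be sketched in a single pass with the PCRE backtracking verbs `(*SKIP)(*FAIL)`, which PHP's regex engine supports: the first alternative matches a whole tag and then discards it, so the second alternative only ever sees text outside tags. This is a sketch under assumptions, not the method used in the post: the function name `smart_quotes_skip_tags` is hypothetical, the rule set is deliberately reduced to double quotes and the em dash, and the tag pattern `<[^>]*>` assumes no `>` appears inside an attribute value.

```php
<?php
// Sketch only: smart_quotes_skip_tags() is a hypothetical name.
function smart_quotes_skip_tags(string $html): string
{
    // Match a whole tag, then throw the match away and resume after it.
    $skipTag = '<[^>]*>(*SKIP)(*FAIL)|';

    $rules = [
        '/' . $skipTag . '"(?=\w)/' => '&#8220;', // quote directly before a word: opening
        '/' . $skipTag . '"/'       => '&#8221;', // any remaining quote outside tags: closing
        '/' . $skipTag . '--/'      => '&#8212;', // em dash
    ];

    // preg_replace applies the pattern/replacement pairs in order.
    return preg_replace(array_keys($rules), array_values($rules), $html);
}

echo smart_quotes_skip_tags('<p>"This" is <a href="http://somewhere.com">a "link"</a> -- ok</p>');
// prints: <p>&#8220;This&#8221; is <a href="http://somewhere.com">a &#8220;link&#8221;</a> &#8212; ok</p>
```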

3 answers

Answer 1: 0 2016-09-09 03:09:39
Answer 2: 0 2016-09-09 07:00:23
Answer 3: 0 2016-09-10 04:50:00
