Skipping HTML tags in PHP regex

Question

I'm a stickler for correct-ish English (yes, I know "stickler" and "correct-ish" are an oxymoron). I have created a CMS for use on my company's sites, but there is one thing that is really on my nerves - creating "smart" quotes in the published content.

I have a regex that does it, but I run into problems when I encounter HTML tags in the copy. For instance, one of the published stories used by my CMS may contain a bunch of plain text and a few HTML tags, such as a link tag, which contains quotation marks that I do NOT want to change to "smart" quotes for obvious reasons.

15 years ago, I was a Perl regex ace, but I'm totally drawing a blank on this one. What I want to do is process a string, ignoring all text inside HTML tags, replace all quotes in the rest of the string with "smart" quotes, then return the string with its HTML tags intact.

I have a function that I cobbled together to handle the most common scenarios I face with the CMS, but I hate that it's ugly and not elegant at all, and that if unforeseen tags come up, my solution completely breaks.

Here's the code (please don't laugh, it was slammed together over half a bottle of Scotch):

function educate_quotes($string) {
    // Patterns run in order; each one pairs with the same index in $replace.
    $pattern = array(
        '/\b"/',          // right double (quote after a word character)
        '/"\b/',          // left double (quote before a word character)
        '/"/',            // any double quote still left: treat as right double
        "/(\w+)'(\w+)/",  // apostrophe inside a word
        "/\b'/",          // right single
        "/'\b/",          // left single
        "/'$/",           // right single at end of string
        "/--/"            // em dash
    );

    $replace = array(
        "&#8221;",        // right double
        "&#8220;",        // left double
        "&#8221;",        // right double
        '$1&#8217;$2',    // apostrophe
        "&#8217;",        // right single
        "&#8216;",        // left single
        "&#8217;",        // right single
        "&#8212;"         // em dash (U+2014); &#151; is its Windows-1252 code
    );

    $string = preg_replace($pattern, $replace, $string);
    // remove smart quotes around urls (lazy match, so one href cannot swallow the next)
    $string = preg_replace("/href=&#8220;(.+?)&#8221;/", "href=\"$1\"", $string);
    // remove smart quotes around images
    $string = preg_replace("/src=&#8220;(.+?)&#8221;/", "src=\"$1\" ", $string);
    // remove smart quotes around alt attributes
    $string = str_replace('alt=&#8221;"', '', $string);
    $pat = "/alt=&#8220;(.+?)&#8221;/is";
    $rep = "alt=\"$1\" ";
    $string = preg_replace($pat, $rep, $string);
    // I'm too lazy to figure out why this artifact keeps appearing
    $string = str_replace("alt=&#8220;", 'alt="', $string);
    // same thing here
    $string = preg_replace("/&#8221; target/", "\" target", $string);
    return $string;
}

Like I said, I know the code is ugly, and I'm open to more elegant solutions. It works, but in the future, it will break if unforeseen tags come along. For the record, I want to reiterate that I'm not trying to get a regex to PARSE html tags; I'm trying to get it to IGNORE them while parsing all the rest of the text in the string.

Any solutions? I've done a LOT of online searching and can't seem to find the solution, and I'm unfamiliar enough with PHP's implementation of regex that it's consternating.

Answer 1

OK. I sort of answered my own question after Slacks suggested DOM parsing, but now I have the problem that the regex isn't working on the strings it creates. Here's my code:

function educate_quotes($string) {
    $pattern = array(
        '/"(\w+)"/',      // double quotes around a single word
        "/(\w+)'(\w+)/",  // apostrophe inside a word
        "/'(\w+)'/",      // single quotes around a single word
        "/'\b/",          // left single
        "/--/"            // em dash
    );

    $replace = array(
        '&#8220;$1&#8221;',  // left double, word, right double
        '$1&#8217;$2',       // apostrophe
        '&#8216;$1&#8217;',  // left single, word, right single (the entities were swapped)
        "&#8216;",           // left single
        "&#8212;"            // em dash (U+2014); &#151; is its Windows-1252 code
    );

    $xml = new DOMDocument();
    $xml->loadHTML($string);
    $text = (string)$xml->textContent;
    $smart = preg_replace($pattern, $replace, $text);
    $xml->textContent = $smart;
    $html = $xml->saveHTML();
    return $html;
}

The DOM parsing is working fine. The issue now is that my regex isn't actually replacing any of the quotation marks in the strings. (I changed it from the one above, but only after that one had already failed on the new strings.)
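One way to keep the tags while still using DOM parsing is to rewrite each text node in place, rather than reading the flattened textContent of the whole document. The sketch below is only an illustration, not the code from this thread: educate_text_nodes() and the $convert callback are hypothetical names, the quote rule is deliberately simplistic, and it assumes ASCII or entity-encoded input because loadHTML's charset handling is not covered here. Note also that the pattern '/"(\w+)"/' above can only match a single quoted word, which may be part of why nothing was replaced.

```php
<?php
// Hypothetical helper: runs $convert on every text node, leaving tags alone.
function educate_text_nodes(string $html, callable $convert): string
{
    $doc = new DOMDocument();
    // Wrap the fragment so there is exactly one element to read back later.
    $doc->loadHTML('<div>' . $html . '</div>');

    // Select text nodes, skipping script and style contents.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');
    foreach ($nodes as $node) {
        $node->nodeValue = $convert($node->nodeValue);
    }

    // Serialize only the children of the wrapper, not the implied <html><body>.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Illustrative rule only. It inserts real Unicode quote characters, because
// a literal "&#8220;" written into a text node would be escaped on output.
$convert = function (string $text): string {
    return preg_replace('/"([^"]*)"/u', "\u{201C}" . '$1' . "\u{201D}", $text);
};

echo educate_text_nodes('<p>"This" is a <a href="http://somewhere.com">link</a>.</p>', $convert);
```

Whether the curly quotes come back as raw characters or as entities depends on how the document is serialized, so treat the exact output as something to check in your own environment.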

Also, I'm getting the following annoying warnings when there is imperfect HTML code in the string:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Unexpected end tag : p in Entity, line: 2 in /home/leifw/now/cms_functions.php on line 418

Since I can't count on the reporters to always use perfect HTML code, that's a problem, too.
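For the warnings, one commonly used option is to have libxml collect its parse errors internally instead of raising PHP warnings. This is a minimal sketch under that assumption; the sample markup and the $messages variable are made up for illustration.

```php
<?php
// Ask libxml to buffer parse errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<div>Unclosed paragraph</p></div>'); // deliberately sloppy markup

$errors = libxml_get_errors();          // array of LibXMLError objects (may be empty)
libxml_clear_errors();                  // free the buffered errors
libxml_use_internal_errors($previous);  // restore the earlier setting

// Keep the messages around in case they are worth logging.
$messages = array();
foreach ($errors as $error) {
    $messages[] = trim($error->message);
}
```

The document is still parsed on a best-effort basis, so the quote conversion can proceed even when reporters submit imperfect HTML.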

Answer 2

Is it possible to split based on html < > tags and then piece it back together?

$text = "<div sdfas=\"sdfsd\" >ksdfsdf\"dfsd\" dfs </div> <span sdf='dsfs'> dfsd 'dsf ds' </span> ";
$new_text = preg_split("/(<.*?>)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE);
echo htmlspecialchars(print_r($new_text, 1));

So what you get is:

Array
(
    [0] => 
    [1] => <div sdfas="sdfsd" >
    [2] => ksdfsdf"dfsd" dfs 
    [3] => </div>
    [4] =>  
    [5] => <span sdf='dsfs'>
    [6] =>  dfsd 'dsf ds' 
    [7] => </span>
    [8] =>  
)

Then you can piece the whole thing back together, running preg_replace only on the elements that don't contain a < > tag.
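A minimal sketch of that reassembly step might look like the following. The helper name smarten_outside_tags() and the one-line quote rule are assumptions made for illustration; any text-only converter could be passed in instead.

```php
<?php
// Hypothetical helper: applies $convert only to the pieces that are not tags.
function smarten_outside_tags(string $html, callable $convert): string
{
    // PREG_SPLIT_DELIM_CAPTURE keeps the tags in the result, so the pieces
    // alternate: text at even indexes, tags at odd indexes.
    $pieces = preg_split('/(<.*?>)/s', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 0 && $piece !== '') {
            $pieces[$i] = $convert($piece);
        }
    }
    return implode('', $pieces);
}

// Illustrative rule only: pair up straight double quotes.
$convert = function (string $text): string {
    return preg_replace('/"([^"]*)"/', '&#8220;$1&#8221;', $text);
};

$text = "<div sdfas=\"sdfsd\" >ksdfsdf\"dfsd\" dfs </div>";
echo smarten_outside_tags($text, $convert);
```

Relying on the even/odd positions avoids testing each piece for a "<", but like the split itself it assumes no attribute value contains a ">".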

Answer 3

Using A. Lau's suggestion, I think I have a solution, and it turned out to be regex after all, not an XML parser.

Here's my code:

$string = '<p>"This" <b>is</b> a "string" with <a href="http://somewhere.com">quotes</a> in it. <img src="blah.jpg" alt="This is an alt tag"></p><p>Whatever, you know?</p>';

    $new_string = preg_split("/(<.*?>)/",$string, -1, PREG_SPLIT_DELIM_CAPTURE);

    echo "<pre>";
    print_r($new_string);
    echo "</pre>";

    for($i=0;$i<count($new_string);$i++) {
        $str = $new_string[$i];
        if ($str) {
            if (strpos($str,"<") === false) {
                $new_string[$i] = convert_quotes($str);
            }
        }
    }

    $str = join('',$new_string);
    echo $str; 

    function convert_quotes($string) {
        // Patterns run in order; each one pairs with the same index in $replace.
        $pattern = array(
            '/\b"/',          // right double (quote after a word character)
            '/"\b/',          // left double (quote before a word character)
            '/"/',            // any double quote still left: treat as right double
            "/(\w+)'(\w+)/",  // apostrophe inside a word
            "/\b'/",          // right single
            "/'\b/",          // left single
            "/'$/",           // right single at end of string
            "/--/"            // em dash
        );

        $replace = array(
            "&#8221;",        // right double
            "&#8220;",        // left double
            "&#8221;",        // right double
            '$1&#8217;$2',    // apostrophe
            "&#8217;",        // right single
            "&#8216;",        // left single
            "&#8217;",        // right single
            "&#8212;"         // em dash (U+2014); &#151; is its Windows-1252 code
        );
        return preg_replace($pattern, $replace, $string);
    }

That code outputs the following:

Array
(
    [0] => 
    [1] => <p>
    [2] => "This" 
    [3] => <b>
    [4] => is
    [5] => </b>
    [6] =>  a "string" with 
    [7] => <a href="http://somewhere.com">
    [8] => quotes
    [9] => </a>
    [10] =>  in it. 
    [11] => <img src="blah.jpg" alt="This is an alt tag">
    [12] => 
    [13] => </p>
    [14] => 
    [15] => <p>
    [16] => Whatever, you know?
    [17] => </p>
    [18] => 
)

“This” is a “string” with quotes in it. This is an alt tag

Whatever, you know?
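A different technique that avoids the split and rejoin is to let the pattern itself step over tags with PCRE's backtracking control verbs (*SKIP)(*FAIL). This is a sketch under assumptions, not the approach used above: the function name convert_quotes_skipping_tags() is hypothetical, only paired double quotes are handled, and a ">" inside an attribute value would still confuse it.

```php
<?php
// Hypothetical helper: converts paired double quotes, never touching tags.
function convert_quotes_skipping_tags(string $html): string
{
    // "<[^>]*>(*SKIP)(*FAIL)" consumes a whole tag and then forces a failure,
    // so the alternative after "|" is only ever tried outside tags.
    $pattern = '/<[^>]*>(*SKIP)(*FAIL)|"([^"<]*)"/';
    return preg_replace($pattern, '&#8220;$1&#8221;', $html);
}

echo convert_quotes_skipping_tags(
    '<p>"This" is a <a href="http://somewhere.com">"link"</a>.</p>'
);
```

The full set of patterns from convert_quotes() could be adapted the same way by prefixing each one with the tag-skipping alternative.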

3 answers

solution1
0 2016-09-09 03:09:39

solution2
0 2016-09-09 07:00:23

solution3
0 2016-09-10 04:50:00
