Multiple matches with regular expression not being returned

Question

I am using TinyMCE and it is converting all my attribute single quotes to double quotes on cleanup.

This is what I am putting into the editor.

<tr _excel-dimensions='{"row":{"rowHeight":50}}'>
<td _excel-styles='{"font":{"size":20,"color":{"rgb":"333333"},"bold":true},"fill":{"fillType":"solid","startColor":"F0F0F0"},"alignment":{"horizontal":"center"}}' colspan='6'>Affiliate Accounts</td>
</tr>

and this is what the editor does after saving it

<tr _excel-dimensions="{&quot;row&quot;:{&quot;rowHeight&quot;:50}}">
<td _excel-styles="{&quot;font&quot;:{&quot;size&quot;:20,&quot;color&quot;:{&quot;rgb&quot;:&quot;333333&quot;},&quot;bold&quot;:true},&quot;fill&quot;:{&quot;fillType&quot;:&quot;solid&quot;,&quot;startColor&quot;:&quot;F0F0F0&quot;},&quot;alignment&quot;:{&quot;horizontal&quot;:&quot;center&quot;}}" colspan="6">Accounts</td>
</tr>

There doesn't seem to be a way to override this setting in TinyMCE, so I am turning to a regex in PHP when saving the data to the database. This is what I have so far, but it doesn't seem to capture all the double quotes.

$content = preg_replace_callback('/<(.*)(\")(.*)(\")(.*)>/miU', function($matches) {
  return "<" . $matches[1] . "'" . html_entity_decode($matches[3]) . "'" . $matches[5] . ">";
}, $content);

It replaces the quotes around the JSON-encoded string, but not those around colspan="6".

Thanks in advance for the help.

Answer 1

As I said in the comment, parsing HTML with regex is not a good idea; it is better to use a dedicated library such as PHP Simple HTML DOM Parser. However, it is possible to construct a regex that works on well-formed HTML.

Our goal is to find all double-quoted strings inside a tag. First, let's set aside the requirement that the double-quoted string must be inside a tag. Then we can use this:

$content = preg_replace_callback('/"(.*?)"/',
  function($matches) {
    // Decode &quot; and other entities, then wrap the value in single quotes
    return "'" . html_entity_decode($matches[1]) . "'";
  },
  $content);

Now we need to add a check that the double-quoted string is inside a tag. To do this, we construct a lookahead expression that checks the text between our double-quoted string and the end of the text:

First, there must be a tag-closing > . That means a sequence of non- < , non- > characters followed by > . The corresponding regex is [^<>]*>
Next, it may be followed by any number of complete tags, each running from < to > . The regex for a chunk of text containing a single tag is [^<]*<[^>]*> . We need to repeat this group any number of times: (?:[^<]*<[^>]*>)*
Finally, there may be some non- < , non- > characters before the end of the text: [^<>]*$

The resulting lookahead expression looks a bit terrifying, but it does the job: (?=[^<>]*>(?:[^<]*<[^>]*>)*[^<>]*$) .
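To see what the lookahead accepts on its own, here is a minimal sketch that applies it to every double quote in a small sample and records the offsets that pass. The sample string and the variable names ($insideTag, $html, $offsets) are made up for illustration; only the lookahead itself comes from the answer.

```php
<?php
// Lookahead from the answer: the remaining text must first close the current
// tag, then consist only of complete tags and plain text up to the end.
$insideTag = '(?=[^<>]*>(?:[^<]*<[^>]*>)*[^<>]*$)';

// Hypothetical sample: one quoted value in an attribute, one in the text.
$html = '<td colspan="6">say "hi"</td>';

// Find every double quote for which the lookahead succeeds.
preg_match_all('/"' . $insideTag . '/', $html, $m, PREG_OFFSET_CAPTURE);

// Keep only the byte offsets of the matched quotes.
$offsets = array_column($m[0], 1);

print_r($offsets); // offsets of the two quotes around 6
```

Only the two quotes inside the td tag pass the check; the quotes around hi are followed by < before any > , so the lookahead fails for them.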

Finally, we incorporate this lookahead check into the original regex:

$content = preg_replace_callback('/"(?=[^<>]*>(?:[^<]*<[^>]*>)*[^<>]*$)(.*?)"/',
  function($matches) {
    // Decode &quot; and other entities, then wrap the value in single quotes
    return "'" . html_entity_decode($matches[1]) . "'";
  },
  $content);

You can check it here: Regex101 demo
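For a quick local check, the final regex can be wrapped in a small function and run against a shortened version of the markup from the question. This is only a sketch: the function name requoteAttributes and the sample string are assumptions for illustration, and it relies on the input being well-formed HTML as noted above.

```php
<?php
// Hypothetical helper: turns double-quoted attribute values into
// single-quoted ones and decodes entities such as &quot; inside them.
function requoteAttributes(string $content): string
{
    // The lookahead right after the opening quote ensures we are inside a tag.
    $pattern = '/"(?=[^<>]*>(?:[^<]*<[^>]*>)*[^<>]*$)(.*?)"/';

    return preg_replace_callback(
        $pattern,
        function (array $matches): string {
            return "'" . html_entity_decode($matches[1]) . "'";
        },
        $content
    );
}

// Shortened sample based on the question, plus quoted text outside a tag.
$html = '<tr _excel-dimensions="{&quot;row&quot;:{&quot;rowHeight&quot;:50}}">' . "\n"
      . '<td colspan="6">Accounts "A"</td>' . "\n"
      . '</tr>';

$result = requoteAttributes($html);
echo $result, "\n";
```

Both attributes come back single-quoted with the JSON decoded, while the quoted "A" in the cell text is left alone. One caveat: if a decoded value itself contains an apostrophe, the single-quoted attribute would be broken, so such values would need extra handling.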
