
Removing \r, \n and escape characters from HTML

I have the following HTML, which I extracted from an email using imap_fetchbody:

<div dir=\"ltr\"><br><div class=\"gmail_quote\"><div dir=\"ltr\"><br><div class=\"gmail_quote\"><div class=\"\">
---------- Forwarded message ----------<br>
<span style=\"font-family:&quot;Helvetica&quot;,&quot;sans-serif&quot;\"><\/span>
From: <span style=\"font-family:&quot;Helvetica&quot;,&quot;sans-serif&quot;\">&quot;
<span>xyz<\/span>&quot; &lt;<a href=\"mailto:support@xyz.com\" target=\"_blank\">support@<span>xyz<\/span>.com<\/a>&gt;<\/span><br>
\r\n\r\n\r\n\r\nDate: Fri, Apr 18, 2014 at 7:17 PM<br>
Subject: Bla bla xyz<br><\/div><div><div class=\"h5\">To: XYZ &lt;<a href=\"mailto:xyz@gmail.com\" target=\"_blank\">xyz@gmail.com<\/a>&gt;<br><br><br>\r\n\r\n<div dir=\"ltr\">\r\n\r\n\r\n\r\n
<div class=\"gmail_quote\"><div><div><div dir=\"ltr\"><div class=\"gmail_quote\"><div dir=\"ltr\"><div><div class=\"gmail_quote\">
<div dir=\"ltr\"><div><div><div class=\"gmail_quote\"><div style=\"word-wrap:break-word\" lang=\"EN-US\">\r\n\r\n\r\n\r\n
<div>
<div>
<div>
<blockquote style=\"margin-top:5pt;margin-bottom:5pt\">
<div><div>
<table style=\"width:100%;background:none repeat scroll 0% 0% rgb(207,207,207)\" cellpadding=\"0\" cellspacing=\"0\" border=\"0\" width=\"100%\">
<tbody>
<tr>\r\n\r\n\r\n\r\n
<td style=\"width:325pt;padding:0in\" width=\"650\">\r\n\r\n<div align=\"center\"><table style=\"width:325pt;background:none repeat scroll 0% 0% rgb(207,207,207)\" cellpadding=\"0\" cellspacing=\"0\" border=\"0\" width=\"650\">\r\n\r\n\r\n\r\n
<tbody><tr>
<td style=\"padding:0in 0in 5.25pt\"><p style=\"text-align:center\" align=\"center\">
<span style=\"font-size:7.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:rgb(64,64,64)\">If you are unable to see this message, 
<a href=\"http:\/\/click.e.xyz.com\/?qs=3771d7c90c958f02a4b2e78494f12a3116ddb15df79b8d04cdf5aeba42012b118\" target=\"_blank\">
<span style=\"color:rgb(64,64,64)\">click here<\/span><\/a> to view.<br>
\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nTo ensure delivery to your inbox, please add <a href=\"mailto:support@xyz.com\" target=\"_blank\">support@xyz.com<\/a> to your address book. <\/span><\/p>
<\/td>
<\/tr>
<\/tbody>
<\/table>
<\/div><\/div><\/div><\/div>

I want to get rid of all the \ , \r and \n sequences while keeping the < and > of the HTML as they are. I have tried stripslashes, stripcslashes, nl2br and htmlspecialchars_decode, but none of them gives me what I want. Here is what I tried, together with the imap_qprint function:

$text = stripslashes(imap_qprint($text));
$body = preg_replace('/(\v|\s)+/', ' ', $text );

Result: it doesn't remove all the whitespace characters.

Match the following regex:

(\\r|\\n|\\) with the g modifier

and replace with

'' (empty string)

Demo: http://regex101.com/r/mS3wM2

$html = preg_replace('/[\\\\\r\n]/', '', $html);

Match a single character present in the list below «[\\\r\n]»
   A \ character «\\»
   A carriage return character «\r»
   A line feed character «\n»
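
As a quick check of what this character class does, here is a minimal sketch on a made-up sample (the `$sample` string is an assumption, not taken from the question). It suggests one caveat: when the input contains the two-character sequences backslash-r and backslash-n rather than real line breaks, only the backslash is removed and the letters r and n stay behind.

```php
<?php
// Made-up sample: single quotes keep \" \/ \r \n as literal two-character
// sequences; the double-quoted part appends a real CR LF pair.
$sample = '<div class=\"a\">\r\nHi<\/div>' . "\r\n";

// Removes every backslash, every real carriage return and every real line feed.
$result = preg_replace('/[\\\\\r\n]/', '', $sample);

echo $result; // <div class="a">rnHi</div>
```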

UPDATE:

Based on your comment I've updated my answer:

$html = preg_replace('%\\\\/%sm', '/', $html);
$html = preg_replace('/\\\\"/sm', '"', $html);
$html = preg_replace('/[\r\n]/sm', '', $html);
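
For illustration, the three replacements can be wrapped in one function and tried on a short sample. The function name `unescape_html_fragment` and the sample string are assumptions made for this sketch; only the three `preg_replace` calls come from the answer.

```php
<?php
// Hypothetical wrapper name; the three calls are the ones from the answer.
function unescape_html_fragment($html)
{
    $html = preg_replace('%\\\\/%sm', '/', $html); // \/ becomes /
    $html = preg_replace('/\\\\"/sm', '"', $html); // \" becomes "
    $html = preg_replace('/[\r\n]/sm', '', $html); // drop real CR and LF
    return $html;
}

// Single quotes keep the backslashes literal; "\r\n" adds a real line break.
$sample = '<a href=\"mailto:support@xyz.com\">support<\/a>' . "\r\n";
echo unescape_html_fragment($sample); // <a href="mailto:support@xyz.com">support</a>
```

Note that none of the three patterns matches a literal backslash followed by r or n, so such two-character sequences would pass through unchanged.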

If string functions can do the trick, always favor them over regexes. Performance will be better than with a regex, and they are easier to read in the code:

$message = str_replace("\r\n", '', $message ); // replace all newlines, use double quotes!
$message = stripslashes( $message );

First you have to remove the newlines. As far as I can tell, \r and \n always come together, so I replace them in one go. After that, stripslashes removes all the escaping backslashes.
You have to call stripslashes after removing the newlines; otherwise \r\n would turn into rn, making them harder to find.
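
It is not always obvious whether the input holds real line breaks or the literal two-character sequences. A minimal sketch that covers both, keeping the same order (newlines first, then `stripslashes`), could look like the following. The helper name `strip_newlines_and_slashes` and the sample string are assumptions for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper name; not from the original answer.
function strip_newlines_and_slashes(string $message): string
{
    // Single-quoted entries: the literal sequences backslash-r and backslash-n.
    // Double-quoted entries: real carriage return and line feed characters.
    $message = str_replace(['\r\n', '\r', '\n', "\r", "\n"], '', $message);

    // What is left are backslashes escaping quotes and slashes; remove them.
    return stripslashes($message);
}

// Literal \r\n inside the single quotes, plus one real CR LF at the end.
$sample = '<a href=\"http:\/\/example.com\/\">x<\/a>\r\n' . "\r\n";
echo strip_newlines_and_slashes($sample); // <a href="http://example.com/">x</a>
```

One trade-off of this sketch: removing every literal backslash-n also alters text where a backslash is legitimately followed by n or r, so it only fits input known to be escaped this way.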


This works perfectly in my tests:

echo '<textarea style="width:100%; height: 33%;">'.$message.'</textarea>';
echo '<hr />';

$message = str_replace("\r\n", '', $message); // use double quotes!
echo '<textarea style="width:100%; height: 33%;">'.$message.'</textarea>';
echo '<hr />';

$message = stripslashes($message);
echo '<textarea style="width:100%; height: 33%;">'.$message.'</textarea>';

You could use something like this to interpret the escape sequences:

function interpret_escapes($str) {
    return preg_replace_callback('/\\\\(.)/u', function($matches) {
        $map = ['n' => "\n", 'r' => "\r", 't' => "\t", 'v' => "\v", 'e' => "\e", 'f' => "\f"];
        return isset($map[$matches[1]]) ? $map[$matches[1]] : $matches[1];
    }, $str);
}
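
A possible way to use it for the question's goal is to interpret the escapes first and then drop the resulting real line breaks. In this sketch the function is repeated so the snippet runs on its own, and the sample string is made up for illustration.

```php
<?php
// Same function as above, repeated so this snippet is self-contained.
function interpret_escapes($str) {
    return preg_replace_callback('/\\\\(.)/u', function($matches) {
        $map = ['n' => "\n", 'r' => "\r", 't' => "\t", 'v' => "\v", 'e' => "\e", 'f' => "\f"];
        return isset($map[$matches[1]]) ? $map[$matches[1]] : $matches[1];
    }, $str);
}

// Made-up sample containing literal \" \/ \r \n sequences.
$raw = '<p class=\"x\">Hi<\/p>\r\n';

$decoded = interpret_escapes($raw);                 // \r\n is now a real CR LF
$clean   = str_replace(["\r", "\n"], '', $decoded); // remove the real line breaks

echo $clean; // <p class="x">Hi</p>
```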

If you can open the file in vi, it would be as easy as:

%s/\\r\|\\n//g

in vi command mode.
