Remove HTML tags and newline characters with Regex

I want to replace HTML tags and newline characters with a <br> tag. To do so, I used the following code, but it does not replace \r\n.

const newText = text
    .replace(/<script.*?<\/script>/g, '<br>')
    .replace(/<style.*?<\/style>/g, '<br>')
    .replace(/(<([^>]+)>)/ig, "<br>")
    .replace(/(?:\r\n|\r|\n)/g, '<br>')


An example of the text:

<div class="text-danger ng-binding" ng-bind-html="message.causedBy ">javax.xml.ws.soap.SOAPFaultException: Response was of unexpected text/html ContentType.  Incoming portion of HTML stream: \r\n\r\n\r\n\r\n500 - Internal server error.\r\n\r\n\r\n\r\n<div><h1>Server Error</h1></div>\r\n<div>\r\n <div class="\&quot;content-container\&quot;">\r\n  <h2>500 - Internal server error.</h2>\r\n  <h3>There is a problem with the resource you are looking for, and it cannot be displayed.</h3>\r\n </div>\r\n</div>\r\n\r\n\r\n\n\t</div>

I would appreciate your help. (:

This works for me. Is each of your CRLF characters (\r, \n) a single escaped character, or two literal characters, a backslash followed by r or n?

If your HTML elements contain the literal characters \n and \r, that would be really odd inside a div unless you are displaying source code. Plain old line breaks, written with a single escape character, are replaced as expected.

Also, it's not clear whether your source is pulled from an element or is static text.

You might have to handle the literal case in your regex by escaping the backslashes:

replace(/(?:\\r\\n|\\r|\\n)/g, '<br>')
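To make the distinction concrete, here is a minimal sketch, assuming the input may contain either real line-break characters or the two-character sequences written out literally. The helper name breaksToBr and the sample strings are hypothetical and do not come from the question.

```javascript
// Hypothetical helper (an assumption, not part of the original answer):
// handles both real line breaks and the literal two-character
// sequences backslash + "r" / backslash + "n".
function breaksToBr(input) {
  return input
    // literal backslash followed by r or n, e.g. text copied from an escaped dump
    .replace(/\\r\\n|\\r|\\n/g, '<br>')
    // real carriage return / line feed characters
    .replace(/\r\n|\r|\n/g, '<br>');
}

const realBreaks = 'a\r\nb\nc';                 // contains actual CR and LF characters
const literalBreaks = String.raw`a\r\nb\nc`;    // contains backslash + letter pairs

console.log(breaksToBr(realBreaks));    // a<br>b<br>c
console.log(breaksToBr(literalBreaks)); // a<br>b<br>c
```

Replacing the literal sequences is only safe when you know the text was escaped that way; otherwise a string such as C:\new would be mangled.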

const text = ` <div class="text-danger ng-binding" ng-bind-html="message.causedBy ">javax.xml.ws.soap.SOAPFaultException: Response was of unexpected text/html ContentType. Incoming portion of HTML stream: \r\n\r\n\r\n\r\n500 - Internal server error.\r\n\r\n\r\n\r\n<div><h1>Server Error</h1></div>\r\n<div>\r\n <div class="\&quot;content-container\&quot;">\r\n <h2>500 - Internal server error.</h2>\r\n <h3>There is a problem with the resource you are looking for, and it cannot be displayed.</h3>\r\n </div>\r\n</div>\r\n\r\n\r\n\n\t</div>`

const newText = text
    .replace(/<script.*?<\/script>/g, '<br>')
    .replace(/<style.*?<\/style>/g, '<br>')
    .replace(/(<([^>]+)>)/ig, "<br>")
    .replace(/(?:\r\n|\r|\n)/g, '<br>')
    //.replace(/(?:\\r\\n|\\r|\\n)/g, '<br>')
console.log(newText)

const text2 = document.getElementById('text').innerHTML
const newText2 = text2
    .replace(/<script.*?<\/script>/g, '<br>')
    .replace(/<style.*?<\/style>/g, '<br>')
    .replace(/(<([^>]+)>)/ig, "<br>")
    .replace(/(?:\r\n|\r|\n)/g, '<br>')
    //.replace(/(?:\\r\\n|\\r|\\n)/g, '<br>')
console.log(newText2)

<div id='text'> This is <script>// nothing here </script> a div These are literal \r\n\r\n and will not get escaped unless you uncomment the special case. </div>

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML.

And so on.

Instead, you have a parser at your fingertips. Use it!

var tmp = document.createElement('div');
tmp.innerHTML = text;

// replace all start/end tags with <br> for... some reason, I guess!
Array.from(tmp.getElementsByTagName("*")).forEach(function(elem) {
    // ignore <br> tags
    if( elem.nodeName.match(/^br$/i)) {
        // do nothing
    }
    // outright remove <script> and <style>
    else if( elem.nodeName.match(/^(?:script|style)$/i)) {
        elem.parentNode.replaceChild(document.createElement('br'), elem);
    }
    // replace element with its contents and place a <br> before and after
    else {
        elem.parentNode.insertBefore(document.createElement('br'), elem);
        while(elem.firstChild) {
            elem.parentNode.insertBefore(elem.firstChild, elem);
        }
        elem.parentNode.replaceChild(document.createElement('br'), elem);
    }
});

var html = tmp.innerHTML;
// since replacing newlines with <br> is a string operation, go ahead and use regex for that
// the g flag is needed so that every newline is replaced, not just the first
html = html.replace(/\r?\n/g, "<br />");
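One detail worth checking in that final string operation: String.prototype.replace with a regular expression replaces only the first match unless the pattern carries the g flag. The sketch below illustrates the difference; the sample string and variable names are made up for the illustration.

```javascript
// Illustrative input (an assumption): one CRLF and one bare LF.
const sample = 'one\r\ntwo\nthree';

// Without the g flag only the first line break is replaced.
const firstOnly = sample.replace(/\r?\n/, '<br />');

// With the g flag every line break is replaced.
const everyBreak = sample.replace(/\r?\n/g, '<br />');

// JSON.stringify makes any remaining newline visible as \n.
console.log(JSON.stringify(firstOnly));  // "one<br />two\nthree"
console.log(JSON.stringify(everyBreak)); // "one<br />two<br />three"
```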

Just replace everything that matches the pattern (<[^>]+>|\r|\n) with an empty string.

It is a simple alternation, where \r is a carriage return and \n is a newline character, so it removes all line breaks, which are sometimes combinations of \r and \n.

<[^>]+> matches an HTML tag: a <, one or more characters other than >, and the closing >.
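As a rough sketch of how this pattern behaves, the snippet below applies it to a fragment of the error page from the question and to one edge case. The variable names and the second input are assumptions added for illustration.

```javascript
// The pattern from this answer, with the g flag so every match is removed.
const pattern = /<[^>]+>|\r|\n/g;

// A fragment taken from the question's sample text.
const input = '<div>\r\n <h2>500 - Internal server error.</h2>\r\n</div>';
const stripped = input.replace(pattern, '');
console.log(JSON.stringify(stripped)); // " 500 - Internal server error."

// Edge case (illustrative): a > inside an attribute value ends the match early,
// so part of the tag is left behind.
const tricky = '<a title="a > b">link</a>';
const leftover = tricky.replace(pattern, '');
console.log(JSON.stringify(leftover)); // " b\">link"
```

The second result shows where the pattern falls short: it treats the first > as the end of the tag, so markup with > inside attribute values is not stripped cleanly.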
