Strip HTML from Text JavaScript

Is there an easy way to take a string of HTML in JavaScript and strip out the HTML?

If you're running in a browser, then the easiest way is just to let the browser do it for you...

function stripHtml(html)
{
   let tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

Note: as folks have noted in the comments, this is best avoided if you don't control the source of the HTML (for example, don't run this on anything that could have come from user input). For those scenarios, you can still let the browser do the work for you: see Saba's answer on using the now widely available DOMParser.

myString.replace(/<[^>]*>?/gm, '');

Simplest way:

jQuery(html).text();

That retrieves all the text from a string of HTML.

I would like to share an edited version of Shog9's approved answer.


As Mike Samuel pointed out in a comment, that function can execute inline JavaScript code.
But Shog9 is right when saying "let the browser do it for you..."

So here is my edited version, using DOMParser:

function strip(html){
   let doc = new DOMParser().parseFromString(html, 'text/html');
   return doc.body.textContent || "";
}

Here is the code to test the inline JavaScript:

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

Also, it does not request resources on parse (like images):

strip("Just text <img src='https://assets.rbl.ms/4155638/980x.jpg'>")

As an extension to the jQuery method, if your string might not contain HTML (e.g. if you are trying to remove HTML from a form field)

jQuery(html).text();

will return an empty string if there is no HTML.

Use:

jQuery('<p>' + html + '</p>').text();

instead.

Update: As has been pointed out in the comments, in some circumstances this solution will execute JavaScript contained within html. If the value of html could be influenced by an attacker, use a different solution.

Converting HTML to plain text for emailing, keeping hyperlinks (a href) intact

The above function posted by hypoxide works fine, but I was after something that would convert HTML created in a web rich-text editor (for example FCKEditor), clearing out all the HTML but leaving the links. I wanted both the HTML and the plain-text version in order to create the correct parts of an SMTP email (both HTML and plain text).

After a long time searching Google, my colleagues and I came up with this using the regex engine in JavaScript:

str='this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>';
// Replace line breaks and paragraph openings with newlines
str=str.replace(/<br>/gi, "\n");
str=str.replace(/<p\b[^>]*>/gi, "\n");
// Keep each link as "text (Link->url)"
str=str.replace(/<a\b[^>]*href="(.*?)"[^>]*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
// Remove every remaining tag
str=str.replace(/<(?:.|\s)*?>/g, "");

The str variable starts out like this:

this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>

and then after the code has run it looks like this:

this string has html code i want to remove
Link Number 1 -> BBC (Link->http://www.bbc.co.uk)  Link Number 1

Now back to normal text and stuff

As you can see, all the HTML has been removed and the link has been preserved, with the hyperlinked text still intact. I have also replaced the <p> and <br> tags with \n (newline char) so that some visual formatting is retained.

To change the link format (e.g. BBC (Link->http://www.bbc.co.uk)), just edit the $2 (Link->$1), where $1 is the href URL/URI and $2 is the hyperlinked text. With the links directly in the body of the plain text, most SMTP mail clients convert them so that the user can click on them.
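
As a rough illustration of changing the link format, the sketch below wraps the anchor replacement in a helper that takes the replacement template as a parameter. The name formatLinks is hypothetical and not part of the original answer, and the pattern is a stricter variant that assumes double-quoted href attributes.

```javascript
// Hypothetical helper: rewrite <a href="...">text</a> using a replacement template.
// In the template, $1 is the href value and $2 is the link text.
function formatLinks(html, template) {
  return html.replace(/<a\b[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/gi, template);
}

const sample = 'Read <a href="http://www.bbc.co.uk">BBC</a> today';

// Format used in the answer
console.log(formatLinks(sample, "$2 (Link->$1)"));
// Alternative format with the URL in angle brackets
console.log(formatLinks(sample, "$2 <$1>"));
```

Only the template string changes between the two calls; the capture groups stay the same.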

Hope you find this useful.

An improvement to the accepted answer.

function strip(html)
{
   var tmp = document.implementation.createHTMLDocument("New").body;
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

This way something like this will do no harm:

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

Firefox, Chromium and Explorer 9+ are safe. Opera Presto is still vulnerable. Also, images mentioned in the strings are not downloaded in Chromium and Firefox, saving HTTP requests.

This should do the work in any JavaScript environment (Node.js included).

    const text = `
    <html lang="en">
      <head>
        <style type="text/css">*{color:red}</style>
        <script>alert('hello')</script>
      </head>
      <body><b>This is some text</b><br/></body>
    </html>`;

    const stripped = text
        // Remove style tags and content ([\s\S] also matches newlines)
        .replace(/<style[^>]*>[\s\S]*?<\/style>/g, '')
        // Remove script tags and content
        .replace(/<script[^>]*>[\s\S]*?<\/script>/g, '')
        // Remove all opening, closing and orphan HTML tags
        .replace(/<[^>]+>/g, '')
        // Remove leading spaces and repeated CR/LF
        .replace(/([\r\n]+ +)+/g, '');

I altered Jibberboy2000's answer to include several <BR /> tag formats, remove everything inside <SCRIPT> and <STYLE> tags, format the resulting text by removing multiple line breaks and spaces, and convert some HTML-encoded characters back to normal. After some testing it appears that you can convert most full web pages into simple text in which the page title and content are retained.

In the simple example,

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<!--comment-->

<head>

<title>This is my title</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style>

    body {margin-top: 15px;}
    a { color: #D80C1F; font-weight:bold; text-decoration:none; }

</style>
</head>

<body>
    <center>
        This string has <i>html</i> code i want to <b>remove</b><br>
        In this line <a href="http://www.bbc.co.uk">BBC</a> with link is mentioned.<br/>Now back to &quot;normal text&quot; and stuff using &lt;html encoding&gt;                 
    </center>
</body>
</html>

becomes

This is my title

This string has html code i want to remove

In this line BBC (http://www.bbc.co.uk) with link is mentioned.

Now back to "normal text" and stuff using <html encoding>

The JavaScript function and test page look like this:

function convertHtmlToText() {
    var inputText = document.getElementById("input").value;
    var returnText = "" + inputText;

    //-- remove BR tags and replace them with line break
    returnText=returnText.replace(/<br>/gi, "\n");
    returnText=returnText.replace(/<br\s\/>/gi, "\n");
    returnText=returnText.replace(/<br\/>/gi, "\n");

    //-- remove P and A tags but preserve what's inside of them
    returnText=returnText.replace(/<p.*>/gi, "\n");
    returnText=returnText.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 ($1)");

    //-- remove all inside SCRIPT and STYLE tags
    returnText=returnText.replace(/<script.*>[\w\W]{1,}(.*?)[\w\W]{1,}<\/script>/gi, "");
    returnText=returnText.replace(/<style.*>[\w\W]{1,}(.*?)[\w\W]{1,}<\/style>/gi, "");
    //-- remove all else
    returnText=returnText.replace(/<(?:.|\s)*?>/g, "");

    //-- get rid of more than 2 multiple line breaks:
    returnText=returnText.replace(/(?:(?:\r\n|\r|\n)\s*){2,}/gim, "\n\n");

    //-- get rid of more than 2 spaces:
    returnText = returnText.replace(/ +(?= )/g,'');

    //-- get rid of html-encoded characters:
    returnText=returnText.replace(/&nbsp;/gi," ");
    returnText=returnText.replace(/&amp;/gi,"&");
    returnText=returnText.replace(/&quot;/gi,'"');
    returnText=returnText.replace(/&lt;/gi,'<');
    returnText=returnText.replace(/&gt;/gi,'>');

    //-- return
    document.getElementById("output").value = returnText;
}

It was used with this HTML:

<textarea id="input" style="width: 400px; height: 300px;"></textarea><br />
<button onclick="convertHtmlToText()">CONVERT</button><br />
<textarea id="output" style="width: 400px; height: 300px;"></textarea><br />
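
The entity-decoding step at the end of convertHtmlToText runs five separate replacements, with &amp; handled before &lt; and &gt;, so doubly encoded input such as &amp;lt; would be decoded twice. A minimal sketch of a single-pass alternative follows. The names BASIC_ENTITIES and decodeBasicEntities are hypothetical, and only the same five entities are covered.

```javascript
// Sketch: decode a handful of named entities in a single pass so that
// already-decoded output is never scanned again.
const BASIC_ENTITIES = { nbsp: " ", amp: "&", quot: '"', lt: "<", gt: ">" };

function decodeBasicEntities(text) {
  return text.replace(/&(nbsp|amp|quot|lt|gt);/gi, (match, name) =>
    BASIC_ENTITIES[name.toLowerCase()]
  );
}

console.log(decodeBasicEntities("&quot;normal text&quot; &amp; &lt;html encoding&gt;"));
// "&amp;lt;" is the encoded form of the literal text "&lt;" and is decoded only once
console.log(decodeBasicEntities("&amp;lt;"));
```
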

var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");

This is a regex version, which is more resilient to malformed HTML, like:

Unclosed tags

Some text <img

"<", ">" inside tag attributes

Some text <img alt="x > y">

Newlines

Some <a href="http://google.com">

The code

var html = '<br>This <img alt="a>b" \r\n src="a_b.gif" />is > \nmy<>< > <a>"text"</a'
var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
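
To see how the pattern behaves on the malformed cases listed above, here is a small sketch that runs it over each of them. The wrapper name stripTags and the sample strings are illustrative. The last call shows a limitation: a bare "<" in ordinary text is treated as the start of an unclosed tag.

```javascript
// Same pattern as in the answer: quoted attribute values are consumed whole,
// and a tag may end either at ">" or at the end of the string.
const TAG_PATTERN = /<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g;

function stripTags(html) {
  return html.replace(TAG_PATTERN, "");
}

console.log(JSON.stringify(stripTags("Some text <img")));                // unclosed tag
console.log(JSON.stringify(stripTags('Some text <img alt="x > y">')));   // ">" inside an attribute
console.log(JSON.stringify(stripTags('Some <a\nhref="http://google.com">link</a>'))); // newline inside a tag
console.log(JSON.stringify(stripTags("1 < 2")));                         // caveat: text after "<" is dropped
```
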

From CSS-Tricks:

https://css-tricks.com/snippets/javascript/strip-html-tags-in-javascript/

const originalString = `
  <div>
    <p>Hey that's <span>somthing</span></p>
  </div>
`;
const strippedString = originalString.replace(/(<([^>]+)>)/gi, "");
console.log(strippedString);

Another, admittedly less elegant solution than nickf's or Shog9's, would be to recursively walk the DOM starting at the <body> tag and append each text node.

var bodyContent = document.getElementsByTagName('body')[0];
var result = appendTextNodes(bodyContent);

function appendTextNodes(element) {
    var text = '';

    // Loop through the childNodes of the passed in element
    for (var i = 0, len = element.childNodes.length; i < len; i++) {
        // Get a reference to the current child
        var node = element.childNodes[i];
        // Append the node's value if it's a text node
        if (node.nodeType == 3) {
            text += node.nodeValue;
        }
        // Recurse through the node's children, if there are any
        if (node.childNodes.length > 0) {
            text += appendTextNodes(node);
        }
    }
    // Return the final result
    return text;
}
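
The walk can be tried outside a browser with plain objects standing in for DOM nodes. This is only a sketch: fakeBody and collectText are illustrative names, and the objects mimic just the nodeType, nodeValue and childNodes properties used above. Note that the value returned from each recursive call has to be appended to the running text.

```javascript
// Stand-in for a DOM tree: nodeType 1 = element, 3 = text (same codes as the DOM).
const fakeBody = {
  nodeType: 1,
  childNodes: [
    { nodeType: 3, nodeValue: "Hello ", childNodes: [] },
    {
      nodeType: 1,
      childNodes: [{ nodeType: 3, nodeValue: "world", childNodes: [] }],
    },
    { nodeType: 3, nodeValue: "!", childNodes: [] },
  ],
};

function collectText(node) {
  let text = "";
  for (const child of node.childNodes) {
    if (child.nodeType === 3) {
      // Text node: take its value directly
      text += child.nodeValue;
    } else {
      // Element: append whatever text its descendants contain
      text += collectText(child);
    }
  }
  return text;
}

console.log(collectText(fakeBody)); // "Hello world!"
```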

If you want to keep the links and the structure of the content (h1, h2, etc.) then you should check out TextVersionJS. You can use it with any HTML, although it was created to convert an HTML email to plain text.

The usage is very simple. For example in node.js:

var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";

var textVersion = createTextVersion(yourHtml);

Or in the browser with pure js:

<script src="textversion.js"></script>
<script>
  var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
  var textVersion = createTextVersion(yourHtml);
</script>

It also works with require.js:

define(["textversionjs"], function(createTextVersion) {
  var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
  var textVersion = createTextVersion(yourHtml);
});

It is also possible to use the fantastic htmlparser2 pure JS HTML parser. Here is a working demo:

var htmlparser = require('htmlparser2');

var body = '<p><div>This is </div>a <span>simple </span> <img src="test"></img>example.</p>';

var result = [];

var parser = new htmlparser.Parser({
    ontext: function(text){
        result.push(text);
    }
}, {decodeEntities: true});

parser.write(body);
parser.end();

result.join('');

The output will be This is a simple example.

See it in action here: https://tonicdev.com/jfahrenkrug/extract-text-from-html

This works in both Node and the browser if you pack your web application using a tool like webpack.

A lot of people have answered this already, but I thought it might be useful to share the function I wrote that strips HTML tags from a string but allows you to include an array of tags that you do not want stripped. It's pretty short and has been working nicely for me.

function removeTags(string, array){
  return array ? string.split("<").filter(function(val){ return f(array, val); }).map(function(val){ return f(array, val); }).join("") : string.split("<").map(function(d){ return d.split(">").pop(); }).join("");
  function f(array, value){
    return array.map(function(d){ return value.includes(d + ">"); }).indexOf(true) != -1 ? "<" + value : value.split(">")[1];
  }
}

var x = "<span><i>Hello</i> <b>world</b>!</span>";
console.log(removeTags(x)); // Hello world!
console.log(removeTags(x, ["span", "i"])); // <span><i>Hello</i> world!</span>

For a simpler solution, try this: https://css-tricks.com/snippets/javascript/strip-html-tags-in-javascript/

var StrippedString = OriginalString.replace(/(<([^>]+)>)/ig,"");

After trying all of the answers mentioned, I found that most if not all of them had edge cases and couldn't completely support my needs.

I started exploring how PHP does it and came across the php.js lib, which replicates the strip_tags method here: http://phpjs.org/functions/strip_tags/

function stripHTML(my_string){
    var charArr   = my_string.split(''),
        resultArr = [],
        htmlZone  = 0,
        quoteZone = 0;
    for( var x=0; x < charArr.length; x++ ){
     switch( charArr[x] + htmlZone + quoteZone ){
       case "<00" : htmlZone  = 1;break;
       case ">10" : htmlZone  = 0;resultArr.push(' ');break;
       case '"10' : quoteZone = 1;break;
       case "'10" : quoteZone = 2;break;
       case '"11' : 
       case "'12" : quoteZone = 0;break;
       default    : if(!htmlZone){ resultArr.push(charArr[x]); }
     }
    }
    return resultArr.join('');
}

Accounts for > inside attributes and <img onerror="javascript"> in newly created DOM elements.

usage:

clean_string = stripHTML("string with <html> in it")

demo:

https://jsfiddle.net/gaby_de_wilde/pqayphzd/

demo of top answer doing the terrible things:

https://jsfiddle.net/gaby_de_wilde/6f0jymL6/1/

I made some modifications to the original Jibberboy2000 script. Hope it'll be useful for someone.

str = '**ANY HTML CONTENT HERE**';

str=str.replace(/<\s*br\/*>/gi, "\n");
str=str.replace(/<\s*a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
str=str.replace(/<\s*\/*.+?>/ig, "\n");
str=str.replace(/ {2,}/gi, " ");
str=str.replace(/\n+\s*/gi, "\n\n");

Here's a version which sorta addresses @MikeSamuel's security concern:

function strip(html)
{
   try {
       var doc = document.implementation.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
       doc.documentElement.innerHTML = html;
       return doc.documentElement.textContent||doc.documentElement.innerText;
   } catch(e) {
       return "";
   }
}

Note, it will return an empty string if the HTML markup isn't valid XML (that is, tags must be closed and attributes must be quoted). This isn't ideal, but it does avoid the potential security exploit.

If you need to handle markup that is not valid XML, you could try using:

var doc = document.implementation.createHTMLDocument("");

but that isn't a perfect solution either, for other reasons.

You can safely strip HTML tags using the iframe sandbox attribute.

The idea here is that instead of trying to regex our string, we take advantage of the browser's native parser by injecting the text into a DOM element and then querying the textContent / innerText property of that element.

The best-suited element in which to inject our text is a sandboxed iframe; that way we can prevent any arbitrary code execution (also known as XSS).

The downside of this approach is that it only works in browsers.

Here's what I came up with (not battle-tested):

const stripHtmlTags = (() => {
  const sandbox = document.createElement("iframe");
  sandbox.sandbox = "allow-same-origin"; // <--- This is the key
  sandbox.style.setProperty("display", "none", "important");

  // Inject the sandbox in the current document
  document.body.appendChild(sandbox);

  // Get the sandbox's context
  const sanboxContext = sandbox.contentWindow.document;

  return (untrustedString) => {
    if (typeof untrustedString !== "string") return ""; 

    // Write the untrusted string in the iframe's body
    sanboxContext.open();
    sanboxContext.write(untrustedString);
    sanboxContext.close();

    // Get the string without html
    return sanboxContext.body.textContent || sanboxContext.body.innerText || "";
  };
})();

Usage (demo):

console.log(stripHtmlTags(`<img onerror='alert("could run arbitrary JS here")' src='bogus'>XSS injection :)`));
console.log(stripHtmlTags(`<script>alert("awdawd");</` + `script>Script tag injection :)`));
console.log(stripHtmlTags(`<strong>I am bold text</strong>`));
console.log(stripHtmlTags(`<html>I'm a HTML tag</html>`));
console.log(stripHtmlTags(`<body>I'm a body tag</body>`));
console.log(stripHtmlTags(`<head>I'm a head tag</head>`));
console.log(stripHtmlTags(null));

I just needed to strip out the <a> tags and replace them with the text of the link.

This seems to work great.

htmlContent= htmlContent.replace(/<a.*href="(.*?)">/g, '');
htmlContent= htmlContent.replace(/<\/a>/g, '');

The code below allows you to retain some HTML tags while stripping all others:

function strip_tags(input, allowed) {

  allowed = (((allowed || '') + '')
    .toLowerCase()
    .match(/<[a-z][a-z0-9]*>/g) || [])
    .join(''); // making sure the allowed arg is a string containing only tags in lowercase (<a><b><c>)

  var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
      commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;

  return input.replace(commentsAndPhpTags, '')
      .replace(tags, function($0, $1) {
          return allowed.indexOf('<' + $1.toLowerCase() + '>') > -1 ? $0 : '';
      });
}

The accepted answer works fine mostly; however, in IE, if the html string is null you get "null" (instead of ''). Fixed:

function strip(html)
{
   if (html == null) return "";
   var tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

I think the easiest way is to just use regular expressions, as someone mentioned above, although there's no reason to use a bunch of them. Try:

stringWithHTML = stringWithHTML.replace(/<\/?[a-z][a-z0-9]*[^<>]*>/ig, "");
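
One property of this pattern is that it requires a letter right after "<" (or "</"), so a bare "<" in ordinary text is left alone. The sketch below contrasts it with a looser pattern on a made-up input; the variable names are illustrative.

```javascript
const input = "1 < 2 and <b>bold</b> > 0";

// Requires a letter right after "<" (or "</"), so "< 2" is not treated as a tag
const tagLike = /<\/?[a-z][a-z0-9]*[^<>]*>/gi;
const strict = input.replace(tagLike, "");
console.log(strict);

// A looser pattern treats everything from the first "<" to the next ">" as a tag
const anyAngle = /<[^>]*>/g;
const loose = input.replace(anyAngle, "");
console.log(loose);
```

With this input the first call keeps the comparison text, while the second swallows "< 2 and <b>" as if it were a single tag.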

A safer way to strip the HTML with jQuery is to first use jQuery.parseHTML to create a DOM, ignoring any scripts, before letting jQuery build an element and then retrieving only the text.

function stripHtml(unsafe) {
    return $($.parseHTML(unsafe)).text();
}

Can safely strip html from:

<img src="unknown.gif" onerror="console.log('running injections');">

And other exploits.

nJoy!

With jQuery you can simply use

$('#elementID').text()

I created a working regex myself:

str=str.replace(/(<\?[a-z]*(\s[^>]*)?\?(>|$)|<!\[[a-z]*\[|\]\]>|<!DOCTYPE[^>]*?(>|$)|<!--[\s\S]*?(-->|$)|<[a-z?!\/]([a-z0-9_:.])*(\s[^>]*)?(>|$))/gi, ''); 

A simple two-line jQuery snippet to strip the HTML.

 var content = "<p>checking the html source&nbsp;</p><p>&nbsp;" +
   "</p><p>with&nbsp;</p><p>all</p><p>the html&nbsp;</p><p>content</p>";

 var text = $(content).text();//It gets you the plain text
 console.log(text);//check the data in your console

 $("#text_area_id").val(text);//set your content to text area using text_area_id

Using jQuery:

function stripTags(textToEscape) {
    return $('<p></p>').html(textToEscape).text()
}

The input element supports only one-line text:

The text state represents a one line plain text edit control for the element's value.

function stripHtml(str) {
  var tmp = document.createElement('input');
  tmp.value = str;
  return tmp.value;
}

Update: this works as expected

function stripHtml(str) {
  // Remove some tags
  str = str.replace(/<[^>]+>/gim, '');

  // Remove BB code
  str = str.replace(/\[(\w+)[^\]]*](.*?)\[\/\1]/g, '$2 ');

  // Remove html and line breaks
  const div = document.createElement('div');
  div.innerHTML = str;

  const input = document.createElement('input');
  input.value = div.textContent || div.innerText || '';

  return input.value;
}

If you don't want to create a DOM for this (perhaps you're not in a browser context) you could use the striptags npm package.

import striptags from 'striptags'; //ES6 <-- pick one
const striptags = require('striptags'); //ES5 <-- pick one

striptags('<p>An HTML string</p>');

const getTextFromHtml = (t) =>
  t
    ?.split('>')
    ?.map((i) => i.split('<')[0])
    .filter((i) => !i.includes('=') && i.trim())
    .join('');

const test = '<p>This <strong>one</strong> <em>time</em>,</p><br /><blockquote>I went to</blockquote><ul><li>band <a href="https://workingclasshistory.com" rel="noopener noreferrer" target="_blank">camp</a>…</li></ul><p>I edited this as a reviewer just to double check</p>'

getTextFromHtml(test)
  // 'This onetime,I went toband camp…I edited this as a reviewer just to double check'

const strip=(text) =>{
    return (new DOMParser()?.parseFromString(text,"text/html"))
    ?.body?.textContent
}

const value=document.getElementById("idOfEl").value

const cleanText=strip(value)

For escaped characters, this also works using pattern matching:

myString.replace(/((&lt)|(<)(?:.|\n)*?(&gt)|(>))/gm, '');

https://developer.mozilla.org/en-US/docs/Web/API/Element/insertAdjacentHTML

var div = document.getElementsByTagName('div');
for (var i=0; i<div.length; i++) {
    div[i].insertAdjacentHTML('afterend', div[i].innerHTML);
    document.body.removeChild(div[i]);
}

method 1:

function cleanHTML(str){
  return str.replace(/<(?<=<)(.*?)(?=>)>/g, '&lt;$1&gt;');
}

function uncleanHTML(str){
  return str.replace(/&lt;(?<=&lt;)(.*?)(?=&gt;)&gt;/g, '<$1>');
}

method 2:

function cleanHTML(str){
  return str.replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function uncleanHTML(str){
  return str.replace(/&lt;/g, '<').replace(/&gt;/g, '>');
}

Also, don't forget that if the user happens to post a math comment (e.g. 1 < 2), you don't want to strip the whole comment. The browser (only tested in Chrome) doesn't interpret character entities as HTML tags. If you replace every < with &lt; everywhere in the string, the entity will display < as text without running any HTML. I recommend method 2. jQuery also works well: $('#element').text();
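
A minimal sketch of the escaping idea in method 2, written so that it round-trips. The names escapeHtml and unescapeHtml are hypothetical, and escaping "&" as well is an addition that is not in the answer: it keeps entities that already appear in the input from being confused with the ones added here.

```javascript
// "&" is escaped first so existing entities in the input survive a round trip.
function escapeHtml(str) {
  return str.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Reverse order on the way back: "&amp;" is restored last.
function unescapeHtml(str) {
  return str.replace(/&lt;/g, "<").replace(/&gt;/g, ">").replace(/&amp;/g, "&");
}

const comment = "1 < 2 && <b>bold</b>";
const escaped = escapeHtml(comment);

console.log(escaped);                           // safe to display as text
console.log(unescapeHtml(escaped) === comment); // the original comes back unchanged
```
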

var STR='<Your HTML STRING>';
var HTMLParsedText="";
   var resultSet =  STR.split('>')
   var resultSetLength =resultSet.length
   var counter=0
   while(resultSetLength>0)
   {
      if(resultSet[counter].indexOf('<')>0)
      {    
        var value = resultSet[counter];
        value=value.substring(0, resultSet[counter].indexOf('<'))
        if (resultSet[counter].indexOf('&')>=0 && resultSet[counter].indexOf(';')>=0) {
            value=value.replace(value.substring(resultSet[counter].indexOf('&'), resultSet[counter].indexOf(';')+1),'')
        }
      }
        if (value)
        {
          value = value.trim();
          if(HTMLParsedText === "")
          {
              HTMLParsedText = value;
          }
          else
          {
            if (value) {
              HTMLParsedText = HTMLParsedText + "\n" + value;
            }
          }
          value='';
        }
        counter= counter+1;
      resultSetLength=resultSetLength-1;
   }
  console.log(HTMLParsedText);

This package works really well for stripping HTML: https://www.npmjs.com/package/string-strip-html

It works in both the browser and on the server (e.g. Node.js).

As others suggested, I recommend using DOMParser when possible.

However, if you happen to be working inside a Node / JS Lambda, or DOMParser is otherwise not available, I came up with the regex below to match most of the scenarios mentioned in previous answers/comments. It doesn't match &gt; and &lt;, which some others may be concerned about, but it should capture pretty much any other scenario.

const dangerousText = '?';
const htmlTagRegex = /<\/?([a-zA-Z]\s?)*?([a-zA-Z]+?=\s?".*")*?([\s/]*?)>/gi;
const sanitizedText = dangerousText.replace(htmlTagRegex, '');

This might be easy to simplify, but it should work for most situations. Hope it helps someone.

const htmlParser = new DOMParser().parseFromString("<h6>User<p>name</p></h6>", 'text/html');
const textString = htmlParser.body.textContent;
console.log(textString);

You can strip out all the HTML tags with the following regex: /<(.|\n)*?>/g

Example:

let str = "<font class=\"ClsName\">int[0]</font><font class=\"StrLit\">()</font>";
console.log(str.replace(/<(.|\n)*?>/g, ''));

Output:

int[0]()

A very good library is sanitize-html, which is written in pure JavaScript and can help in any environment.

My case was on React Native: I needed to remove all HTML tags from the given texts, so I created this wrapper function:

import sanitizer from 'sanitize-html';

const textSanitizer = (textWithHTML: string): string =>
  sanitizer(textWithHTML, {
    allowedTags: [],
  });

export default textSanitizer;

Now by using my textSanitizer, I can get the pure text contents.

    (function($){
        $.html2text = function(html) {
            // Create the hidden scratch element once, then reuse it
            if($('#scratch_pad').length === 0) {
                $('<div id="scratch_pad"></div>').appendTo('body');  
            }
            return $('#scratch_pad').html(html).text();
        };

    })(jQuery);

Define this as a jQuery plugin and use it as follows:

$.html2text(htmlContent);

function strip_html_tags(str)
{
   if ((str===null) || (str===''))
       return false;
  else
   str = str.toString();
  return str.replace(/<[^>]*>/g, '');
}
