
Regex to remove spaces between invalid HTML tags - i.e. "< / b >" should be "</b>"

I have some HTML that is mangled by spaces inside the tags, and I want to make it valid again. For example:

< div class='test' >1 > 0 is < b >true</ b> and apples >>> bananas< / div >

This should be converted to valid HTML which, when rendered, would produce the expected result:

 <div class='test'>1 > 0 is <b>true</b> and apples >>> bananas</div>

Any > or < preceded or followed by spaces in the text should be left unchanged - for example, 1 > 0 should remain as it is, rather than being squashed to 1>0.

I realize this will probably take a couple of regular expressions, which is fine.

Here is what I have so far:

<\s?\/\s* will partially fix </ b>< / div > to </b></div >, but I am struggling with the rest.
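To see what that pattern does on its own, here is a minimal sketch in JavaScript. The variable names are illustrative, and replacing each match with `</` is an assumption about how the pattern is meant to be used:

```javascript
// The pattern from the question: "<", at most one whitespace character, "/", then any whitespace.
const closingStart = /<\s?\/\s*/g;

const sample = "</ b>< / div >";
// Assumed usage: collapse each match to a clean "</".
const partiallyFixed = sample.replace(closingStart, "</");

console.log(partiallyFixed); // </b></div >
```

The space before the final `>` survives, and opening tags such as `< b >` are not touched at all, which is the part still to be solved.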

For example, I could go with a heavy-handed approach, but that would also break code within the text parts of the tags, rather than fixing only the tag names themselves.

There's no reasonable way to save a document as corrupt as what you've posted, but assuming you replace the > and similar characters in the text with their relevant entities, e.g. &gt;, you can massage the document enough for it to be accepted by a proper library like DomDocument, which will handle the rest.

$input = <<<_E_
< div class='test' >1 &gt; 0 is < b >true</ b> and apples &gt;&gt;&gt; bananas< / div >
_E_;

// Remove whitespace right after "<" and right after "</" so the tag name follows immediately.
$input = preg_replace([ '#<\s+#', '#</\s+#' ], [ '<', '</' ], $input);

$d = new DomDocument();
$d->loadHTML($input, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

var_dump($d->saveHTML());

Output:

string(80) "<div class="test">1 &gt; 0 is <b>true</b> and apples &gt;&gt;&gt; bananas</div>"
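The `preg_replace` step can be sketched in JavaScript as well, to show what the markup looks like just before it is handed to the parser. This is a rough equivalent with illustrative variable names, not part of the original answer, and it assumes the `>` characters in the text have already been escaped as `&gt;`:

```javascript
const input =
  "< div class='test' >1 &gt; 0 is < b >true</ b> and apples &gt;&gt;&gt; bananas< / div >";

const tightened = input
  .replace(/<\s+/g, "<")     // "< div" -> "<div", "< / div" -> "</ div"
  .replace(/<\/\s+/g, "</"); // "</ b" -> "</b", "</ div" -> "</div"

console.log(tightened);
// <div class='test' >1 &gt; 0 is <b >true</b> and apples &gt;&gt;&gt; bananas</div >
```

The spaces left before each `>` are not removed by these two patterns; in the PHP version, DomDocument takes care of them when it re-serializes the document.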

This regex works too:

It captures the valid parts of an HTML tag in four capture groups and replaces the whole match, stray spaces included, with just those groups.

Regex101 Demo

/(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g

  • (<) - capture the opening angle bracket (group 1)
  • \s* - match any spaces
  • (\/?) - capture an optional forward slash (group 2)
  • \s* - match any spaces after the forward slash
  • ([^<>]*\S) - capture the content inside the tag without the trailing spaces (group 3)
  • \s* - match spaces after the content and before the closing angle bracket
  • (>) - capture the closing angle bracket (group 4)

 const reg = /(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g;
 const str = "< div class='test' >1 > 0 is < b >true< / b > and apples >>> bananas< / div >";
 const newStr = str.replace(reg, "$1$2$3$4");
 console.log(newStr);
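As a quick check, the same replacement can be wrapped in a small function and run against the question's input, and against text that happens to contain a bare `<` followed later by a `>`. The function name `tidyTags` and the second sample string are made up for illustration:

```javascript
// Hypothetical helper wrapping the regex from this answer.
function tidyTags(html) {
  return html.replace(/(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g, "$1$2$3$4");
}

const fixed = tidyTags(
  "< div class='test' >1 > 0 is < b >true</ b> and apples >>> bananas< / div >"
);
console.log(fixed);
// <div class='test'>1 > 0 is <b>true</b> and apples >>> bananas</div>

// Caveat: a "<" in the text that is later followed by ">" is treated as a tag.
const mangled = tidyTags("if a < b and c > d");
console.log(mangled);
// if a <b and c> d
```

So the regex handles the sample from the question, but it cannot tell a comparison like `a < b ... c > d` apart from a spaced-out tag.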

You can use a couple of .replace() calls with a RegEx and a custom replace callback:

 let s = `< div class='test' >1 > 0 is < b >true</ b> and apples >>> bananas< / div >`;
 s = s.replace(/<.*?>/g, m =>
   m.replaceAll(' ', '')
     .replace(m.match(/[a-zA-Z]+/)[0], tagName => tagName + ' ')
     .replace(' >', '>')
 );
 console.log(s);

Here's a breakdown of the steps:

  1. s.replace(/<.*?>/g, /* arrow function */)

This runs the arrow function as a custom replacer for every match of < ... >, so the replacement only affects the insides of tags. The arrow function takes one parameter, m, which is the matched text, and returns the text to replace it with.

  2. m.replaceAll(' ', '')

Removes all spaces in the matched string. This also removes the space between the tag name and the attributes, so we need step 3.

  3. .replace(m.match(/[a-zA-Z]+/)[0], tagName => tagName + ' ')

This takes the result of step 2 and adds a space after the tag name. m.match(/[a-zA-Z]+/)[0] is the tag name, because m still contains the original text from before step 2.

  4. .replace(' >', '>')

This handles the last edge case, where there were no attributes or the tag was a closing tag, so step 3 added an unnecessary space.
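The steps can be traced on a single tag with a small helper. This is a sketch only: `traceTag` is a hypothetical name, and it assumes the tag contains at least one ASCII letter (otherwise `m.match(...)` returns `null` and the index access throws).

```javascript
// Hypothetical helper: applies steps 2-4 to one matched tag and keeps each stage.
function traceTag(m) {
  const noSpaces = m.replaceAll(" ", "");                   // step 2
  const tagName = m.match(/[a-zA-Z]+/)[0];                  // first run of letters in the original
  const spaced = noSpaces.replace(tagName, (t) => t + " "); // step 3
  const result = spaced.replace(" >", ">");                 // step 4
  return { noSpaces, spaced, result };
}

const closing = traceTag("< / b >");
console.log(closing.noSpaces); // </b>
console.log(closing.spaced);   // </b >
console.log(closing.result);   // </b>

const opening = traceTag("< div class='test' >");
console.log(opening.noSpaces); // <divclass='test'>
console.log(opening.result);   // <div class='test'>

// Caveat: spaces inside attribute values and between attributes are lost too.
const lossy = traceTag("< a title='hello world' id='x' >");
console.log(lossy.result);     // <a title='helloworld'id='x'>
```

The last call shows the trade-off of removing every space in step 2: only the one space after the tag name is restored.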
