Regular expression to match content until multi-character string

I've got defective input coming in that looks like this:

foo<p>bar</p>

And I want to normalize it by wrapping the leading text in a p tag:

<p>foo</p><p>bar</p>

This is easy enough with a regex replace of /^([^<]+)/ with <p>$1</p>. The problem is that sometimes the leading chunk contains tags other than p, like so:

foo <b>bold</b><p>bar</p>

This should wrap the whole chunk in a new p:

<p>foo <b>bold</b></p><p>bar</p>

But since the simple regex only looks for <, it stops at <b> and spits out:

<p>foo </p><b>bold</b><p>bar</p> <!-- oops -->
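
The failure described above can be reproduced with a short sketch. The variable names naive and wrap are only for illustration and do not come from the question:

```javascript
// The question's first attempt: capture leading characters up to the first "<".
var naive = /^([^<]+)/;
var wrap = '<p>$1</p>';

// Works when the leading chunk is plain text.
var plain = 'foo<p>bar</p>'.replace(naive, wrap);
console.log(plain); // "<p>foo</p><p>bar</p>"

// Breaks when the leading chunk contains another tag:
// [^<]+ cannot cross the "<" of <b>, so only "foo " is wrapped.
var tagged = 'foo <b>bold</b><p>bar</p>'.replace(naive, wrap);
console.log(tagged); // "<p>foo </p><b>bold</b><p>bar</p>"
```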

So how do I rewrite the regex to match everything up to <p? Apparently the answer involves negative lookahead, but this is a bit too deep for me.

(And before the inevitable "you can't parse HTML with regexes" comment: the input is not random HTML, but plain text annotated with only the tags <p>, <a>, <b> and <i>, and a/b/i may not be nested.)

I think you actually want positive lookahead. It's really not bad:

/^([^<]+)(?=<p)/

You just want to make sure that whatever comes after < is p, but you don't want to actually consume <p, so you use a lookahead.

Examples:

> var re = /^([^<]+)(?=<p)/g;

> 'foo<p>bar</p>'.replace(re, '<p>$1</p>');
  "<p>foo</p><p>bar</p>"

> 'foo <b>bold</b><p>bar</p>'.replace(re, '<p>$1</p>')
  "foo <b>bold</b><p>bar</p>"
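
To see why the second input comes back unchanged, it may help to look at the raw match instead of the replacement. This is a minimal sketch using the same pattern (without the g flag, which makes no difference for a ^-anchored pattern); the variable names are only for illustration:

```javascript
var re = /^([^<]+)(?=<p)/;

// The lookahead checks that "<p" follows but does not include it in the match.
var m = re.exec('foo<p>bar</p>');
console.log(m[0]); // "foo"

// Here [^<]+ can only reach "foo ", and what follows is "<b", not "<p".
// Backtracking to shorter prefixes does not help, so there is no match at all.
var none = re.exec('foo <b>bold</b><p>bar</p>');
console.log(none); // null
```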

Sorry, I wasn't clear enough in my original posting: my expectation was that the "foo bold" bit would also get wrapped in a new p tag, and that's not happening.

Also, every now and then there's input with no p tags at all (just plain foo), and that should also map to <p>foo</p>.

The easiest way I found to get this working is to use 2 separate regexps, /^(.+?(?=<p))/ and /^([^<]+)/.

> var re1 = /^(.+?(?=<p))/g,
      re2 = /^([^<]+)/g,
      s = '<p>$1</p>';

> 'foo<p>bar</p>'.replace(re1, s).replace(re2, s);
  "<p>foo</p><p>bar</p>"

> 'foo'.replace(re1, s).replace(re2, s);
  "<p>foo</p>"

> 'foo <b>bold</b><p>bar</p>'.replace(re1, s).replace(re2, s);
  "<p>foo <b>bold</b></p><p>bar</p>"

It's possible to write a single, equivalent regexp by combining re1 and re2:
/^(.+?(?=<p)|[^<]+)/

> var re3 = /^(.+?(?=<p)|[^<]+)/g,
      s = '<p>$1</p>';

> 'foo<p>bar</p>'.replace(re3, s)
  "<p>foo</p><p>bar</p>"

> 'foo'.replace(re3, s)
  "<p>foo</p>"

> 'foo <b>bold</b><p>bar</p>'.replace(re3, s)
  "<p>foo <b>bold</b></p><p>bar</p>"
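
One case the examples above don't cover is input that already starts with a p tag and contains a second one: the lazy .+? would then swallow the first paragraph and wrap it again. Below is a hedged sketch of one way to guard against that, assuming (as the question states) that only <p>, <a>, <b> and <i> can appear. The function name wrapLeadingText and the added (?!<p) guard are illustrative additions, not part of the original answer:

```javascript
// Hypothetical helper: wraps leading non-<p> content in a new <p> tag.
function wrapLeadingText(input) {
  // (?!<p)      : assumption added here; skip input that already starts with <p
  // .+?(?=<p)   : lazily take everything up to the first "<p"
  // [^<]+       : fallback when there is no "<p" at all
  var re = /^(?!<p)(.+?(?=<p)|[^<]+)/;
  return input.replace(re, '<p>$1</p>');
}

console.log(wrapLeadingText('foo <b>bold</b><p>bar</p>'));
// "<p>foo <b>bold</b></p><p>bar</p>"
console.log(wrapLeadingText('foo'));
// "<p>foo</p>"
console.log(wrapLeadingText('<p>a</p><p>b</p>'));
// "<p>a</p><p>b</p>" (left alone)
```

Note that . does not match line breaks, so leading text that spans several lines would need a different character class; that case is not handled here.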
