Split a long string into chunks using REGEX 'Lookbehind' as optional

I'm working on a regex that lets me split a long text, which may contain #variables#, into chunks. The rules for the splitting are:

  1. Split at each #photo# or #childphoto# variable, looking behind or ahead for text so as not to cut a sentence.
  2. Each chunk should contain at most one #photo# or #childphoto# variable (it may contain none).
  3. Each chunk should be less than 350 characters long.
  4. A chunk should not cut words or sentences.
  5. A chunk should not cut through any other variable that may appear in the text, such as #anyOtherVariables#.

Currently, I have this regex:

/^.*[\S\s]{0,350}[\s\S](?<=(#photo#|#childphoto#)).*/

It currently works with the JavaScript .match() method to extract the chunks of text that contain the variables, using the 'look behind' approach, but it does not return the other chunks, which do not satisfy the 'look behind' condition. Is there a way to include the other parts?

Here are the regex and the test case: https://regex101.com/r/kdKHkQ/1

I would really appreciate any help with this.

Here is a single JavaScript regex that does what you have specified:

^\b(?=([^]*))[^]{0,350}$(?<=(?![^]{1,}\1$)(?:#(photo|childphoto)#)?[^]*?)(?<!(?=\1$)(?:[^]*?#(photo|childphoto)#){2}[^]*?)

Demo on regex101

It enforces the 350-character limit by taking a snapshot (using a lookahead) at the beginning, consuming and capturing up to 350 characters, and then using a lookbehind that looks no further back than the snapshotted beginning to assert that one of the variables in question is inside the just-captured string. Then it uses a negative lookbehind to enforce that the just-captured string does not contain two or more of the variables in question.
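To make the two constraints easier to see in isolation, here is a much simpler illustration. It is not the regex above: it only validates a string that is already a candidate chunk, rather than finding chunks inside a longer text, and it uses a negative lookahead instead of lookbehinds. The name `atMostOnePhoto` is made up for this sketch.

```javascript
// Simplified illustration only (hypothetical name, not the answer's regex).
// No "m" flag, so ^ and $ anchor the whole string; [^] matches any
// character, including newlines.
const atMostOnePhoto =
  /^(?![^]*#(?:photo|childphoto)#[^]*#(?:photo|childphoto)#)[^]{0,350}$/;

// The negative lookahead rejects strings containing two or more of the
// variables; [^]{0,350}$ rejects strings longer than 350 characters.
console.log(atMostOnePhoto.test("Look at #photo# here."));       // true
console.log(atMostOnePhoto.test("#photo# then #childphoto#"));   // false
console.log(atMostOnePhoto.test("x".repeat(351)));               // false
```

Like the regex above, this allows up to 350 characters inclusive; change the quantifier to {0,349} if "less than 350" is meant strictly.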

I did not understand your rule "The chunk should not have to cut any of the possible text variables into the text #anyOtherVariables#". If by that you mean that lines containing variables other than #photo# or #childphoto# should be skipped over (not matched), then this regex does not do that, but it could easily be modified to do so.

Now, practically speaking, it would probably be better to implement this in code, or in a combination of code and regex, but this demonstrates that exactly what you asked for is possible with a pure regex.
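For what the code-plus-regex route might look like, here is a minimal sketch. It rests on assumptions that are not in the question: sentences end with ".", "!" or "?" followed by whitespace, variables contain no spaces, and it is acceptable to normalise the whitespace between pieces to a single space. All names (`chunkText`, `countPhotos`, `splitSentences`) are hypothetical.

```javascript
const PHOTO = /#(?:photo|childphoto)#/g;
const LIMIT = 350; // chunks must stay under this many characters

// Number of #photo# / #childphoto# variables in a string.
const countPhotos = (s) => (s.match(PHOTO) || []).length;

// Assumption: a sentence ends with . ! or ? followed by whitespace.
const splitSentences = (text) =>
  text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);

function chunkText(text, limit = LIMIT) {
  const chunks = [];
  let current = "";

  // Append a piece to the current chunk, or start a new chunk if appending
  // would reach the limit or put two photo variables in the same chunk.
  const push = (piece) => {
    const candidate = current ? current + " " + piece : piece;
    if (current && (candidate.length >= limit || countPhotos(candidate) > 1)) {
      chunks.push(current);
      current = piece;
    } else {
      current = candidate;
    }
  };

  for (const sentence of splitSentences(text.trim())) {
    if (sentence.length < limit && countPhotos(sentence) <= 1) {
      push(sentence); // whole sentences are kept together
    } else {
      // Fallback: the sentence alone breaks a rule, so pack it word by word.
      // Splitting on whitespace keeps space-free #variables# intact.
      for (const word of sentence.split(/\s+/)) push(word);
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const sample =
  "Intro sentence. Look at #photo# here. Then see #childphoto# too. Bye #name#.";
console.log(chunkText(sample));
// [ 'Intro sentence. Look at #photo# here.',
//   'Then see #childphoto# too. Bye #name#.' ]
```

The greedy packing means a sentence is only cut when it cannot satisfy the rules on its own (it is too long, or it contains two of the photo variables). A single word of 350 characters or more would still end up as an oversized chunk, since this sketch never cuts inside a word.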

I would like to point out that calling this "splitting by each #photo# or #childphoto# variable" is disingenuous, and if I actually took that literally, it would break your other rule, that the chunk should not cut sentences. That is probably why you got downvoted.

I'm posting my answer here, despite the fact that you got downvoted, because I already answered this on reddit and you disappeared without commenting.
