Match smallest possible sentence

Text:

One sentence here, much wow. Another one here. This is O.N.E. example n. 1, a nice one to understand. Hope it's clear now!

Regex: (?<=\.\s)[A-Z].+?nice one.+?\.(?=\s[A-Z])

Result: Another one here. This is O.N.E. example n. 1, a nice one to understand.

How can I obtain "This is O.N.E. example n. 1, a nice one to understand." instead (i.e. the smallest possible sentence that matches the regex)?

Just insert a greedy .* in front of the expression and wrap the rest in a capture group; the sentence is then in group 1:

.*\.\s([A-Z].+?nice one.+?\.(?=\s[A-Z]))
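A minimal Python sketch of how this pattern could be applied to the text from the question. The variable names (`text`, `pattern`, `shortest`) are only illustrative. The greedy `.*` consumes as much as possible and then backtracks, so the captured group starts at the last sentence start from which the rest of the pattern can still match.

```python
import re

# Sample text taken from the question.
text = ("One sentence here, much wow. Another one here. "
        "This is O.N.E. example n. 1, a nice one to understand. "
        "Hope it's clear now!")

# Greedy prefix, then the original expression inside capture group 1.
pattern = re.compile(r'.*\.\s([A-Z].+?nice one.+?\.(?=\s[A-Z]))')

match = pattern.search(text)
# Group 1 holds the sentence; None if nothing matched.
shortest = match.group(1) if match else None
print(shortest)
```

Note that `.` does not match newlines by default, so for multi-line input the `re.DOTALL` flag would probably be needed.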

Here is a slightly different approach: split the entire text into sentences, then filter for the one you are after:

import re
s = "One sentence here, much wow. Another one here. This is O.N.E. example n. 1, a nice one to understand. Hope it's clear now!"
result = [x for x in re.split(r'(?<=\B.\.)\s*',s) if 'nice one' in x][0]
print(result) # This is O.N.E. example n. 1, a nice one to understand.

I am not sure how many edge cases you have, but here I used re.split() with the pattern (?<=\B.\.)\s*, which means:

  • (?<=\B.\.) - A positive lookbehind asserting that the current position is preceded by a non-word-boundary (\B), then any single character, then a literal dot. In this text, that skips the dots in O.N.E. and n., where the character before the dot has a word boundary right in front of it.
  • \s* - 0+ whitespace characters.

With the resulting list, it is easy to check which element contains the desired words "nice one".
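The split-and-filter idea can also be wrapped in a small helper that returns every matching sentence instead of indexing `[0]`, which raises `IndexError` when nothing matches. This is only a sketch: `SENTENCE_BREAK` and `sentences_containing` are hypothetical names, and it assumes Python 3.7 or later, where `re.split()` accepts patterns that can match an empty string.

```python
import re

# Same split pattern as above: break after a dot that ends a word
# of two or more characters, swallowing any following whitespace.
SENTENCE_BREAK = re.compile(r'(?<=\B.\.)\s*')

def sentences_containing(text, phrase):
    """Return all split pieces of `text` that contain `phrase`."""
    return [part for part in SENTENCE_BREAK.split(text) if phrase in part]

s = ("One sentence here, much wow. Another one here. "
     "This is O.N.E. example n. 1, a nice one to understand. "
     "Hope it's clear now!")

hits = sentences_containing(s, 'nice one')
print(hits[0] if hits else 'no match')
```

Returning a list leaves it to the caller to decide what to do when zero or several sentences contain the phrase.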

See an online demo

You could exclude matching a dot in general, and only match a dot when it follows an uppercase char, or when it is followed by a whitespace char and a digit.

(?:(?<=\.\s)|^)[A-Z][^.A-Z]*(?:(?:[A-Z]\.|\.\s\d)[^.A-Z]*)*\bnice one\b.+?(?=\s[A-Z])
  • (?:(?<=\.\s)|^) Assert a . and a whitespace char to the left, or the start of the string
  • [A-Z][^.A-Z]* Match an uppercase char A-Z and 0+ times any char except a dot or uppercase char
  • (?: Non-capture group
    • (?:[A-Z]\.|\.\s\d) Match either A-Z and . or match . followed by a whitespace char and a digit
    • [^.A-Z]* Optionally match any char except a . or uppercase char
  • )* Close group and optionally repeat
  • \bnice one\b.+?(?=\s[A-Z]) Match nice one and match until asserting a whitespace char and uppercase char to the right
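As a quick check, the pattern can be run against the text from the question in Python. This is a sketch with illustrative variable names; here the whole match (group 0) is the sentence, so no capture group is needed.

```python
import re

# Sample text taken from the question.
text = ("One sentence here, much wow. Another one here. "
        "This is O.N.E. example n. 1, a nice one to understand. "
        "Hope it's clear now!")

pattern = re.compile(
    r'(?:(?<=\.\s)|^)'                        # sentence start
    r'[A-Z][^.A-Z]*'                          # uppercase char, then no dots/uppercase
    r'(?:(?:[A-Z]\.|\.\s\d)[^.A-Z]*)*'        # allowed dots: "X." or ". <digit>"
    r'\bnice one\b.+?(?=\s[A-Z])'             # the phrase, up to the next sentence
)

match = pattern.search(text)
sentence = match.group(0) if match else None
print(sentence)
```

Because the final lookahead requires a whitespace char and an uppercase char after the match, a sentence at the very end of the text would not be matched by this pattern as written.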

Regex demo
