Create Python regex for specific sentence pattern

I'm trying to build a regex pattern that can capture the following examples:

pattern1 = '.She is greatThis is annoyingWhy u do this' 
pattern2 = '.Weirdly specificThis sentence is longer than the other oneSee this is great'

example = 'He went such dare good mr fact. The small own seven saved man age no offer. Suspicion did mrs nor furniture smallness. Scale whole downs often leave not eat. An expression reasonably cultivated indulgence mr he surrounded instrument. Gentleman eat and consisted are pronounce distrusts.This is where the fun startsSummer is really bothersome this yearShe is out of ideas'

example_pattern_goal = 'This is where the fun startsSummer is really bothersome this yearShe is out of ideas'

Essentially, it's always a dot followed by sentences of various lengths that contain no numbers. I only want to capture these specific sentences, so I tried to match a dot immediately followed by a word that starts with an uppercase letter, and then further words, two of which have an uppercase letter inside the word.

So far, I've only come up with the following regex, which doesn't quite work:

'.\b[A-Z]\w+[\s\w]+\b\w+[A-Z]\w+\b[\s\w]+\b\w+[A-Z]\w+\b[\s\w]+'

You can use

\.([A-Z][a-z]*(?:\s+[A-Za-z]+)*\s+[a-zA-Z]+[A-Z][a-z]+(?:\s+[A-Za-z]+)*)

See the regex demo.

Details:

  • \. - a dot
  • [A-Z][a-z]* - an ASCII word starting with an uppercase letter
  • (?:\s+[A-Za-z]+)* - zero or more sequences of one or more whitespaces followed by an ASCII word
  • \s+ - one or more whitespaces
  • [a-zA-Z]+[A-Z][a-z]+ - an ASCII word with an uppercase letter inside it
  • (?:\s+[A-Za-z]+)* - zero or more sequences of one or more whitespaces followed by an ASCII word.
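
As a minimal sketch of how the pattern above might be used in Python, the snippet below compiles the same regex in `re.VERBOSE` mode so that each part carries the comment from the details list. The names `PATTERN` and `extract_run` are hypothetical and chosen only for this illustration. Note that the pattern requires just one word with an inner uppercase letter; any further such words are picked up by the trailing group.

```python
import re

# Same regex as in the answer, spread out with re.VERBOSE so each part is commented.
PATTERN = re.compile(
    r"""
    \.                      # a literal dot
    (                       # group 1: the run of sentences after the dot
      [A-Z][a-z]*           # first word, starting with an uppercase letter
      (?:\s+[A-Za-z]+)*     # zero or more further words
      \s+                   # one or more whitespace characters
      [a-zA-Z]+[A-Z][a-z]+  # a word with an uppercase letter inside it
      (?:\s+[A-Za-z]+)*     # zero or more trailing words
    )
    """,
    re.VERBOSE,
)


def extract_run(text):
    """Return the captured run of sentences, or None if nothing matches.

    Hypothetical helper for this example, not part of the original answer.
    """
    match = PATTERN.search(text)
    return match.group(1) if match else None


pattern1 = '.She is greatThis is annoyingWhy u do this'
pattern2 = '.Weirdly specificThis sentence is longer than the other oneSee this is great'

print(extract_run(pattern1))
print(extract_run(pattern2))
```

`re.search` is used rather than `re.match` because the dot of interest can appear anywhere in the text, as in the longer `example` string from the question. A dot followed by a space is skipped, since the pattern requires an uppercase letter immediately after the dot.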
