Regex to match certain sentence pattern with Python

I'm trying to find out whether a sentence that follows a particular pattern contains an abbreviated word like R.E.M. or CEO. The abbreviations I am looking for are words made of capital letters, either punctuated with periods (like R.E.M.) or written in all caps (like CEO).

#sentence pattern = 'What is/was a/an(optional) word(abbreviated or not) ?
sentence1 = 'What is a CEO'
sentence2 = 'What is a geisha?'
sentence3 = 'What is ``R.E.M.``?'

This is what I have, but it's not returning anything at all. It doesn't recognise the pattern, and I can't figure out what is wrong with the regex.

c5 = re.compile("^[w|W]hat (is|are|was|were|\'s)( a| an| the)*( \`\`)*( [A-Z\.]+\s)*( \'\')* \?$")
if c5.match(question):
    return "True."

EDIT: I want to check whether a sentence matching the pattern above contains an abbreviated word.

You've got a few issues. It's not really clear from your examples what sort of quoting might be expected, or whether you want to match sentences that don't end in a question mark. Your regex uses * (zero or more of the previous) where I think you can use ? (zero or one of the previous). You will also miss sentences with What's, even though I think you want those, because you're looking for What 's instead.
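To make these points concrete, here is a small sketch that runs the pattern from the question against its own sample sentences. The pattern is copied as a raw string, and the variable names are only illustrative.

```python
import re

# The pattern from the question, written as a raw string so the backslashes
# reach the regex engine unchanged.
original = re.compile(
    r"^[w|W]hat (is|are|was|were|\'s)( a| an| the)*( \`\`)*( [A-Z\.]+\s)*( \'\')* \?$"
)

# None of the sample sentences match ...
samples = ["What is a CEO", "What is a geisha?", "What is ``R.E.M.``?"]
original_results = [bool(original.match(s)) for s in samples]

# ... because the pattern wants a space after the abbreviation and another
# space before a mandatory question mark.
odd_but_matching = bool(original.match("What is a CEO  ?"))

# * repeats the group any number of times; ? allows it at most once.
star = bool(re.fullmatch(r"(?: a| an| the)*", " a the"))
optional = bool(re.fullmatch(r"(?: a| an| the)?", " a the"))

# "hat " followed by "'s" only matches "What 's", never "What's".
spaced = bool(re.match(r"^[wW]hat (?:is|'s)", "What's"))
unspaced = bool(re.match(r"^[wW]hat(?: is|'s)", "What's"))

print(original_results, odd_but_matching, star, optional, spaced, unspaced)
# [False, False, False] True True False False True
```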

Here's a possible solution:

import re

sentence1 = "What is a CEO"
sentence2 = "What is a geisha?"
sentence3 = "What is ``R.E.M.``?"
sentence4 = "What's SCUBA?"

# Optional article, up to two quote characters on each side of the
# abbreviation, then an optional space and an optional question mark.
# The trailing $ rejects capitalised words such as "Geisha".
c1 = re.compile(r"^[wW]hat(?: is| are| was| were|\'s)(?: a| an| the)? [`']{0,2}((?:[A-Z]\.)+|[A-Z]+)[`']{0,2} ?\??$")

def test(question, regex):
    if regex.match(question):
        return "Matched!"
    else:
        return "Nope!"

print(test(sentence1, c1))  # Matched!
print(test(sentence2, c1))  # Nope!
print(test(sentence3, c1))  # Matched!
print(test(sentence4, c1))  # Matched!

But it could probably be tweaked further depending on, for example, whether you expect the abbreviation to be double-quoted.

The position of the spaces before and after your abbreviation check is off.

You might also want to check your quote handling. Perhaps it's just an artefact of posting your code here, but there seems to be some confusion between your ' and ` characters. Try

['`"]*

instead, for both the opening and the closing quotes.
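As a rough illustration of that suggestion, the sketch below wraps the quote class around a simple abbreviation pattern. The name quoted_abbrev and the abbreviation sub-pattern are assumptions made for the example, not part of the original code.

```python
import re

# Hypothetical pattern: any mix of ', ` and " on either side of a dotted
# or all-caps abbreviation. A raw triple-quoted string lets all three
# quote characters appear without escaping.
quoted_abbrev = re.compile(r"""['`"]*((?:[A-Z]\.)+|[A-Z]+)['`"]*""")

for text in ["``R.E.M.``", "``R.E.M.''", '"CEO"', "SCUBA", "geisha"]:
    m = quoted_abbrev.fullmatch(text)
    print(text, "->", m.group(1) if m else None)
```

Because the class accepts any of the three characters on each side, mismatched pairs such as ``R.E.M.'' are accepted as well.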

You can try this pattern:

c5 = re.compile(r"^[wW]hat (?:is|are|w(?:as|ere)|'s)(?: (?:an?|the))? ([`'\"]*)((?:[A-Z]\.)+|[A-Z]+)\1 ?\??$")

Explanations:

I use non-capturing groups (?:..) instead of capturing groups (..), assuming that you don't need to extract what is inside them (except for the abbreviation).

[w|W] is replaced by [wW], since | inside a character class is treated as a literal character.

To make the different quotes optional around the abbreviation, I use a capture group before it (which can be empty): ([`'\"]*), and a backreference after the abbreviation (i.e. \1).

The abbreviation is described as an alternation between (?:[A-Z]\.)+ (uppercase letters, each followed by a dot) and [A-Z]+ (uppercase letters only).

I make the space between the abbreviation and the question mark optional. The question mark itself is now optional too (thanks to FooBar for pointing these out).
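The points above can be checked with a short sketch. The pattern is the one given in this answer; the helper name abbreviation is hypothetical and only there to show how group 2 is read back.

```python
import re

c5 = re.compile(r"^[wW]hat (?:is|are|w(?:as|ere)|'s)(?: (?:an?|the))? ([`'\"]*)((?:[A-Z]\.)+|[A-Z]+)\1 ?\??$")

def abbreviation(question):
    """Return the abbreviation captured by group 2, or None if no match."""
    m = c5.match(question)
    return m.group(2) if m else None

print(abbreviation("What is a CEO"))         # CEO
print(abbreviation("What is ``R.E.M.``?"))   # R.E.M.
print(abbreviation("What is a geisha?"))     # None
# The backreference \1 requires the closing quotes to repeat the opening ones.
print(abbreviation("What is ``R.E.M.''?"))   # None
# Inside a character class, | is a literal character.
print(bool(re.match(r"[w|W]hat", "|hat")))   # True
```

Group 1 holds the opening quotes (possibly an empty string), so the abbreviation is always in group 2.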

This should work:

re.compile(r"""^[wW]hat (is|are|was|were) ((a|an|the) )*(['"`]*)([A-Z\.]*)(['"`]*)\?$""")

You can make some or all of the groups non-capturing if necessary, or you can make the terminating question mark optional (I noticed it's missing from one of your examples). There are a few tweaks that could be made here and there, but this pretty much does it.
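A minimal sketch of those two tweaks might look like this. The name tweaked is illustrative, and using + instead of * for the abbreviation (so that it cannot be empty) is an extra assumption that goes beyond what this answer states.

```python
import re

# Variant of the pattern above: helper groups are non-capturing, the final
# question mark is optional, and the abbreviation needs at least one
# character (assumption: + instead of *).
tweaked = re.compile(
    r"""^[wW]hat (?:is|are|was|were) (?:(?:a|an|the) )*['"`]*([A-Z.]+)['"`]*\??$"""
)

for question in ["What is a CEO", "What is ``R.E.M.``?", "What is a geisha?"]:
    m = tweaked.match(question)
    print(question, "->", m.group(1) if m else None)
```

With only one capturing group left, the abbreviation is available as group 1.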
