Regex to match strings in quotes that contain only 3 or fewer capitalized words

I've searched and searched, but can't find any relief for my regex woes.

I wrote the following dummy sentence:

Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".

I'm looking for a regular expression that will return the following when used in Python 3.6:

Canelo, Money, Money Mayweather, Three Word String

The regex that has gotten me the closest is:

(["'])[A-Z](\\?.)*?\1

I want it to match only strings of three or fewer capitalized words that are immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length or the content, as long as it begins with a capital letter.

I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
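A minimal sketch of the over-matching described above, run on a shortened sample rather than the full dummy sentence (the sample string and variable names are assumptions made for illustration). After the leading capital, the `(\\?.)*?` part accepts any characters up to the closing quote, so word count and capitalization are never checked:

```python
import re

# The pattern from the question, unchanged.
ASKER_PATTERN = re.compile(r'''(["'])[A-Z](\\?.)*?\1''')

# Shortened sample (an assumption for brevity, not the full dummy sentence).
sample = '"Here Goes a String", "this is not A string", "Three Word String"'

# group(0) is the whole match, including the surrounding quotes.
over_matched = [m.group(0) for m in ASKER_PATTERN.finditer(sample)]
print(over_matched)
# ['"Here Goes a String"', '"Three Word String"']
```

The four-word string with a lowercase "a" is accepted because only its first character is constrained.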

Try this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1

(["']) - opening quote

(?:[A-Z][a-z]+ ?){1,3} - capitalized word, repeated 1 to 3 times, separated by spaces

[A-Z] - capital letter (first character of the word)

[a-z]+ - lowercase letters (rest of the word)

_? - optional space separating the capitalized words (_ stands for a space); the ? makes it optional so that a single word with no trailing space also matches

{1,3} - 1 to 3 times

\1 - closing quote, same as the opening one

Group 2 is what you want.

Match 1
Full match  29-37   `"Canelo"`
Group 1.    29-30   `"`
Group 2.    30-36   `Canelo`
Match 2
Full match  146-153 `'Money'`
Group 1.    146-147 `'`
Group 2.    147-152 `Money`
Match 3
Full match  318-336 `'Money Mayweather'`
Group 1.    318-319 `'`
Group 2.    319-335 `Money Mayweather`
Match 4
Full match  398-417 `"Three Word String"`
Group 1.    398-399 `"`
Group 2.    399-416 `Three Word String`

RegEx101 Demo: https://regex101.com/r/VMuVae/4
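For use in Python, the pattern can be applied with re.finditer, reading group 2 from each match. This is a sketch only; the variable names are assumptions, and the text is the dummy sentence from the question:

```python
import re

# Dummy sentence from the question, split across lines for readability.
TEXT = (
    'Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya '
    'and Genaddy Triple-G Golovkin for the WBO belt GGG. '
    "Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, "
    'New Jersey. Conor MacGregor will be there along with Adonis Superman '
    'Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". '
    "'Money Mayweather'. "
    '"this is not a-string", "this is not A string", "This IS a" '
    '"Three Word String".'
)

# Group 1: opening quote. Group 2: one to three capitalized words.
# \1 requires the closing quote to be the same character as the opening one.
PATTERN = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

# finditer is used instead of findall so that group 2 can be picked directly.
names = [m.group(2) for m in PATTERN.finditer(TEXT)]
print(names)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
```

Because [a-z]+ needs at least one lowercase letter, all-caps tokens such as IS are rejected, which is why "This IS a" does not match.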

Working with the text you've provided, I would use regular expression lookarounds to get the words surrounded by quotes, and then apply some conditions to those matches to determine which ones meet your criteria. The following is what I would do:

[p for p in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]

txt is the text you've provided here. The output is the following:

# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']

Cleaner:

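import re

# Candidate phrases: runs of word characters and spaces between two quotes.
pattern = r'(?<=[\'"])[\w ]{2,}(?=[\'"])'

matches = []
for m in re.findall(pattern, txt):
    words = m.split(' ')
    # Keep phrases of at most 3 words in which every word is title-cased.
    if len(words) <= 3 and all(w.istitle() for w in words):
        matches.append(m)

print(matches)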

# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
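The filtering condition can be looked at in isolation. The helper below is hypothetical (its name and the max_words parameter are not part of the answer); it only restates the istitle() and word-count checks so that their behaviour on individual phrases is easy to inspect:

```python
def is_short_title_phrase(phrase, max_words=3):
    """Return True if phrase has at most max_words words, all title-cased.

    Hypothetical helper restating the filter used in the answer above.
    """
    words = phrase.split(' ')
    return len(words) <= max_words and all(w.istitle() for w in words)


print(is_short_title_phrase('Three Word String'))   # True
print(is_short_title_phrase('Here Goes a String'))  # False: too long, and 'a' is lowercase
print(is_short_title_phrase('This IS a'))           # False: 'IS' and 'a' are not title-cased
print(is_short_title_phrase('One Two Three Four'))  # False: four words
```

Note that str.istitle() treats a single capital letter such as 'A' as title-cased, so this check is slightly more permissive than a regex that requires [A-Z][a-z]+ for each word.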

Here is the one that works for me: ([\"'])(([A-Z][^ ]*? ?){1,3})\1
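A short sketch of how this variant behaves, under the assumption that the intended pattern is the one shown above. Because [^ ]*? accepts any non-space characters after the capital letter, it is looser than a pattern that allows only lowercase letters there. The sample string and names below are made up for illustration:

```python
import re

# Variant from this answer: any non-space characters may follow the capital.
LOOSE = re.compile(r'''([\"'])(([A-Z][^ ]*? ?){1,3})\1''')

# Stricter form for comparison: only lowercase letters may follow the capital.
STRICT = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

# Illustrative sample (an assumption, not the dummy sentence from the question).
sample = '"Canelo" "GGG" "Triple-G" "Here Goes a String"'

loose_hits = [m.group(2) for m in LOOSE.finditer(sample)]
strict_hits = [m.group(2) for m in STRICT.finditer(sample)]

print(loose_hits)   # ['Canelo', 'GGG', 'Triple-G']
print(strict_hits)  # ['Canelo']
```

Which behaviour is preferable depends on whether quoted acronyms and hyphenated nicknames should count as capitalized words.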
