Regular expression to extract not more than ten words between double quotes

Could anyone please guide me in writing a regex that finds quoted strings containing at most ten words?

import re
string = "\"Michael Jackson is a great singer\". There were many rumours about his relationship with his girlfriend.  \"He won many national awards and one of the most famous pop singer in the late 80s and 90s\""
re.findall(r'"(.*)"', string)

The above regex extracts both quoted strings, but I want to extract only the quoted string that has no more than 10 words.

Try the following regex:

\"(\b\w+\b\s?){,10}\"

Demo on regex101

Explanation:

  • \" matches the opening "
  • (\b\w+\b\s?) matches a word followed by an optional space
  • {,10} is a quantifier that repeats the group at most 10 times, i.e. up to 10 words
  • \" matches the closing "
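
As a quick check, the pattern can be run in Python. This is only a minimal sketch, assuming the sample string from the question; the variable names are made up for illustration. It uses re.finditer and group(0) because re.findall would return only the last repetition of the capturing group rather than the whole quoted phrase.

```python
import re

text = ('"Michael Jackson is a great singer". There were many rumours about '
        'his relationship with his girlfriend.  "He won many national awards '
        'and one of the most famous pop singer in the late 80s and 90s"')

# Same pattern as above; in Python's re, {,10} means "0 to 10 repetitions".
pattern = re.compile(r'"(\b\w+\b\s?){,10}"')

# group(0) is the whole match, including the surrounding quotes.
short_quotes = [m.group(0) for m in pattern.finditer(text)]
print(short_quotes)
# ['"Michael Jackson is a great singer"']
```

The second quoted phrase has 19 words, so the closing quote is never reached within 10 repetitions and it is skipped.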

If your sentences end with a punctuation mark, you can add [\.\?\!] to match it and make it optional:

\"(\b\w+\b\s?){,10}[\.\?\!]?\"

re.findall(r'"[^\s"]+(?:\s+[^\s"]+){,9}"', string)

Explanation:

You want to find up to 10 space-separated words between double quotes. The first and the last " limit this expression to quoted phrases only.

(Not really, as that suggests ".+" would work. But then you get the entire string from the first quote up to the last one, because regex matching is greedy by default. You can use ".+?" to find only the shortest matches, but then you cannot 'count' the words inside.)
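
The difference between the greedy and the lazy form can be seen on a small input. A minimal sketch; the sample text and variable names are invented for illustration:

```python
import re

text = '"one two" filler "three four five"'

# Greedy: .+ runs to the LAST quote, so everything comes back as one match.
greedy = re.findall(r'".+"', text)
print(greedy)
# ['"one two" filler "three four five"']

# Lazy: .+? stops at the nearest closing quote,
# but it places no limit on how many words are inside.
lazy = re.findall(r'".+?"', text)
print(lazy)
# ['"one two"', '"three four five"']
```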

After the first quote, you want to match the first 'entire word', which will necessarily consist of a sequence of non-space characters: \S+ . However, that might eat up the closing double quote if you only have a single word, and then continue past it, so it is necessary to exclude the quote as well:

[^\s"]+

that is, a sequence of one or more characters that are neither whitespace nor a double quote. This will match the first word. Then, zero to 9 sequences of "whitespace followed by a word-like sequence" may follow:

\s+[^\s"]+

matches a single occurrence of these, and

(\s+[^\s"]+){,9}

matches 0 up to 9 occurrences.
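
The pieces above can be put together and checked against a few inputs. This is a minimal sketch: the names WORD, MORE and PATTERN and the sample strings are made up for illustration, and the repeated group is written in its non-capturing form (?:...).

```python
import re

WORD = r'[^\s"]+'                    # one word: no whitespace, no double quote
MORE = r'(?:\s+' + WORD + r'){,9}'   # 0 to 9 further words, each preceded by whitespace
PATTERN = re.compile('"' + WORD + MORE + '"')

samples = [
    '"one"',                        # 1 word
    '"1 2 3 4 5 6 7 8 9 10"',       # 10 words
    '"1 2 3 4 5 6 7 8 9 10 11"',    # 11 words
]

# fullmatch succeeds only if the entire sample fits the pattern.
results = [PATTERN.fullmatch(s) is not None for s in samples]
print(results)
# [True, True, False]
```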

You may not have noticed it, but your own attempt discarded the double quotes at the start and end. That is because you used parentheses in your regex, and findall returns the contents of that group instead of the whole match. To prevent this, I used ?: at the start of the group, which makes it non-capturing. (Without it, you would get just ' singer', the contents of the last repetition of the group that matched!)
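
The effect of ?: on findall can be checked directly. A minimal sketch, using a shortened sample string made up for illustration:

```python
import re

text = '"Michael Jackson is a great singer". Some other text.'

# Capturing group: findall returns only the last repetition of the group.
captured = re.findall(r'"[^\s"]+(\s+[^\s"]+){,9}"', text)
print(captured)
# [' singer']

# Non-capturing group: findall returns the whole match, quotes included.
whole = re.findall(r'"[^\s"]+(?:\s+[^\s"]+){,9}"', text)
print(whole)
# ['"Michael Jackson is a great singer"']
```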

If you don't want the quotes, strip them off later or add a new explicit group around the entire regex:

>>> re.findall(r'"([^\s"]+(?:\s+[^\s"]+){,9})"', string)
['Michael Jackson is a great singer']

By default, regular expressions are greedy, which means that they will try to match as much as possible. You can ask for non-greedy matching by using .*? instead. But that still matches each quoted string in full, whatever its word count.

So what you need is a regular expression that matches one word (without spaces), followed by at most 9 more words, each preceded by whitespace.

All the information required to build this is in the documentation ( https://docs.python.org/2/library/re.html ).

Your code can be written as follows:

string = "\"Michael Jackson is a great singer\". There were many rumours about his relationship with his girlfriend.  \"He won many national awards and one of the most famous pop singer in the late 80s and 90s\""
re.findall(r'"(?:\w* ){0,9}\w*"', string)

"(?:\w* ){0,9} --> matches 0 to 9 words after the opening quote ("); the group is non-capturing so that findall returns the whole match rather than the last repetition

\w*" --> matches the last word before the closing quote (")
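
One caveat worth checking: \w matches only letters, digits and the underscore, so a quoted phrase containing an apostrophe or a comma is skipped by the \w-based pattern, while the [^\s"]+ form accepts it. A minimal sketch, with sample strings and names invented for illustration and the \w group written as non-capturing:

```python
import re

W_PATTERN = r'"(?:\w* ){0,9}\w*"'                 # words made of \w characters only
NS_PATTERN = r'"[^\s"]+(?:\s+[^\s"]+){,9}"'       # words are runs of non-space, non-quote characters

plain = '"plain words only"'
punctuated = '"It\'s a short, simple phrase"'

# Both patterns accept a phrase made of plain words.
plain_w = re.findall(W_PATTERN, plain)
plain_ns = re.findall(NS_PATTERN, plain)

# Only the second pattern accepts a phrase with punctuation inside the words.
punct_w = re.findall(W_PATTERN, punctuated)
punct_ns = re.findall(NS_PATTERN, punctuated)

print(plain_w, plain_ns)
print(punct_w, punct_ns)
```

Which behaviour is preferable depends on what should count as a "word" in your data.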
