Python Regex: findall Regex to match a string of text to tricky specs and place end result in a list of words

I have a string:

sample_input = """
This film is based on Isabel Allende's not-so-much-better novel. I hate Meryl
Streep and Antonio Banderas (in non-Spanish films), and the other actors,
including Winona, my favourite actress and Jeremy Irons try hard to get over
such a terrible script."""

I want to apply a regex to it so that it produces this desired output:

['this', 'film', 'is', 'based', 'on', 'isabel', "allende's", 'not-so', 'much-better', 'novel', 'i', 'hate', 'meryl', 'streep', 'and', 'antonio', 'banderas', 'in', 'non-spanish', 'films', 'and', 'the', 'other', 'actors', 'including', 'winona', 'my', 'favourite', 'actress', 'and', 'jeremy', 'irons', 'try', 'hard', 'to', 'get', 'over', 'such', 'a', 'terrible', 'script']

I want to create a list of words (all lowercase) with the following rules:

  1. A word has to begin and end with a letter or a digit.
  2. A word can contain at most one hyphen (-) or one apostrophe (').
  3. If rule 1 or 2 is violated, a new word starts.

** Please see desired output for details. **

Note that the regex should allow either one hyphen or one apostrophe in a word, but no more than one of these per word.

I tried the following code:

sample_output_regex = re.findall(r'[a-zA-Z0-9]*[-]?|[\']?[a-zA-Z0-9]*', sample_input.lower())

But the output is pretty far off:

['', 'this', '', 'film', '', 'is', '', 'based', '', 'on', '', 'isabel', '', 'allende', '', "'s", '', 'not-', 'so-', 'much-', 'better', '', 'novel', '', '', 'i', '', 'hate', '', 'meryl', '', 'streep', '', 'and', '', 'antonio', '', 'banderas', '', '', 'in', '', 'non-', 'spanish', '', 'films', '', '', '', 'and', '', 'the', '', 'other', '', 'actors', '', '', 'including', '', 'winona', '', '', 'my', '', 'favourite', '', 'actress', '', 'and', '', 'jeremy', '', 'irons', '', 'try', '', 'hard', '', 'to', '', 'get', '', 'over', '', 'such', '', 'a', '', 'terrible', '', 'script', '', '', '']

In an effort to get better at regex, I would like to know where my regex goes wrong. How do I change it to get my desired output? Details would be appreciated. For instance, why are the spaces getting pulled through as '' when my regex doesn't ask to match spaces?

About the pattern:

You get the empty entries because every part of your pattern [a-zA-Z0-9]*[-]?|[\']?[a-zA-Z0-9]* is optional, so the pattern can also match the empty string at any position where no letter, digit, hyphen or apostrophe is found (for example at a space).
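A small sketch of that effect, using a shortened, hypothetical pattern and input (neither comes from the question) and assuming Python 3.7 or later:

```python
import re

# Every part of this pattern is optional, so it can match the empty string.
all_optional = re.compile(r"[a-z]*")

# At the space and at the end of the string only an empty match is possible.
matches = all_optional.findall("ab cd")
print(matches)  # ['ab', '', 'cd', '']
```

Requiring at least one character (+ instead of *) is what removes the empty entries.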

Because of the alternation |, a word such as not-so cannot be a single match: the first alternative [a-zA-Z0-9]*[-]? matches not- and the match ends there, so the part after the - becomes a separate match.
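To see the split in isolation, the pattern from the question can be run on two single words. This is only an illustrative sketch; the variable names are made up here, and the outputs shown assume Python 3.7 or later:

```python
import re

# The pattern from the question, unchanged.
question_pattern = re.compile(r'[a-zA-Z0-9]*[-]?|[\']?[a-zA-Z0-9]*')

# The first alternative consumes "not-" and stops; "so" is a separate match.
hyphenated = question_pattern.findall("not-so")
print(hyphenated)  # ['not-', 'so', '']

# The apostrophe is only reachable through the second alternative,
# which cannot continue the match that already ended after "allende".
possessive = question_pattern.findall("allende's")
print(possessive)  # ['allende', '', "'s", '']
```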


You might use an approach like:

\b[a-zA-Z0-9]+(?:[-'][a-zA-Z0-9]+)?\b

The pattern matches

  • \b A word boundary
  • [a-zA-Z0-9]+ Match 1+ times any of the listed ranges
  • (?: Non-capturing group
    • [-'][a-zA-Z0-9]+ Match a single - or ' followed by 1+ characters from the listed ranges
  • )? Close the group and make it optional
  • \b A word boundary
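The same breakdown can be written into the pattern itself with re.VERBOSE. The sketch below is one possible way to do that; the name WORD and the short test strings are assumptions added for illustration, not part of the question:

```python
import re

# Same pattern as above, spread out with comments via re.VERBOSE.
WORD = re.compile(
    r"""
    \b                      # word boundary
    [a-zA-Z0-9]+            # one or more letters or digits
    (?:                     # optional non-capturing group:
        [-'][a-zA-Z0-9]+    #   one hyphen or apostrophe, then letters or digits
    )?
    \b                      # word boundary
    """,
    re.VERBOSE,
)

print(WORD.findall("not-so-much-better"))       # ['not-so', 'much-better']
print(WORD.findall("Allende's"))                # ["Allende's"]
print(WORD.findall("(in non-Spanish films),"))  # ['in', 'non-Spanish', 'films']

# Caveat: \b treats the underscore as a word character, so a token such as
# foo_bar produces no match at all with this pattern.
print(WORD.findall("foo_bar"))                  # []
```

Because the group is optional and can match only once, a second hyphen or apostrophe always ends the current word and a new one starts after it.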

Then you can convert all the matches to lowercase.

import re

sample_input = """
This film is based on Isabel Allende's not-so-much-better novel. I hate Meryl
Streep and Antonio Banderas (in non-Spanish films), and the other actors,
including Winona, my favourite actress and Jeremy Irons try hard to get over
such a terrible script."""

res = [x.lower() for x in re.findall(r"\b[a-zA-Z0-9]+(?:[-'][a-zA-Z0-9]+)?\b", sample_input)]
print(res)

Output

['this', 'film', 'is', 'based', 'on', 'isabel', "allende's", 'not-so', 'much-better', 'novel', 'i', 'hate', 'meryl', 'streep', 'and', 'antonio', 'banderas', 'in', 'non-spanish', 'films', 'and', 'the', 'other', 'actors', 'including', 'winona', 'my', 'favourite', 'actress', 'and', 'jeremy', 'irons', 'try', 'hard', 'to', 'get', 'over', 'such', 'a', 'terrible', 'script']
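As a possible variant, the input can be lowercased first, as in the question, so that the character classes only need a-z and 0-9. This is a sketch under the assumption that the text is plain ASCII; the helper name tokenize is hypothetical:

```python
import re

# Lowercase-only version of the pattern; assumes the input is lowercased first.
LOWER_WORD = re.compile(r"\b[a-z0-9]+(?:[-'][a-z0-9]+)?\b")

def tokenize(text):
    """Return the lowercase words of text (hypothetical helper)."""
    return LOWER_WORD.findall(text.lower())

sample_input = """
This film is based on Isabel Allende's not-so-much-better novel. I hate Meryl
Streep and Antonio Banderas (in non-Spanish films), and the other actors,
including Winona, my favourite actress and Jeremy Irons try hard to get over
such a terrible script."""

tokens = tokenize(sample_input)
print(tokens[5:9])  # ['isabel', "allende's", 'not-so', 'much-better']
```

For ASCII input this should give the same list as the version that lowercases each match afterwards.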
