Getting too many matches for one string segment in regex (Python)

I'm trying to write a regex that finds every instance of money in a text. My code works, but I can't figure out why it returns multiple versions of things from my strings.

For example, in this code:

import re

string = "$50.00"
print("number dollars: ")
print(re.findall(r"\-?\(?\$?\s*\-?\s*\(?(((\d{1,3}((\,\d{3})*|\d*))?(\.\d{1,4})?)|((\d{1,3}((\,\d{3})*|\d*))(\.\d{0,4})?))\)?\ ?(one)?\ ?(two)?\ ?(three)?\ ?(four)?\ ?(five)?\ ?(six)?\ ?(seven)?\ ?(eight)?\ ?(nine)?\ ?(ten)?\ ?(eleven)?\ ?(twelve)?\ ?(thirteen)?\ ?(fourteen)?\ ?(fifteen)?\ ?(sixteen)?\ ?(seventeen)?\ ?(eighteen)?\ ?(nineteen)?\ ?(hundred)?\ ?(thousand)?\ ?(million)?\ ?(billion)?\ ?(trillion)?\ ?(dollars)?\ ?(pounds)?\ ?(euros)?", string))

This is the result I get:

number dollars: 
[('50.00', '50.00', '50', '', '', '.00', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')]

This is the regex by itself:

\-?\(?\$?\s*\-?\s*\(?(((\d{1,3}((\,\d{3})*|\d*))?(\.\d{1,4})?)|((\d{1,3}((\,\d{3})*|\d*))(\.\d{0,4})?))\)?\ ?(one)?\ ?(two)?\ ?(three)?\ ?(four)?\ ?(five)?\ ?(six)?\ ?(seven)?\ ?(eight)?\ ?(nine)?\ ?(ten)?\ ?(eleven)?\ ?(twelve)?\ ?(thirteen)?\ ?(fourteen)?\ ?(fifteen)?\ ?(sixteen)?\ ?(seventeen)?\ ?(eighteen)?\ ?(nineteen)?\ ?(hundred)?\ ?(thousand)?\ ?(million)?\ ?(billion)?\ ?(trillion)?\ ?(dollars)?\ ?(pounds)?\ ?(euros)?

The results contain a string for each and every parenthesized group, corresponding to the portion of the string matched by the subexpression in that group, in order of opening parentheses (e.g. (\d+(\.\d+)?) would give [('50.00', '.00')]). To prevent the contents of a group from being captured, prefix the subexpression with ?: (e.g. ((?:,\d{3})*|\d*)); this creates a non-capturing group.
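As a small illustration of the difference, the sketch below runs the short example pattern against the same input string with capturing groups, with the inner group made non-capturing, and with no capturing groups at all. The variable names are only for illustration.

```python
import re

text = "$50.00"

# Two capturing groups: findall returns one tuple per match,
# with one item per group, ordered by opening parenthesis.
with_groups = re.findall(r"(\d+(\.\d+)?)", text)

# Inner group made non-capturing with ?: so only the outer group is reported.
one_group = re.findall(r"(\d+(?:\.\d+)?)", text)

# No capturing groups: findall returns the text of the whole match.
no_groups = re.findall(r"\d+(?:\.\d+)?", text)

print(with_groups)  # [('50.00', '.00')]
print(one_group)    # ['50.00']
print(no_groups)    # ['50.00']
```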

Most of the groups are for words that don't appear in the string, which accounts for most of the empty strings in the result.
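The second, all-empty tuple in the output has a separate cause: every part of the original pattern is optional, so it can also match the empty string, and findall reports that empty match at the end of the input. One way to apply the advice above is sketched below. It is a simplified, hypothetical rewrite, not the asker's exact pattern: the names MONEY and find_money are made up, the prefix handling is reduced, the word lists are abbreviated, and at least one digit is required so the empty match cannot occur.

```python
import re

# Hypothetical, simplified pattern: every group is non-capturing,
# and a digit is mandatory so the pattern cannot match an empty string.
MONEY = re.compile(
    r"""
    -?\(?\$?                        # optional minus sign, parenthesis, dollar sign
    \d{1,3}(?:(?:,\d{3})+|\d*)      # integer part, with or without thousands separators
    (?:\.\d{1,4})?                  # optional fractional part
    \)?                             # optional closing parenthesis
    (?:\ (?:hundred|thousand|million|billion|trillion))*   # optional scale words
    (?:\ (?:dollars|pounds|euros))? # optional currency word
    """,
    re.VERBOSE,
)


def find_money(text):
    """Return the full text of each match, one string per amount."""
    return [m.group(0) for m in MONEY.finditer(text)]


print(find_money("$50.00"))
print(find_money("It cost 1,250 dollars and 3 million euros."))
```

Like the original, this sketch matches any bare number, so it would still need tightening before being used to pick out monetary amounts only.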
