How to properly regex match the following string in Python?

I have the following string:

1- Baby Carrots (4Kids) (3 DOLLARS) [EXTRA 0 COUNT]; [REQUIRED 5 COUNT]

I am trying to get the following groups:

Item - 1
Food - Baby Carrots (4Kids) (3 DOLLARS)
Cost - 3
Extra - 0
Required - 5

The following is my current match string, which is not picking up anything:

'(?P<item>.+?)\-(?P<food>.*)\[.*?(?P<extra>\d+(\.\d+)?).*\].*\[.*?(?P<required>\d+(\.\d+)?).*\]'

What is wrong with my attempt?

Your original regex:

(?P<item>.+?)\-(?P<food>.*)\[.*?(?P<extra>\d+(\.\d+)?).*\].*\[.*?(?P<required>\d+(\.\d+)?).*\]

Debuggex Demo

Your problems are mostly due to the fact that you are searching for any character instead of specific ones (digits and static strings). For example, why do you use

(?P<item>.+?)

if it's only going to be numbers? Change it to

(?P<item>[0-9]+?)

The reluctant quantifier '+?' is also unnecessary in this case, since you always want the entire number: the next portion of the match can never begin in the middle of that number.

In addition, this should be anchored to the start of the line (input):

^(?P<item>[0-9]+?)

You don't need to escape the dash (although it doesn't hurt).

^(?P<item>[0-9]+?)-
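
A minimal Python sketch of the points so far, assuming the standard re module. The sample input with a two-digit item number is made up for illustration. It suggests that the reluctant and greedy quantifiers capture the same number once a dash has to follow, and that escaping the dash makes no difference:

```python
import re

# Hypothetical sample input with a two-digit item number.
line = "12- Baby Carrots"

lazy = re.search(r"^(?P<item>[0-9]+?)-", line)
greedy = re.search(r"^(?P<item>[0-9]+)-", line)
escaped = re.search(r"^(?P<item>[0-9]+)\-", line)

# The dash that must follow forces both quantifiers to take the whole number.
print(lazy.group("item"), greedy.group("item"), escaped.group("item"))

# With nothing required after it, the reluctant quantifier stops at one digit.
alone = re.match(r"(?P<item>[0-9]+?)", line)
print(alone.group("item"))
```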

Your food group (heh) is the most complicated part:

(?P<food>.*)

It doesn't just contain any character. Based on your demo input, it only has letters, spaces, numbers, and parens. So search just for them:

(?P<food>[\w0-9 ()]+)

Here's what we have so far:

^(?P<item>[0-9]+?)- (?P<food>[\w0-9 ()]+)

Debuggex Demo

You'll see that this also matches the cost part (which is completely missing from your regex; I assume that's just an oversight).
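
To check this without the online demo, here is a small sketch in Python (assuming the standard re module and the sample line from the question). It prints what the food group captures at this stage:

```python
import re

line = "1- Baby Carrots (4Kids) (3 DOLLARS) [EXTRA 0 COUNT]; [REQUIRED 5 COUNT]"

# The pattern built so far: item number, dash, space, then the food group.
partial = re.compile(r"^(?P<item>[0-9]+?)- (?P<food>[\w0-9 ()]+)")
m = partial.search(line)

# repr() makes any trailing whitespace in the capture visible.
print(repr(m.group("item")))
print(repr(m.group("food")))
```

The character class allows spaces and parentheses, so the food group keeps going through the cost part and only stops at the first square bracket.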

So add the cost, which is

  • (
  • a number
  • [space]DOLLARS)

But only capture the number:

^(?P<item>[0-9]+?)- (?P<food>[\w0-9 ()]+) \((?P<cost>[0-9]+) DOLLARS\)
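
A quick sketch of how the added cost group changes the capture, again assuming Python's re module and the question's sample line:

```python
import re

line = "1- Baby Carrots (4Kids) (3 DOLLARS) [EXTRA 0 COUNT]; [REQUIRED 5 COUNT]"

with_cost = re.compile(
    r"^(?P<item>[0-9]+?)- (?P<food>[\w0-9 ()]+) \((?P<cost>[0-9]+) DOLLARS\)"
)
m = with_cost.search(line)

# The food group is still greedy, but it has to back off far enough
# for the literal " (<number> DOLLARS)" part to match after it.
print(repr(m.group("food")))
print(repr(m.group("cost")))
```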

The rest of your regex works fine, actually, and it can be added to the end as is:

\[.*?(?P<extra>\d+(\.\d+)?).*\].*\[.*?(?P<required>\d+(\.\d+)?).*\]

I'd recommend, however, changing the .*? to EXTRA[space] if indeed that text is always found there (and again, no need for reluctance in this case). The same goes for [space]COUNT, the ; and REQUIRED[space]. The more you narrow things down, the easier your regex will be to debug, assuming your input is indeed that restricted.

Here's the final version (with an end-of-line anchor as well):

^(?P<item>[0-9]+?)- (?P<food>[\w0-9 ()]+) \((?P<cost>[0-9]+) DOLLARS\) \[EXTRA (?P<extra>\d+(\.\d+)?) COUNT\]; \[REQUIRED (?P<required>\d+(\.\d+)?) COUNT\]$
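
One way this final pattern might be used from Python is sketched below. The helper name parse_line and the numeric conversions are assumptions added for illustration; they are not part of the original answer:

```python
import re

# Final pattern from the answer, split across lines for readability only.
LINE_RE = re.compile(
    r"^(?P<item>[0-9]+?)- (?P<food>[\w0-9 ()]+) \((?P<cost>[0-9]+) DOLLARS\) "
    r"\[EXTRA (?P<extra>\d+(\.\d+)?) COUNT\]; "
    r"\[REQUIRED (?P<required>\d+(\.\d+)?) COUNT\]$"
)

def parse_line(line):
    """Hypothetical helper: return the named groups as a dict, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()  # named groups only; the unnamed decimal groups are left out
    fields["item"] = int(fields["item"])
    fields["cost"] = int(fields["cost"])
    # extra/required may carry a decimal part, so float is the safer conversion
    fields["extra"] = float(fields["extra"])
    fields["required"] = float(fields["required"])
    return fields

sample = "1- Baby Carrots (4Kids) (3 DOLLARS) [EXTRA 0 COUNT]; [REQUIRED 5 COUNT]"
print(parse_line(sample))
```

Because the pattern is anchored at both ends, a line that deviates from the expected layout makes parse_line return None instead of a partial result.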

Debuggex Demo


Before analyzing your regex, this is what I came up with:

(?P<item>[0-9]+)- (?P<food>[\w ()]+) \((?P<cost>[0-9]+) DOLLARS\) \[EXTRA (?P<extra>[0-9]+) COUNT\]; \[REQUIRED (?P<required>[0-9]+) COUNT\]
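
This version has no ^ or $ anchors. In Python, one option (an assumption about how you would call it, not something stated in the answer) is re.fullmatch, which only succeeds when the whole string matches. The trailing text in the second sample is made up for illustration:

```python
import re

pattern = re.compile(
    r"(?P<item>[0-9]+)- (?P<food>[\w ()]+) \((?P<cost>[0-9]+) DOLLARS\) "
    r"\[EXTRA (?P<extra>[0-9]+) COUNT\]; \[REQUIRED (?P<required>[0-9]+) COUNT\]"
)

line = "1- Baby Carrots (4Kids) (3 DOLLARS) [EXTRA 0 COUNT]; [REQUIRED 5 COUNT]"
noisy = line + " trailing junk"  # hypothetical input with extra text at the end

# search() accepts a match anywhere in the string...
print(pattern.search(noisy) is not None)
# ...while fullmatch() requires the entire string to fit the pattern.
print(pattern.fullmatch(noisy) is None)
print(pattern.fullmatch(line).groupdict())
```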

Debuggex Demo


All these links came from the Stack Overflow Regular Expressions FAQ.

Like this:

(?P<item>.+?)\-\s(?P<food>.*?\)).*?\((?P<cost>\d)\s\w+\)\s\[.*?(?P<extra>\d+(\.\d+)?).*\].*\[.*?(?P<required>\d+(\.\d+)?).*\]

Demo here: http://regex101.com/r/qD1rL9

As mentioned above, you are missing a capture for cost. You also need to make the food capture non-greedy and include the closing paren. My version:

(?P<Item>\d)-\s*(?P<Food>.*?\))\s*\((?P<Cost>\d*).*EXTRA\s*(?P<Extra>\d*).*REQUIRED\s*(?P<Required>\d*)

{'Food': 'Baby Carrots (4Kids)', 'Item': '1', 'Required': '5', 'Extra': '0', 'Cost': '3'}
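
A dictionary like the one above can presumably be produced with groupdict(). A minimal sketch, assuming re.search on the question's sample line:

```python
import re

line = "1- Baby Carrots (4Kids) (3 DOLLARS) [EXTRA 0 COUNT]; [REQUIRED 5 COUNT]"

pattern = re.compile(
    r"(?P<Item>\d)-\s*(?P<Food>.*?\))\s*\((?P<Cost>\d*)"
    r".*EXTRA\s*(?P<Extra>\d*).*REQUIRED\s*(?P<Required>\d*)"
)

# groupdict() maps each named group to the text it captured.
groups = pattern.search(line).groupdict()
print(groups)
```

Note that the Item group uses a single \d, so with re.search an item number of two or more digits would be captured only by its last digit.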

It seems a bit faster when tested on http://www.pythonregex.com/
