
Regex capture numbers based on preceding text

Consider the following text:

one="ambience: 5 comments:xxx food: 4 comments: xxxx service: 3 
comments: xxx" 

two="ambience: 5 comments:xxx food:   comments: since nothing to eat
after 8 pm service: 4  comments: xxxx "

three="ambience: it is a 5 comments:xxx food: a 6   comments: since nothing to eat
after 8 pm service: a 4  comments: xxxx "

For string one:

    re.findall(ur'(ambience|food|service)[\s\S]*?(\d)',one,re.UNICODE)
    [('ambience', '5'), ('food', '4'), ('service', '3')]

For string two the result is:

[('ambience', '5'), ('food', '8'), ('service', '4')]

Since this logic simply looks for the first digit after the label, the result is misleading when a rating is skipped, intentionally or otherwise: the 8 in "after 8 pm" is reported as the food rating.

If a rating is missing, how do I get the regex to return that rating as NaN?

 [('ambience', '5'), ('food', 'NaN'), ('service', '4')] 

I also have a variant using look-ahead and look-behind assertions:

re.findall(ur'(?<=food)[\s]*:[^\d]*([\d[.|-|\/|-]+)[^\d]*(?=comment[s]*[\s]*:)',one,re.UNICODE)

A simple change to the regex would do the trick:

(ambience|food|service):[^\d:]*(\d*)
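  • [^\d:]* matches anything other than a : or a digit, so the search cannot run past the next "comments:" label; (\d*) then captures the rating, or an empty string if there is none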
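
To see why excluding the colon matters, here is a minimal sketch that runs the original lazy pattern and the revised pattern side by side on string two. The string is copied from the question and joined onto one line; the variable names `lazy`, `bounded`, `lazy_result` and `bounded_result` are my own, chosen for illustration.

```python
import re

# String "two" from the question, joined onto a single line.
two = ("ambience: 5 comments:xxx food:   comments: since nothing to eat "
       "after 8 pm service: 4  comments: xxxx ")

# Original pattern: lazily skips ANY characters until the first digit.
lazy = r'(ambience|food|service)[\s\S]*?(\d)'

# Revised pattern: skips only non-digit, non-colon characters,
# so it stops at the colon of the next "comments:" label.
bounded = r'(ambience|food|service):[^\d:]*(\d*)'

lazy_result = re.findall(lazy, two)
bounded_result = re.findall(bounded, two)

print(lazy_result)     # [('ambience', '5'), ('food', '8'), ('service', '4')]
print(bounded_result)  # [('ambience', '5'), ('food', ''), ('service', '4')]
```

The lazy pattern reaches the 8 in "after 8 pm", while the bounded pattern gives up at the colon and captures an empty string for food.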

Example matching: http://regex101.com/r/bM0gT2/1

Example usage:

>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', one)
[('ambience', '5'), ('food', '4'), ('service', '3')]
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', two)
[('ambience', '5'), ('food', ''), ('service', '4')]
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', three)
[('ambience', '5'), ('food', '6'), ('service', '4')]
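
The pattern returns an empty string rather than NaN for a missing rating, so one more step is needed to get the output asked for in the question. The following is a sketch only: the helper name `extract_ratings` and its `missing` parameter are hypothetical, not part of the answer or of the `re` module.

```python
import re

PATTERN = re.compile(r'(ambience|food|service):[^\d:]*(\d*)')

def extract_ratings(text, missing='NaN'):
    """Return (label, rating) pairs, replacing empty ratings with `missing`."""
    # An empty capture is falsy, so `digits or missing` substitutes the placeholder.
    return [(label, digits or missing)
            for label, digits in PATTERN.findall(text)]

# String "two" from the question, joined onto a single line.
two = ("ambience: 5 comments:xxx food:   comments: since nothing to eat "
       "after 8 pm service: 4  comments: xxxx ")

ratings = extract_ratings(two)
print(ratings)  # [('ambience', '5'), ('food', 'NaN'), ('service', '4')]
```

Passing `missing=float('nan')` instead would yield a real floating-point NaN if the values are going to be used numerically. Note that `(\d*)` captures only a run of plain digits, so a rating written as 4.5 would be read as 4.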


 