Regex capture numbers based on preceding text
Consider the following text:
one="ambience: 5 comments:xxx food: 4 comments: xxxx service: 3
comments: xxx"
two="ambience: 5 comments:xxx food: comments: since nothing to eat
after 8 pm service: 4 comments: xxxx "
three="ambience: it is a 5 comments:xxx food: a 6 comments: since nothing to eat
after 8 pm service: a 4 comments: xxxx "
For string one:
re.findall(ur'(ambience|food|service)[\s\S]*?(\d)',one,re.UNICODE)
[('ambience', '5'), ('food', '4'), ('service', '3')]
For string two, the result is:
[('ambience', '5'), ('food', '8'), ('service', '4')]
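Because this logic simply looks for the first digit after the given keyword, it returns the wrong value when a rating has been skipped, whether intentionally or not.
If a rating is missing, how can I make the regex return that rating as NaN?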
[('ambience', '5'), ('food', 'NaN'), ('service', '4')]
I also have a variant that uses lookbehind and lookahead anchors:
re.findall(ur'(?<=food)[\s]*:[^\d]*([\d[.|-|\/|-]+)[^\d]*(?=comment[s]*[\s]*:)',one,re.UNICODE)
A simple change to the regex solves the problem:
(ambience|food|service):[^\d:]*(\d*)
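[^\d:]* matches anything other than : or a digit, so the search for a rating stops at the colon of the next label (for example, comments:) instead of running on to a later number.
Example match: http://regex101.com/r/bM0gT2/1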
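To see why excluding the colon matters, the following sketch compares the lazy pattern from the question with the revised pattern on the part of string two where the food rating is missing. The variable names (segment, lazy, bounded) are illustrative only and do not come from the original post.

```python
import re

# Part of string two where the food rating is missing (illustrative name).
segment = "food: comments: since nothing to eat after 8 pm service: 4"

# Pattern from the question: lazily skip anything up to the first digit.
lazy = re.findall(r'(ambience|food|service)[\s\S]*?(\d)', segment)

# Revised pattern: skip only characters that are neither digits nor colons,
# then capture zero or more digits.
bounded = re.findall(r'(ambience|food|service):[^\d:]*(\d*)', segment)

print(lazy)     # the lazy pattern picks up the 8 from "8 pm"
print(bounded)  # the revised pattern leaves the food rating empty
```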
Example usage:
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', one)
[('ambience', '5'), ('food', '4'), ('service', '3')]
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', two)
[('ambience', '5'), ('food', ''), ('service', '4')]
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', three)
[('ambience', '5'), ('food', '6'), ('service', '4')]
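The regex can only return text that is present in the input, so a missing rating comes back as an empty string rather than NaN. One way to get the output the question asks for is a small post-processing step. This is a minimal sketch, assuming a hypothetical helper named extract_ratings and a module-level PATTERN; neither is part of the original answer.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(r'(ambience|food|service):[^\d:]*(\d*)')

def extract_ratings(text):
    """Return (category, rating) pairs, using 'NaN' where no digits were captured."""
    # An empty capture is falsy, so `rating or 'NaN'` substitutes the placeholder.
    return [(name, rating or 'NaN') for name, rating in PATTERN.findall(text)]

two = ("ambience: 5 comments:xxx food: comments: since nothing to eat "
       "after 8 pm service: 4 comments: xxxx ")
print(extract_ratings(two))
```

If numeric values are needed later, the 'NaN' string could be swapped for float('nan') in the same expression.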