
Faulty regex or strange behavior?

Hello, I am posting my question here because I don't understand where my error might be. I am parsing ingredient lists. Some of them contain claims such as '*no-ogm'. I have managed to extract most of these claims into a dictionary, so that I can pull already-referenced claims directly out of other ingredient lists. The issue is that my regex works in every online regex tester but not in my Jupyter notebook, and I do not understand why. Here is an example:

string='''chickpeas* (31%), water, sesame oil* (11%), tofu* (soybeans*, water, gelling agent (nigari (magnesium chloride))), onions*, carrots*, yeast*, celery*, non-hydrogenated sunflower oil*, cashew nuts*, tomato paste*, vegetable bouillon* (sea salt, maize starch*, glucose syrup*, sunflower oil*, carrots*, onion*, parsnips*, turmeric*, ginger*, parsley*, nutmeg*, lovage*, bay leaves*, black pepper*), sea salt, ginger*, locust bean*, coriander*, turmeric*, cumin*, fenugreek*, nutmeg, black pepper*, cinnamon*, mustard seeds*, cardamom*, cayenne pepper* *from organic agriculture'''

And the regex:

pattern=re.findall('\*{1,3}\s{0,2}\bfrom organic agriculture\b\s{0,2}$',string)

In Regex101 and Pythex the substring '*from organic agriculture' is clearly found. In my Jupyter notebook, however, pattern comes back with no match... Why? I tried many regex flags to correct this behavior, but nothing worked.

This issue is particularly problematic at scale: since I have a dictionary of claims, as mentioned above, I loop over each dictionary key to find the corresponding patterns in multiple strings.

Thank you in advance for your help.

You should use a so-called raw string here, i.e. in place of:

pattern=re.findall('\*{1,3}\s{0,2}\bfrom organic agriculture\b\s{0,2}$',string)

use

pattern=re.findall(r'\*{1,3}\s{0,2}\bfrom organic agriculture\b\s{0,2}$',string)

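This is crucial when the pattern uses regex escapes such as \b or \s. In an ordinary string literal Python interprets \b itself as a backspace character, so the regex engine never receives the word-boundary token; \s only survives because it is not a Python string escape.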
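To see the difference concretely, here is a minimal sketch. The shortened `text` and the variable names are made up for illustration; only the full pattern on the last line comes from the question.

```python
import re

# Shortened, hypothetical ingredient list ending with the same claim
text = "cumin*, cardamom*, cayenne pepper* *from organic agriculture"

# Ordinary literal: Python converts \b to a backspace (U+0008)
# before the regex engine ever sees the pattern.
plain_boundary = '\bfrom'
# Raw literal: the backslash and the b stay as two characters,
# which the regex engine reads as a word boundary.
raw_boundary = r'\bfrom'

print(repr(plain_boundary))
print(repr(raw_boundary))

plain_hits = re.findall(plain_boundary, text)  # looks for a literal backspace
raw_hits = re.findall(raw_boundary, text)      # looks for "from" at a word boundary

# The pattern from the question, written as a raw string
full_hits = re.findall(
    r'\*{1,3}\s{0,2}\bfrom organic agriculture\b\s{0,2}$', text
)
print(plain_hits, raw_hits, full_hits)
```

Online testers receive the pattern as typed, with no Python string-literal processing in between, which would explain why the same pattern matches there.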

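Use a double backslash before the b in your pattern: '\*{1,3}\s{0,2}\\bfrom organic agriculture\\b\s{0,2}$' (other such characters are a newline, \, ', ", a, f, n, r, t, v, and x followed by a hexadecimal character code), or prefix the pattern with r: r'\*{1,3}\s{0,2}\bfrom organic agriculture\b\s{0,2}$', so that the backslash is not treated as the start of an escape sequence (https://docs.python.org/2.0/ref/strings.html).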
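As a quick check that the two spellings are interchangeable, the sketch below compares them and then reuses one compiled pattern over several strings. Note one assumption: it doubles every backslash, not only the ones before b, because recent Python versions warn about unrecognized escapes such as \s in ordinary literals. The `ingredient_lists` data is invented for the example.

```python
import re

# Every backslash doubled (ordinary literal) versus the raw-string form
escaped = '\\*{1,3}\\s{0,2}\\bfrom organic agriculture\\b\\s{0,2}$'
raw = r'\*{1,3}\s{0,2}\bfrom organic agriculture\b\s{0,2}$'

# Both literals produce exactly the same pattern text
same_pattern = (escaped == raw)
print(same_pattern)

# Compile once, then apply to many strings (hypothetical sample data)
compiled = re.compile(raw)
ingredient_lists = [
    "chickpeas* (31%), water, sea salt *from organic agriculture",
    "water, sugar, lemon juice",
]
matches = [compiled.findall(s) for s in ingredient_lists]
print(matches)
```

Compiling the pattern once is a small convenience when looping over many ingredient lists; it does not change what is matched.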
