Python Regex Skipping Optional Groups

I am trying to extract a doctor's name and title from a string. If "dr" is in the string, I want to use that as the title and then use the next word as the doctor's name. However, I also want the regex to be compatible with strings that do not have "dr" in them. In that case, it should just match the first word as the doctor's name and assume no title.

I have come up with the following regex pattern:

pattern = re.compile(r'(DR\.? )?([A-Z]*)', re.IGNORECASE)

As I understand it, this should optionally match the letters "dr" (with or without a following period) and then a space, followed by a series of letters, case-insensitive. The problem is that it only seems to pick up the optional "dr" title if it is at the beginning of the string.

import re

# Raw string so the backslash in \. reaches the regex engine unchanged.
pattern = re.compile(r'(DR\.? )?([A-Z]*)', re.IGNORECASE)
test1 = "Dr Joseph Fox"
test2 = "Joseph Fox"
test3 = "Optometry by Dr Joseph Fox"
print(pattern.search(test1).groups())
print(pattern.search(test2).groups())
print(pattern.search(test3).groups())

The code returns this:

('Dr ', 'Joseph')
(None, 'Joseph')
(None, 'Optometry')

The first two scenarios make sense to me, but why does the third not find the optional "Dr"? Is there a way to make this work?

You're seeing this behavior because regexes tend to be greedy and accept the first (leftmost) possible match. As a result, your regex accepts only the first word of your third string, with no characters matching the first group, which is optional. You can see this by using the findall regex function:

>>> print(pattern.findall(test3))
[('', 'Optometry'), ('', ''), ('', 'by'), ('', ''), ('Dr ', 'Joseph'), ('', ''), ('', 'Fox'), ('', '')]

It's immediately obvious that 'Dr Joseph' was successfully found, but it just wasn't the first matching part of your string.

In my experience, trying to coerce a regex into expressing or capturing multiple cases is often asking for an inscrutable regex. To answer your question specifically, I'd prefer to run the string through one regex that requires the 'Dr' title, and if that fails to produce any matches, just split on spaces and take the first word (or however you want to go about getting the first word).
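The two-step approach described above could be sketched roughly as follows. The helper name extract_doctor and the exact title pattern are assumptions made for illustration; they are not taken from the question.

```python
import re

# Hypothetical pattern: requires a "Dr" title (optional period), then a name.
# \b keeps "dr" from matching at the end of a longer word.
TITLE_PATTERN = re.compile(r'\b(Dr\.?) ([A-Z]+)', re.IGNORECASE)


def extract_doctor(text):
    """Return (title, name); title is None when no "Dr" is found."""
    match = TITLE_PATTERN.search(text)
    if match:
        return match.group(1), match.group(2)
    # Fallback: no title anywhere, so take the first whitespace-separated word.
    words = text.split()
    return (None, words[0]) if words else (None, None)


for sample in ("Dr Joseph Fox", "Joseph Fox", "Optometry by Dr Joseph Fox"):
    print(extract_doctor(sample))
```

Keeping the fallback outside the regex means each case stays readable on its own, at the cost of a second pass over strings without a title.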

Regular expression engines match greedily from left to right. In other words: there is no "best" match, and the first match will always be returned. You can do a global search, though: check out re.findall().
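One way to use a global search here is to collect every match with re.findall() and then prefer the one that carries a title. This is only a sketch: the helper name pick_doctor is hypothetical, and the pattern is the one from the question written as a raw string.

```python
import re

# The pattern from the question, written as a raw string.
pattern = re.compile(r'(DR\.? )?([A-Z]*)', re.IGNORECASE)


def pick_doctor(text):
    """Prefer a match that carries a title; otherwise use the first word."""
    # Drop the empty matches that findall reports between words.
    candidates = [(title, name) for title, name in pattern.findall(text) if name]
    for title, name in candidates:
        if title:
            # The first group includes the trailing space, so strip it.
            return title.strip(), name
    if candidates:
        return None, candidates[0][1]
    return None, None


for sample in ("Dr Joseph Fox", "Joseph Fox", "Optometry by Dr Joseph Fox"):
    print(pick_doctor(sample))
```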

Your regex basically accepts any word, so even after using findall it will be difficult to choose which word is the doctor's name if "dr" is not present.

Is re.IGNORECASE really important? Are you interested only in the doctor's first name, or in both the first name and the surname?

I would recommend using a regex that matches two words, each starting with an uppercase letter and separated by exactly one space, while keeping the optional "dr" in front.
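A minimal sketch of such a pattern, assuming names consist of an uppercase letter followed by lowercase letters (the variable name name_pattern and that assumption are mine, not the question's):

```python
import re

# Case-sensitive on purpose: each name word must start with an uppercase letter.
# Group 1 is the optional title, groups 2 and 3 are first name and surname.
name_pattern = re.compile(r'(?:(Dr\.?) )?([A-Z][a-z]+) ([A-Z][a-z]+)')

for sample in ("Dr Joseph Fox", "Joseph Fox", "Optometry by Dr Joseph Fox"):
    match = name_pattern.search(sample)
    print(match.groups() if match else None)
```

In the third string, "Optometry by" is skipped because "by" is lowercase. Two capitalized words appearing before the title would still be matched first, so this only helps when the surrounding text is not capitalized like a name.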

If re.IGNORECASE is really important, it may be better to search for "dr" first and, if that is unsuccessful, store the first word as the name, or something similar, as proposed above.

Look for the (?<=...) lookbehind syntax in the Python regex documentation.

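Your re pattern will look roughly like this (note that Python's re module only accepts fixed-width lookbehind, so the optional period inside the lookbehind has to be handled separately):

(DR\.? )?(?<=DR\.? )([A-Z]*)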
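As a hedged illustration of the lookbehind idea: Python's re module requires lookbehind patterns to have a fixed width, so a lookbehind containing an optional period is rejected at compile time. The sketch below shows that, then works around it with two fixed-width lookbehinds. The name after_title is hypothetical, and this variant only covers strings that do contain a title.

```python
import re

# A variable-width lookbehind such as (?<=DR\.? ) is rejected by the re module.
try:
    re.compile(r'(?<=DR\.? )[A-Z]+', re.IGNORECASE)
except re.error as exc:
    print("rejected:", exc)

# Two fixed-width lookbehinds, one per spelling of the title ("Dr " and "Dr. ").
after_title = re.compile(r'(?:(?<=DR )|(?<=DR\. ))[A-Z]+', re.IGNORECASE)

for sample in ("Optometry by Dr Joseph Fox", "Dr. Joseph Fox", "Joseph Fox"):
    match = after_title.search(sample)
    print(match.group(0) if match else None)
```

Because the lookbehind is mandatory, a string without "Dr" produces no match at all, so a fallback for the untitled case is still needed.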

Your pattern only picks up Dr when the match starts at it, and here the match starts at the first word; it never looks for Dr further into the string.

Try pattern = re.compile(r'(.*DR\.? )?([A-Z]*)', re.IGNORECASE)
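With the pattern suggested above, the first group also contains any text that precedes the title (for example 'Optometry by Dr '). A possible variant, shown here only as a sketch, moves the leading .* into a non-capturing group so that just the title is captured. The \b and the switch from * to + are my own additions.

```python
import re

# Non-capturing prefix skips any leading text; only the title itself is captured.
# \b stops "dr" from matching at the end of a longer word.
pattern = re.compile(r'(?:.*\b(DR\.?) )?([A-Z]+)', re.IGNORECASE)

for sample in ("Dr Joseph Fox", "Joseph Fox", "Optometry by Dr Joseph Fox"):
    print(pattern.search(sample).groups())
```

Since .* is greedy, a string containing several titles would yield the last one rather than the first.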
