Extract names from string with Python regex
I've been trying to extract names from a string, but don't seem to be close to success.
Here is the code:
import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
regex = re.compile(r'([A-Z][a-z]+(?: [A-Z][a-z]\.)? [A-Z][a-z]+)')
print(regex.findall(string))
This is the output I'm getting:
['Moe Szyslak', 'Timothy Lovejoy', 'Ned Flanders', 'Julius Hibbert']
Fancy regexes take time to compose and are difficult to maintain. In this case, I'd tend to keep it simple:
re.findall(r"[^()0-9-]+", string)
Output:
['Moe Szyslak', ' ', 'Burns, C. Montgomery', ' ', 'Rev. Timothy Lovejoy', ' ', 'Ned Flanders', 'Simpson, Homer', 'Dr. Julius Hibbert']
If the blanks are an issue, I'd filter them out:
list(filter(str.strip, re.findall(r"[^()0-9-]+", string)))
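A variation on the filtering step, as a minimal sketch: besides dropping the blank matches, it also strips any stray whitespace from the pieces that remain. The variable names `pieces` and `names` are illustrative and not part of the original answer.

```python
import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Grab every run of characters that cannot be part of a phone number.
pieces = re.findall(r"[^()0-9-]+", string)

# Strip surrounding whitespace and drop the pieces that end up empty.
names = [piece.strip() for piece in pieces if piece.strip()]
print(names)
```

Like the one-liner it builds on, this only works because the phone numbers here consist solely of digits, parentheses, hyphens, and spaces.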
Extracting human names, even in English, is notoriously hard. The following regex solves your particular problem but may fail on other inputs (e.g., it does not capture names with dashes):
re.findall(r"[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+", string)
#['Moe Szyslak', 'Burns, C. Montgomery', 'Timothy Lovejoy',
# 'Ned Flanders', 'Simpson, Homer', 'Julius Hibbert']
And with titles:
TITLE = r"(?:[A-Z][a-z]*\.\s*)?"
NAME1 = r"[A-Z][a-z]+,?\s+"
MIDDLE_I = r"(?:[A-Z][a-z]*\.?\s*)?"
NAME2 = r"[A-Z][a-z]+"
re.findall(TITLE + NAME1 + MIDDLE_I + NAME2, string)
#['Moe Szyslak', 'Burns, C. Montgomery', 'Rev. Timothy Lovejoy',
# 'Ned Flanders', 'Simpson, Homer', 'Dr. Julius Hibbert']
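One possible way to cover the hyphenated names mentioned above is to let each name part contain internal hyphens. This is only a sketch: the `PART` helper and the `sample` string (with the made-up entry "Mary Smith-Jones") are assumptions for illustration, not part of the original question.

```python
import re

# Hypothetical input with a hyphenated surname; not from the original question.
sample = "555-1239Moe Szyslak555-0199Mary Smith-Jones(636) 555-0113Burns, C. Montgomery"

# A capitalized name part that may contain internal hyphens, e.g. "Smith-Jones".
PART = r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*"

TITLE = r"(?:[A-Z][a-z]*\.\s*)?"
NAME1 = PART + r",?\s+"
MIDDLE_I = r"(?:[A-Z][a-z]*\.?\s*)?"
NAME2 = PART

hyphen_names = re.findall(TITLE + NAME1 + MIDDLE_I + NAME2, sample)
print(hyphen_names)
```

The hyphens inside the phone numbers are not picked up because a hyphen is only accepted when it is followed by a capitalized word.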
As a side note, there is no need to compile a regex unless you plan to reuse it.
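For the reuse case, a rough sketch of what compiling once looks like. The `records` list and the `NAME_RE` name are made up for illustration. Note that the `re` module also caches recently used patterns internally, so the gain from compiling is usually modest and mostly a matter of readability.

```python
import re

# Compile once when the same pattern is applied to many strings.
NAME_RE = re.compile(r"[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+")

# Hypothetical records, one entry per string.
records = [
    "555-1239Moe Szyslak",
    "(636) 555-0113Burns, C. Montgomery",
    "636-555-3226Simpson, Homer",
]

found = [NAME_RE.findall(record) for record in records]
print(found)
```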
Here is one approach using zero-width lookarounds to isolate each name:
import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
result = re.findall(r'(?:(?<=^)|(?<=[^A-Za-z.,]))[A-Za-z.,]+(?: [A-Za-z.,]+)*(?:(?=[^A-Za-z.,])|(?=$))', string)
print(result)
['Moe Szyslak', 'Burns, C. Montgomery', 'Rev. Timothy Lovejoy', 'Ned Flanders',
'Simpson, Homer', 'Dr. Julius Hibbert']
The actual pattern matched is this:
[A-Za-z.,]+(?: [A-Za-z.,]+)*
This says to match one or more uppercase or lowercase letters, dots, or commas, followed by zero or more groups that each consist of a single space and one or more of those same characters.
In addition, we use the following lookarounds on the left and right of this pattern:
(?:(?<=^)|(?<=[^A-Za-z.,]))
Look behind and assert either the start of the string or a non-matching character.
(?:(?=[^A-Za-z.,])|(?=$))
Look ahead and assert either the end of the string or a non-matching character.
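The two lookaround groups can arguably be written more compactly with negative lookarounds, since "start of string or a non-matching character behind" amounts to "no matching character behind". The sketch below compares both forms on the question's input; the names `original`, `compact`, and `compact_result` are illustrative.

```python
import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Pattern from the answer: positive lookarounds with explicit ^ and $ alternatives.
original = r'(?:(?<=^)|(?<=[^A-Za-z.,]))[A-Za-z.,]+(?: [A-Za-z.,]+)*(?:(?=[^A-Za-z.,])|(?=$))'

# Negative lookarounds: not preceded by, and not followed by, a name character.
compact = r'(?<![A-Za-z.,])[A-Za-z.,]+(?: [A-Za-z.,]+)*(?![A-Za-z.,])'

compact_result = re.findall(compact, string)
same = compact_result == re.findall(original, string)
print(compact_result)
print(same)
```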
I extract entities such as names with spaCy in no time. With spaCy you can rely on pretrained language models, which carry a great deal of knowledge about common names and titles.
Step: set up spaCy and load the pretrained English language model
# The model has to be installed first, e.g. with: python -m spacy download en_core_web_sm
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
Step: create spaCy document
doc = nlp('555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert')
Step: get all entities in the document that are labelled as a person
print([(X.text, X.label_) for X in doc.ents if X.label_ == 'PERSON'])