Extracting names from a string using Python regex

Question

I've been trying to extract names from a string, but don't seem to be close to success.

Here is the code:

import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
regex = re.compile(r'([A-Z][a-z]+(?: [A-Z][a-z]\.)? [A-Z][a-z]+)')
print(regex.findall(string))

This is the output I'm getting:

['Moe Szyslak', 'Timothy Lovejoy', 'Ned Flanders', 'Julius Hibbert']

Answer 1

Fancy regexes take time to compose and are difficult to maintain. In this case, I'd tend to keep it simple:

re.findall(r"[^()0-9-]+", string)

Output:

['Moe Szyslak', ' ', 'Burns, C. Montgomery', ' ', 'Rev. Timothy Lovejoy', ' ', 'Ned Flanders', 'Simpson, Homer', 'Dr. Julius Hibbert']

If the blanks are an issue, I'd filter the list: list(filter(str.strip, result)), where result is the list returned by findall.
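A minimal sketch of that filtering step, reusing the string from the question. The variable names `parts`, `names` and `names_trimmed` are illustrative and not from the original answer; the extra `strip()` in the list comprehension is an assumption, in case surrounding whitespace should be trimmed as well.

```python
import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Collect every run of characters that are not digits, parentheses, or hyphens.
parts = re.findall(r"[^()0-9-]+", string)

# str.strip returns "" for whitespace-only items, which filter() treats as false.
names = list(filter(str.strip, parts))

# Equivalent list comprehension that also trims each surviving item.
names_trimmed = [p.strip() for p in parts if p.strip()]

print(names)
```

Note that `filter(str.strip, ...)` only drops whitespace-only items; it keeps the original strings untouched, whereas the list comprehension also removes leading and trailing whitespace from each kept item.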

Answer 2

Extracting human names, even in English, is notoriously hard. The following regex solves your particular problem but may fail on other inputs (e.g., it does not capture names with hyphens):

re.findall(r"[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+", string)
#['Moe Szyslak', 'Burns, C. Montgomery', 'Timothy Lovejoy', 
# 'Ned Flanders', 'Simpson, Homer', 'Julius Hibbert']

And with titles:

TITLE = r"(?:[A-Z][a-z]*\.\s*)?"
NAME1 = r"[A-Z][a-z]+,?\s+"
MIDDLE_I = r"(?:[A-Z][a-z]*\.?\s*)?"
NAME2 = r"[A-Z][a-z]+"

re.findall(TITLE + NAME1 + MIDDLE_I + NAME2, string)
#['Moe Szyslak', 'Burns, C. Montgomery', 'Rev. Timothy Lovejoy', 
# 'Ned Flanders', 'Simpson, Homer', 'Dr. Julius Hibbert']
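For the hyphenated names mentioned as a limitation, one possible extension is sketched below. It is not part of the original answer: the `PART` and `PATTERN` names and the sample string are made up for illustration, and the sketch assumes each hyphenated piece starts with a capital letter.

```python
import re

# Hypothetical name part: capitalised word, optionally hyphenated ("Jean-Luc").
PART = r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*"

TITLE = r"(?:[A-Z][a-z]*\.\s*)?"
MIDDLE_I = r"(?:[A-Z][a-z]*\.?\s*)?"
PATTERN = TITLE + PART + r",?\s+" + MIDDLE_I + PART

# Made-up sample input, not taken from the question.
sample = "555-0199Jean-Luc Picard555-0142Smith-Jones, Anna"

hyphenated = re.findall(PATTERN, sample)
print(hyphenated)
```

On this sample the list contains 'Jean-Luc Picard' and 'Smith-Jones, Anna'. Hyphens in the phone numbers are not picked up because they are followed by digits rather than a capital letter.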

As a side note, there is no need to compile a regex unless you plan to reuse it.
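To illustrate when compiling can pay off, here is a sketch in which the pattern is compiled once and applied to several strings. The name `NAME_RE` and the list `lines` are invented for this example. For a one-off search, calling `re.findall` directly works just as well, since the `re` module caches recently used patterns.

```python
import re

# Compile once, reuse for every input line.
NAME_RE = re.compile(
    r"(?:[A-Z][a-z]*\.\s*)?"    # optional title such as "Dr." or "Rev."
    r"[A-Z][a-z]+,?\s+"         # first name part, optional comma
    r"(?:[A-Z][a-z]*\.?\s*)?"   # optional middle initial
    r"[A-Z][a-z]+"              # last name part
)

# Hypothetical input: one record per line instead of one long string.
lines = [
    "555-1239Moe Szyslak",
    "(636) 555-0113Burns, C. Montgomery",
    "5553642Dr. Julius Hibbert",
]

results = [NAME_RE.findall(line) for line in lines]
print(results)
```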

Answer 3

Here is one approach using zero-width lookarounds to isolate each name:

import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
result = re.findall(r'(?:(?<=^)|(?<=[^A-Za-z.,]))[A-Za-z.,]+(?: [A-Za-z.,]+)*(?:(?=[^A-Za-z.,])|(?=$))', string)

print(result)

['Moe Szyslak', 'Burns, C. Montgomery', 'Rev. Timothy Lovejoy', 'Ned Flanders',
 'Simpson, Homer', 'Dr. Julius Hibbert']

The actual pattern matched is this:

[A-Za-z.,]+(?: [A-Za-z.,]+)*

This matches one or more uppercase or lowercase letters, periods, or commas, followed by zero or more repetitions of a single space and one or more of those same characters.

In addition, we use the following lookarounds on the left and right of this pattern:

(?:(?<=^)|(?<=[^A-Za-z.,]))
Lookbehind: assert either the start of the string or a preceding non-matching character.
(?:(?=[^A-Za-z.,])|(?=$))
Lookahead: assert either the end of the string or a following non-matching character.
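As a quick check of what the lookarounds contribute, the sketch below runs the pattern with and without them on the question's string. The names `CORE`, `LEFT`, `RIGHT`, `with_lookarounds` and `core_only` are illustrative. On this particular input the two results coincide, because the greedy character classes already extend each match up to the nearest non-matching character; this is an observation about this input, not a general guarantee for variations of the pattern.

```python
import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

CORE = r"[A-Za-z.,]+(?: [A-Za-z.,]+)*"       # the pattern actually matched
LEFT = r"(?:(?<=^)|(?<=[^A-Za-z.,]))"        # lookbehind from the answer
RIGHT = r"(?:(?=[^A-Za-z.,])|(?=$))"         # lookahead from the answer

with_lookarounds = re.findall(LEFT + CORE + RIGHT, string)
core_only = re.findall(CORE, string)

print(with_lookarounds)
print(with_lookarounds == core_only)
```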

Answer 4

I extract entities such as names with spaCy in no time. With spaCy you can rely on pretrained language models, which carry extensive knowledge of common names and titles.

Step 1: set up spaCy and load the pretrained English language model

import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

Step 2: create a spaCy document

doc = nlp('555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert')

Step 3: get all entities in the document that are labelled as persons

print([(X.text, X.label_) for X in doc.ents if X.label_ == 'PERSON'])

4 answers

Answer 1: score 6, accepted, 2019-03-16 07:36:59
Answer 2: score 4, 2019-03-16 06:58:09
Answer 3: score 1, 2019-03-16 07:04:47
Answer 4: score -1, 2019-03-16 07:05:00
