
How to extract a person's name using a regular expression?

I am new to regular expressions and I have a kind of phone directory. I want to extract the names from it. I wrote the code below, but it extracts a lot of unwanted text rather than just the names. Can you tell me what I am doing wrong and how to correct it? Here is my code:

import re

directory = '''Mark Adamson
Home: 843-798-6698
(424) 345-7659
265-1864 ext. 4467
326-665-8657x2986
E-mail:madamson@sncn.net
Allison Andrews
Home: 612-321-0047
E-mail: AEA@anet.com
Cellular: 612-393-0029
Dustin Andrews'''


nameRegex = re.compile('''
(
[A-Za-z]{2,25}
\s
([A-Za-z]{2,25})+
)

''',re.VERBOSE)

print(nameRegex.findall(directory)) 

The output it gives is:

[('Mark Adamson', 'Adamson'), ('net\nAllison', 'Allison'), ('Andrews\nHome', 'Home'), ('com\nCellular', 'Cellular'), ('Dustin Andrews', 'Andrews')]

I would be really grateful for help!

Your problem is that \s also matches newlines, so a match can run from the end of one line into the start of the next. Instead of \s, use a literal space. That is:

name_regex = re.compile('[A-Za-z]{2,25} [A-Za-z]{2,25}')
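To make the difference concrete, here is a small sketch that runs both variants against the directory from the question. The variable names (with_whitespace, with_space) are only for illustration, and the capturing groups from the original pattern are dropped so that findall returns plain strings.

```python
import re

directory = '''Mark Adamson
Home: 843-798-6698
(424) 345-7659
265-1864 ext. 4467
326-665-8657x2986
E-mail:madamson@sncn.net
Allison Andrews
Home: 612-321-0047
E-mail: AEA@anet.com
Cellular: 612-393-0029
Dustin Andrews'''

# \s matches any whitespace, including the newline between two lines.
with_whitespace = re.compile(r'[A-Za-z]{2,25}\s[A-Za-z]{2,25}')

# A literal space only matches a space, so a match cannot cross a line break.
with_space = re.compile(r'[A-Za-z]{2,25} [A-Za-z]{2,25}')

loose = with_whitespace.findall(directory)
strict = with_space.findall(directory)

print(loose)
# ['Mark Adamson', 'net\nAllison', 'Andrews\nHome', 'com\nCellular', 'Dustin Andrews']
print(strict)
# ['Mark Adamson', 'Allison Andrews', 'Dustin Andrews']
```

The first list reproduces the unwanted matches from the question; the second contains only the three names.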

This works if the names have exactly two words. If the names have more than two words (middle names or hyphenated last names), you may want to expand it to something like:

name_regex = re.compile(r"^([A-Za-z \-]{2,25})+$", re.MULTILINE)

This looks for one or more words and stretches from the beginning to the end of a line (e.g. it will not just get 'John Paul' from 'John Paul Jones').
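As a rough illustration of that difference, the sketch below compares the two-word pattern with the line-anchored one. The sample text is made up for this example (it is not part of the question's directory), and finditer with group(0) is used so that the whole matched line is returned regardless of the capturing group.

```python
import re

# Hypothetical sample with a middle name and a hyphenated last name.
sample = '''John Paul Jones
Home: 555-0100
Mary Smith-Jones
E-mail: mary@example.com'''

two_words = re.compile(r'[A-Za-z]{2,25} [A-Za-z]{2,25}')
whole_line = re.compile(r'^([A-Za-z \-]{2,25})+$', re.MULTILINE)

short_names = two_words.findall(sample)
# group(0) is the full match, i.e. the entire line.
full_names = [m.group(0) for m in whole_line.finditer(sample)]

print(short_names)  # ['John Paul', 'Mary Smith']
print(full_names)   # ['John Paul Jones', 'Mary Smith-Jones']
```

Lines containing digits, colons or other punctuation are rejected by the anchored pattern because the character class only allows letters, spaces and hyphens.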

I suggest trying the following regular expression; it worked for me:

"([A-Z][a-z]+\s[A-Z][a-z]+)"

The following regex works as expected.

Related part of the code:

nameRegex = re.compile(r"^[a-zA-Z]+[',. -][a-zA-Z ]?[a-zA-Z]*$", re.MULTILINE)

print(nameRegex.findall(directory))

Output:

$ python3 test.py
['Mark Adamson', 'Allison Andrews', 'Dustin Andrews']

Try:

nameRegex = re.compile(r'^((?:\w+\s*){2,})$', flags=re.MULTILINE)

This will only choose complete lines that are made up of two or more names composed of 'word' characters.
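One caveat worth checking before relying on this pattern: \s* can match nothing at all or a newline, so a single word can be counted as two "names", and two adjacent name lines can be merged into one match. The sketch below shows both effects and one possible stricter variant; the stricter pattern and the sample strings are my own assumptions, not part of the answer above.

```python
import re

loose = re.compile(r'^((?:\w+\s*){2,})$', flags=re.MULTILINE)

# Assumed alternative: require at least one literal space between words,
# so a match has two or more words and never crosses a line break.
strict = re.compile(r'^\w+(?:[ ]+\w+)+$', flags=re.MULTILINE)

adjacent = 'Mark Adamson\nAllison Andrews'   # two name lines in a row
single = 'Madonna'                           # a one-word line

loose_adjacent = loose.findall(adjacent)
loose_single = loose.findall(single)
strict_adjacent = strict.findall(adjacent)
strict_single = strict.findall(single)

print(loose_adjacent)   # ['Mark Adamson\nAllison Andrews']
print(loose_single)     # ['Madonna']
print(strict_adjacent)  # ['Mark Adamson', 'Allison Andrews']
print(strict_single)    # []
```

With the directory from the question the original pattern still returns the three names, because every name line there is followed by a line containing punctuation or is the last line. Note also that \w accepts digits and underscores, so neither pattern is limited to letters.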
